Normal Distribution – easily explained!

The normal distribution, or Gaussian distribution, is the most important continuous probability distribution, since many quantities observed in practice are at least approximately normally distributed. Body height (within a gender), the 100m times of a swimmer in different races, but also something as specific as the weight of several coffee packets approximately follow the Gaussian distribution given a sufficiently large sample.

If we perform a random experiment, such as repeatedly measuring the times of a swimmer, we obtain a so-called density function. It describes how likely values close to a certain point are. For example, we might be interested in how likely it is that the swimmer completes the 100m in a time of around 1:15 min. We might also be interested in the probability that the athlete swims the 100m in at most 1:15 min. We can answer this question with the help of the distribution function, which indicates the probability that the result of the random experiment is less than or equal to a certain value.

What is the definition of the Normal Distribution?

A continuous random variable X with a density function f(x) of the form

 $f(x) = \frac{1}{\sigma \sqrt{2 \pi}} \cdot e^{-\frac{1}{2} \cdot \frac{(x - \mu)^2}{\sigma^2}}$

with the expected value µ and the variance σ² is called normally distributed (short: N(µ, σ²)). The expected value µ…

• … is a real number, so it can also become negative.
• … is the x-coordinate of the maximum of the density function.

The variance σ²…

• … is the square of the standard deviation σ.
• … must always be greater than 0.
• … determines how much the graph is stretched or compressed horizontally. Low variance means that the graph is narrow.

What is the Density function?

In connection with the normal distribution, the density function is usually shown as the well-known bell curve. Its value at a point x does not directly give a probability; rather, it describes how likely values close to x are, and probabilities for ranges of values correspond to areas under the curve.

The graph depicts the normal distribution of heights in centimeters measured in male subjects. The expected value µ = 180 indicates that most of the subjects were around 180cm tall. The variance σ² in this example is 7. The density at x = 176 is about 0.05, i.e. loosely speaking, a random male test subject is about 176cm tall with a probability of roughly 5%. Strictly, for a continuous variable the probability of any single exact value is zero; the density gives the probability per unit of height.
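The density value from the height example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the values µ = 180 and σ² = 7 are taken from the example above:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at point x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coeff * math.exp(exponent)

mu = 180              # expected value from the height example
sigma = math.sqrt(7)  # variance sigma^2 = 7, so sigma = sqrt(7)

print(normal_pdf(176, mu, sigma))  # ≈ 0.048, the "about 5%" from the text
print(normal_pdf(mu, mu, sigma))   # maximum of the bell curve, located at x = mu
```

Note that the maximum is at x = µ, matching the bullet point on the expected value above.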

What is the distribution function?

The distribution function F(x) of the normal distribution is defined by

 $F(x) = \frac{1}{\sigma \sqrt{2 \pi}} \cdot \int_{- \infty}^{x} e^{-\frac{1}{2} \cdot \frac{(t - \mu)^2}{\sigma^2}} \, dt$

It is thus the integral of the density function f(x) from −∞ to x. Accordingly, the distribution function indicates how high the probability is that the random variable takes on a value less than or equal to x:

 $F(x) = P(X \leq x)$

For x = 176, the distribution function yields a probability of about 6.5%. A random male person is thus shorter than or exactly 176cm tall with a probability of about 6.5%.
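The distribution function of the normal distribution has no closed-form antiderivative, but it can be expressed through the error function erf, which Python's standard library provides. A small sketch reproducing the height example:

```python
import math

def normal_cdf(x, mu, sigma):
    """Distribution function F(x) = P(X <= x) of N(mu, sigma^2), via erf."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Height example: mu = 180, variance sigma^2 = 7
p = normal_cdf(176, 180, math.sqrt(7))
print(p)  # ≈ 0.065, i.e. about a 6.5% chance of being at most 176cm tall
```

By symmetry, the distribution function evaluated at the expected value µ is exactly 0.5.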

What is the empirical rule of normal distribution?

The empirical rule, also known as the 68-95-99.7 rule, is a statistical guideline for the normal distribution. It states that:

• About 68% of the data falls within one standard deviation of the mean (µ ± σ).
• About 95% of the data falls within two standard deviations (µ ± 2σ).
• About 99.7% of the data falls within three standard deviations (µ ± 3σ).

This rule can be helpful in interpreting and understanding data that follows a normal distribution. For example, if we know that a data set is normally distributed and we calculate its mean and standard deviation, we can use the empirical rule to estimate the proportion of the data that falls within certain ranges.
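The three percentages of the empirical rule follow directly from the normal distribution function and can be verified numerically, for example with Python's math.erf:

```python
import math

def prob_within_k_sigma(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) for any normal distribution."""
    # By symmetry, F(mu + k*sigma) - F(mu - k*sigma) = erf(k / sqrt(2))
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(prob_within_k_sigma(k), 4))
# 1 -> 0.6827, 2 -> 0.9545, 3 -> 0.9973: the 68-95-99.7 rule
```

Because the probabilities depend only on the number of standard deviations k, the rule holds for every normal distribution regardless of µ and σ.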

It is important to note that the percentages of the empirical rule are exact only for a true normal distribution; for real data that is merely approximately normal, they serve as an approximation. Also, it only applies to continuous data that follow a normal distribution, and not to categorical or discrete data. Nevertheless, the empirical rule can be a useful tool for gaining insight into normally distributed data.

What are the alternatives to the normal distribution?

For many applications, the normal distribution is very suitable for modeling continuous random variables in statistics. In some situations, however, other distributions are more suitable. The best-known alternatives are:

• The binomial distribution is used to model situations with a binary result, for example heads or tails. For this purpose, the number of successes is measured for a fixed number of attempts.
• The Poisson distribution is used to model events that occur in a fixed time or space interval. It is important to note that two events cannot occur simultaneously and that the individual events are independent of each other.
• The exponential distribution is closely related to the Poisson distribution and models the time between two consecutive events. In practice, this can be, for example, the duration between two earthquakes or the time between two customers entering a store.
• The gamma distribution comprises various distributions, including the exponential distribution as a special case. These distributions are used to model the times between two events in general.
• The beta distribution is used to simulate a situation in which the probabilities lie within a limited range. For example, the proportion of voters who vote for a particular candidate.
• The Weibull distribution describes the time until a system fails. The failure rate increases or decreases over time, just as the probability of a tire blowout increases with more kilometers driven, for example.
• The uniform distribution is used to map random variables with constant density functions over a specific, finite range.
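Several of these alternatives are available directly in Python's standard random module, which makes it easy to experiment with them. The parameter values below are arbitrary choices for illustration only:

```python
import random

random.seed(1)  # reproducible draws for the illustration

samples = {
    "binomial-like (10 coin flips)": sum(random.random() < 0.5 for _ in range(10)),
    "exponential (time between events)": random.expovariate(1.0),
    "gamma": random.gammavariate(2.0, 1.0),
    "beta (bounded to [0, 1])": random.betavariate(2.0, 5.0),
    "weibull (time to failure)": random.weibullvariate(1.0, 1.5),
    "uniform": random.uniform(0.0, 1.0),
    "normal": random.gauss(0.0, 1.0),
}
for name, value in samples.items():
    print(f"{name}: {value:.4f}")
```

Each line draws a single value; in practice one would draw many values and compare histograms of the samples against the data at hand.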

Each application should use a suitable probability distribution based on the type of data and the particular research question. The results depend largely on the correct choice of distribution. Therefore, in the next chapter we will deal with the choice of the appropriate probability distribution.

How to choose the appropriate data distribution?

When working with data, it is crucial to choose the appropriate distribution for the given dataset. Selecting the wrong distribution can lead to incorrect assumptions about the data and affect the results of any analysis or modeling performed.

One approach to choosing the correct distribution is to examine the characteristics of the data. For instance, if the data has a single peak or mode, it may be appropriate to assume a normal distribution. Alternatively, if the data is positively skewed, it may be appropriate to assume a log-normal or gamma distribution. On the other hand, if the data is negatively skewed, it may be appropriate to assume an inverse gamma or Weibull distribution.
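A quick way to apply the skewness criterion is to compute the sample skewness: values near zero are consistent with a symmetric distribution such as the normal, while clearly positive values point towards right-skewed candidates such as the log-normal or gamma distribution. A sketch using simulated data (the distributions and seed are chosen only for illustration):

```python
import math
import random

def sample_skewness(data):
    """Moment-based sample skewness: m3 / m2^(3/2)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

random.seed(0)
symmetric = [random.gauss(0.0, 1.0) for _ in range(10_000)]
right_skewed = [random.expovariate(1.0) for _ in range(10_000)]

print(sample_skewness(symmetric))     # close to 0 (normal data is symmetric)
print(sample_skewness(right_skewed))  # clearly positive (exponential: theoretical skewness 2)
```

In practice, one would combine such a summary statistic with a histogram or Q-Q plot before settling on a distribution.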

Another approach is to use statistical tests to compare the fit of different distributions to the data. Some commonly used tests include the Kolmogorov-Smirnov test, the Anderson-Darling test, and the Chi-Squared test. These tests can help to determine which distribution provides the best fit for the data.
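To illustrate the idea behind the Kolmogorov-Smirnov test, its test statistic, the largest distance between the empirical and the theoretical distribution function, can be computed by hand. The sample below is simulated purely for this sketch; real applications would use the observed data:

```python
import math
import random

def normal_cdf(x, mu, sigma):
    """Distribution function of N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """Largest distance between the empirical and theoretical distribution function."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, abs((i + 1) / n - f), abs(i / n - f))
    return d

random.seed(42)
sample = [random.gauss(180, math.sqrt(7)) for _ in range(1000)]
d = ks_statistic(sample, lambda x: normal_cdf(x, 180, math.sqrt(7)))
print(d)  # small value: the sample is compatible with N(180, 7)
# A rough 5% critical value for n = 1000 is 1.36 / sqrt(1000) ≈ 0.043
```

If the statistic exceeds the critical value, the hypothesized distribution is rejected as a poor fit.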

It is also important to consider the context of the analysis or modeling. For example, if the data represents a count of discrete events, it may be appropriate to assume a Poisson or negative binomial distribution. If the data represents a proportion, a beta distribution may be more appropriate.

Ultimately, choosing the correct distribution for data requires careful consideration and understanding of the data and the context in which it will be used.

What are hypothesis tests and how do they use the normal distribution?

In statistics, it is often impractical to survey the entire group of objects under study, the so-called population, in order to draw conclusions. This is usually not possible due to time or monetary restrictions. Therefore, an attempt is made to form a sample that is as representative as possible, which can be used to determine whether a hypothesis that is observed in the sample also applies to the population.

As many natural phenomena, such as the height of people, are normally distributed, hypothesis tests often use a normal distribution to simulate the population. For hypothesis testing, a so-called null hypothesis is then formulated that needs to be tested, such as the assertion that a maximum of two percent of all parts in a production run are defective. In order to determine whether the null hypothesis is correct or not, the test statistic, which is dependent on the assumed distribution, is then compared with a critical value. This critical value depends on the so-called significance level, i.e. simply put, how certain you want to be about the result.

If the test statistic is greater than the critical value, the null hypothesis is rejected and cannot be transferred to the population. In our case, this would mean rejecting the claim that no more than two percent of the production parts are faulty, which is tantamount to concluding that more than two percent of the parts are faulty.
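The defect-rate example can be sketched as a one-sided z-test for a proportion, which relies on the normal approximation of the test statistic's sampling distribution. The sample size and defect count below are invented for illustration:

```python
import math

# Null hypothesis: at most 2% of all parts are defective (p0 = 0.02)
p0 = 0.02
n = 500        # hypothetical sample size
defects = 18   # hypothetical number of defective parts found in the sample

p_hat = defects / n
# z-statistic: standardized distance of the observed proportion from p0
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

critical = 1.645  # one-sided critical value at the 5% significance level
print(f"z = {z:.2f}, reject H0: {z > critical}")
```

Here the observed defect rate of 3.6% produces a test statistic above the critical value, so the null hypothesis would be rejected in favor of the conclusion that more than two percent of parts are faulty.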

The normal distribution is used in hypothesis tests if it is assumed that the data are normally distributed. The assumption that the characteristic under study is normally distributed in the population enables us to infer from the distribution in the sample to the distribution in the population. This allows, for example, the standard deviation of the population to be estimated and the probability of a particular event or range of values to be predicted.

However, it is important to note that not all phenomena are normally distributed. If the data is not normally distributed, we may need to use a different distribution in hypothesis testing. There are many different probability distributions, each with its own properties and applications. Choosing the right distribution for a particular data set requires careful consideration of the nature of the data and the hypothesis to be tested.

This is what you should take with you

• The normal distribution is a fundamental concept in statistics and probability theory.
• It is widely used to model various phenomena in the natural and social sciences.
• The empirical rule provides a useful guideline for understanding the distribution of data.
• While the normal distribution is a common and useful model, it is important to consider alternative distributions when appropriate.
• Choosing the correct distribution of data is essential for accurate statistical analysis.
• Hypothesis testing is a powerful tool that often relies on the normal distribution to make inferences about population parameters.
• Understanding the normal distribution and its properties is an important foundation for further study in statistics and data analysis.
