The standard deviation is a so-called measure of dispersion, which describes how far the data points in a data set lie from the mean. In practice, the Greek letter σ (sigma) is used as its symbol.
What is the Standard Deviation?
In statistics, there are various characteristic values that describe a data set or a distribution of values more precisely. Often, for example, the expected value is used for this purpose, which for a probability distribution is the probability-weighted average of all possible values.
\[E(X) = x_1 \cdot P(X = x_1) + x_2 \cdot P(X = x_2) + \dots + x_n \cdot P(X = x_n)\]
However, the value alone is not sufficient to provide detailed information about a data set. Suppose we want to compare two school classes that have written the same exam and, after grading, have achieved the same grade point average, i.e., the same expected value, of 2.5. Would we now assume that the students of both classes have approximately the same knowledge?
Probably only if the students in the two classes have achieved similar grades. In class A, however, the average of 2.5 comes about because some strong students wrote a 1.0, while other, weaker students only achieved a 4.0 on the exam. In class B, on the other hand, the students were much closer together, mostly scoring 2s and 3s, with no outliers in either direction.
In statistics, such characteristic values are called measures of dispersion. They describe how far the individual values, in our case the students' grades, lie from the expected value, i.e. the grade point average. Two data sets can have the same expected value but very different measures of dispersion.
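The two-class scenario can be sketched with Python's standard `statistics` module; the grade lists below are hypothetical values chosen so that both classes share the average of 2.5:

```python
from statistics import mean, pstdev

# Hypothetical grades on the German scale (1.0 = best, 4.0 = worst)
class_a = [1.0, 1.0, 4.0, 4.0]   # strong and weak outliers
class_b = [2.0, 2.3, 2.7, 3.0]   # grades clustered around the average

# Both classes have the same mean, but very different dispersion
print(mean(class_a), pstdev(class_a))  # mean 2.5, large spread
print(mean(class_b), pstdev(class_b))  # mean 2.5, small spread
```

Both calls report a mean of 2.5, but the standard deviation of class A (1.5) is roughly four times that of class B (about 0.38).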
What is variance and how do you calculate it?
The variance is a measure of dispersion from statistics. It is calculated by taking each data point's deviation from the expected value, squaring it, and weighting these squared deviations by their probabilities. By squaring, positive and negative deviations from the mean both contribute and cannot cancel each other out. In addition, squaring makes large deviations weigh much more heavily than small ones.
\[\sigma^2 = \sum_{i=1}^{n}(x_{i} - E(X))^2 \cdot p_{i}\]
If you have been paying attention up to this point, you will notice that the variance does not have its own symbol or Greek letter, but is denoted by σ². As we said before, σ stands for the standard deviation. Thus, the variance is the squared standard deviation.
How to calculate the standard deviation?
Now that we already know the relationship between variance and standard deviation, the associated formula is fairly simple to set up, since it is simply the root of the variance:
\[\sigma = \sqrt{\sigma^2} = \sqrt{\sum_{i=1}^{n}(x_{i} - E(X))^2 \cdot p_{i}}\]
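The two formulas translate directly into Python; the fair six-sided die used here is just an assumed example distribution:

```python
import math

# Assumed example distribution: a fair six-sided die
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

# Expected value: probability-weighted sum of the outcomes
ev = sum(x * p for x, p in zip(values, probs))

# Variance: probability-weighted sum of squared deviations from E(X)
variance = sum((x - ev) ** 2 * p for x, p in zip(values, probs))

# Standard deviation: square root of the variance
std_dev = math.sqrt(variance)

print(ev, variance, std_dev)  # 3.5, about 2.917, about 1.708
```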
How to interpret the value?
As we have already explained, with the variance it makes perfect sense to square the difference between a data point and the expected value. However, this also makes the variance much harder to interpret, since the squared value is no longer in the original unit of the data.
With the standard deviation, on the other hand, it's different, because taking the square root brings us back to the original unit. For our exam example, a standard deviation of 1.2 would mean that the grades deviate from the grade point average of 2.5 by 1.2 grade points on average. Since the direction of the deviation is not specified, this opens up an interval of 1.3 to 3.7.
Thus, a lower standard deviation generally means that the data points lie relatively close to the expected value and deviate from it only slightly.
When do you use the standard deviation for the population and when for the sample?
In some literature, two different standard deviations are distinguished, namely for the population, which is then described by σ, and that for the sample, which is denoted by s. The two terms differ in the underlying quantity studied:
- The sample contains a subset of all the units (e.g. members of a society) from which data are collected in a study. These data can then be used for statistical analysis.
- The population is the summary of all units of investigation. For this group, one wants to be able to make statements with the help of statistical analysis.
In statistics, it is often not possible or simply not practicable to survey the entire population. Therefore, an attempt is made to find a sample that is as representative as possible and that allows generalization to the population.
In the formula for the standard deviation, the two variants differ only in the denominator: for the population, the sum of squared deviations is divided by the number of data points n, while for the sample standard deviation it is divided by n − 1 (the so-called Bessel's correction).
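Python's `statistics` module offers both variants, which makes the effect of the two denominators easy to see; the data below are hypothetical grades:

```python
import statistics

# Hypothetical sample of exam grades
data = [2.0, 3.0, 1.7, 2.7, 3.3, 2.3]

# Population standard deviation: squared deviations divided by n
sigma = statistics.pstdev(data)

# Sample standard deviation: squared deviations divided by n - 1
s = statistics.stdev(data)

print(sigma, s)  # the sample version is slightly larger for the same data
```

Because n − 1 is smaller than n, the sample standard deviation always comes out slightly larger for the same data, which compensates for the fact that a sample tends to underestimate the spread of the population.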
What is the empirical rule of normal distribution?
The empirical rule helps to interpret the normal distribution and is also known as the 68-95-99.7 rule. It states that:
- Within one standard deviation of the mean, 68% of the data points can be found in a normally distributed data set.
- Within two standard deviations of the mean, about 95% of the data points can be found in a normally distributed data set.
- Within three standard deviations of the mean, about 99.7% of the data points can be found in a normally distributed data set.
This rule is an important aid when interpreting data sets that follow a normal distribution or are assumed to follow a normal distribution. If the mean and standard deviation are then known, important conclusions can be drawn from the data set that are sufficiently accurate.
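The rule can be checked with a simulated sample; the mean of 2.5 and standard deviation of 1.2 are simply the assumed values from the exam example:

```python
import random

random.seed(42)

# Simulate a normally distributed data set (assumed mean 2.5, std dev 1.2)
mu, sd = 2.5, 1.2
data = [random.gauss(mu, sd) for _ in range(100_000)]

# Share of data points within 1, 2, and 3 standard deviations of the mean
shares = {}
for k in (1, 2, 3):
    inside = sum(mu - k * sd <= x <= mu + k * sd for x in data)
    shares[k] = inside / len(data)
    print(f"within {k} sigma: {shares[k]:.3f}")  # close to 0.68 / 0.95 / 0.997
```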
However, the empirical rule also has its pitfalls, as it requires a normal distribution and cannot be used otherwise. In particular, it cannot be applied to categorical data or to discrete data that are not normally distributed, even though such data occur in many applications. Despite these limitations, the empirical rule is an important pillar of statistics today.
What is often misunderstood about the standard deviation?
When working with the standard deviation, many people are often misled by the pure value and make incorrect predictions based on this key figure. The following misunderstandings are common:
- Accuracy: The standard deviation says nothing about the accuracy or precision of the data. A small deviation does not indicate accurate data any more than a large value indicates incorrect or unreliable data.
- This parameter can not only be used in the evaluation of normally distributed data, but can also be applied to other data distributions.
- In addition, there are often misunderstandings about the standard deviation of the population and that of the sample. It is important to know when each figure should be used and, above all, that they differ.
Such misunderstandings can occur more quickly than expected. It is therefore important to observe the following rules in order to avoid these misinterpretations:
- A low standard deviation only indicates that the data points lie very close to the mean; it does not allow any conclusions about the accuracy of the data. Body height, for example, has a low standard deviation, but this does not mean that every data set on body height is sufficiently accurate.
- Before using the standard deviation, the underlying distribution of the data should be checked to ensure that the metric is being used correctly. Furthermore, just because there is no normal distribution does not automatically mean that the standard deviation should be ignored.
- When working with a sample, the sample standard deviation should also be used to estimate the corresponding population measure.
If these misconceptions are kept in mind when working with data sets, many simple errors in the interpretation of data can be easily avoided and it is ensured that the standard deviation has been used correctly. Conclusions based on this can provide important insights into the population.
How are standard deviation, hypothesis tests and confidence intervals related?
The standard deviation is used, for example, in so-called hypothesis tests. This statistical method uses a data set to check whether a hypothesis is statistically significant so that findings from the data set can be transferred to a population. This is where the so-called confidence intervals come into play, which are used to find the value range in which the population parameter is likely to lie.
Various key figures are used in hypothesis testing, which can be calculated using the standard deviation, among other things. One of them is the so-called test statistic, which is the difference between the sample mean and the hypothesized population mean, divided by the standard error of the mean. The standard error, in turn, is the standard deviation of the sampling distribution of the mean. This test statistic is then compared with a critical value to decide whether the null hypothesis can be rejected.
For this purpose, the so-called confidence intervals are formed, which are calculated using the sample mean and the standard error. The confidence interval comprises all values that lie around the sample mean plus or minus a margin of error. This margin is obtained by multiplying the standard error by a critical value based on the desired confidence level. The standard deviation is therefore decisive for the width of the confidence interval.
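A minimal sketch of such a confidence interval, assuming a normal approximation with the usual critical value of 1.96 for a 95% confidence level and a small hypothetical sample of grades:

```python
import math
import statistics

# Hypothetical sample of exam grades
sample = [2.1, 2.4, 2.6, 2.9, 2.2, 2.8, 2.5, 2.5]

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)   # sample standard deviation

# Standard error of the mean: sample standard deviation / sqrt(n)
se = s / math.sqrt(n)

# Margin of error: standard error times the critical value (1.96 for 95%)
z = 1.96
lower, upper = x_bar - z * se, x_bar + z * se
print(f"95% confidence interval: [{lower:.2f}, {upper:.2f}]")
```

For small samples, the critical value would normally come from the t-distribution rather than the fixed 1.96 used here for simplicity.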
Using this method, analysts can provide reliable statements based on data sets that provide an insight into the overall population and its correlations. The standard deviation is crucial in order to gain a better understanding of the variability of the data and thus make more accurate predictions and conclusions.
This is what you should take with you
- The standard deviation is a so-called measure of the dispersion from statistics.
- It provides information about how far away the individual data points are from the expected value on average. A low standard deviation indicates that the data points are relatively close to the expected value and vice versa.
- The standard deviation is closely related to the variance, as it is simply the square root of the variance.
Other Articles on the Topic of Standard Deviation
Statista offers a detailed article on the topic.