The median plays a central role in statistics and data analysis because it measures the central tendency of a data set. It is the value that sits at the center of the ordered data and divides it into two equal halves. Unlike the mean, it is largely unaffected by outliers and gives a reliable picture of the center even when the data is unevenly distributed. Because of this property, the median is used in many fields, such as medicine, the social sciences, and economics.
In this article, we explain the median in detail and walk through its calculation using various examples. We also compare the median with other statistical indicators such as the mean and the mode and show which applications use it. A complete picture also includes the advantages and disadvantages of this key figure, so that you can make an informed decision as to whether using the median is justified. Finally, we look at how to calculate the median in Python or Excel.
What is the Median?
The median is a statistical indicator that provides information about the central tendency of a data set. It is the value that lies exactly in the middle of an ordered data series, so that half of the elements are no larger than the median and the other half are no smaller. This property makes the median a robust choice, as it remains stable even with unevenly distributed data.
The median does not change if the specific values in the lower half change, as long as the number of values remains the same and none of them exceeds the median. The same applies to the upper half. This means that even extreme outliers do not influence the median.
The mean value, on the other hand, calculates the arithmetic average of the data set instead of finding the middle of the data set. This means that data points with extremely large or small values have a strong influence on the average. The median is used in a wide variety of applications where outliers may occur to obtain a more realistic assessment of the central situation. In income surveys, for example, the average and the median sometimes differ very significantly.
How is the Median calculated?
The median is the middle value in a data series and, depending on the size of the data series, can be a value from the set or a value that does not occur directly in the set. In the first step of the calculation, the data series is sorted in ascending order of size. Once this has been done, the further calculation depends on whether the number of elements is even or odd.
Data sets with an odd number of elements
With an odd number of elements in the data series, the calculation is straightforward: take the value from the sorted data series that lies exactly in the middle. Because the number of elements is odd, such a value always exists.
In the data series [3,7,9], for example, the 7 lies exactly in the middle, as there is exactly one value that is larger and exactly one smaller value. The median of this data series is therefore 7. The data series [3,7,9,13,18] has five elements and therefore also an odd size. In this case, the 9 lies exactly in the middle of the data series and the median is therefore 9.
Data sets with an even number of elements
With an even number of elements, the approach just described cannot be used because no data point lies exactly in the middle of the data series. In the data series [3,7,9,11], for example, this center point does not exist. Therefore, you have to use the two middle values, in this case, 7 and 9, and calculate the average, i.e. (7+9)/2 = 8. The median in this data series is therefore 8.
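The two cases described above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for library functions; the function name `median` and the error handling for empty input are assumptions made for this example.

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("median requires at least one value")

    # Step 1: sort the data series in ascending order.
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2

    # Step 2a: odd number of elements -> take the single middle value.
    if n % 2 == 1:
        return ordered[mid]

    # Step 2b: even number of elements -> average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median([3, 7, 9, 13, 18])
even_result = median([3, 7, 9, 11])

print(odd_result)   # 9
print(even_result)  # 8.0
```

Sorting inside the function means the input does not have to be ordered beforehand.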
This procedure makes it clear why the median is robust against outliers, as the values outside the middle are not relevant for the calculation as long as they do not change the order of the individual elements. For example, if we use the series [3,7,9], 7 remains the median, even if the value of the other two elements changes significantly. This means that 7 is also the median of the data series [1,7,100] and it is also the median of the data series [0.001,7,10000].
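The three data series from this paragraph can be checked with the `statistics` module from the Python standard library. The sketch below only reuses the numbers given above.

```python
from statistics import median

# The outer values change drastically, the middle value stays in place.
series = [
    [3, 7, 9],
    [1, 7, 100],
    [0.001, 7, 10000],
]

medians = [median(s) for s in series]
print(medians)  # [7, 7, 7]
```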
What are Median, Mode, and Mean and how do they differ?
The median, mode, and mean are different measures that each describe a different aspect of the “middle” of a data set. Depending on the type and characteristics of the data, each of these measures has its advantages and disadvantages, which we examine in more detail in this section. A central aspect in selecting the appropriate measure is the distribution of the data, including, for example, whether outliers are present.
Mean Value
The mean value is calculated by adding up all the data points in a data set and then dividing the sum by the number of data points. It thus serves as a kind of balance point for the data set. However, this calculation also makes the mean value susceptible to outliers, i.e. particularly high or low values, and to uneven distributions. In these cases, the mean value may not provide an accurate picture of the data set.
Since the mean value includes all values with equal weight in the calculation, a single extremely high or low value can strongly influence the average. Such a value can therefore result in the mean no longer representing the typical center of the data set. This makes the mean value a poor choice for asymmetrical or skewed data.
- Example: Suppose we have a data set on the distribution of income in a small group. The following incomes are recorded: [30,000, 32,000, 35,000, 500,000]. Looking at the data, we would intuitively assume that an “average person” in this data set earns between 32,000 and 35,000. Due to the very strong outlier, however, the calculated average is (30,000 + 32,000 + 35,000 + 500,000) / 4 = 149,250. This value is much higher than the typical incomes we observe in the data set, which can be attributed to the highest income of 500,000.
Due to these properties, the mean value is particularly suitable for normally distributed data without extreme outliers, such as that often found in chemistry, physics, or financial analysis, where the data is frequently spread roughly symmetrically around the center.
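The effect of the outlier on the mean can be reproduced with a short sketch. It uses the income values from the example; the comparison without the highest income is an addition made here purely for illustration.

```python
from statistics import mean

incomes = [30_000, 32_000, 35_000, 500_000]

# Mean over all four incomes, including the outlier.
mean_all = mean(incomes)

# Mean over the three typical incomes only (outlier dropped for comparison).
mean_without_outlier = mean(incomes[:-1])

print(f"{mean_all:,.2f}")              # 149,250.00
print(f"{mean_without_outlier:,.2f}")  # 32,333.33
```

A single value moves the mean from roughly 32,000 to almost 150,000.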
Median
The median is the middle value of a data series and divides the data set so that half of the data lies above the median and the other half below. This property makes the median significantly less sensitive to outliers or strongly deviating data points, as only the position of the extreme data points plays a role for it and not their absolute level.
For the income example above, this means that the median must lie exactly between 32,000 and 35,000, as this is where the middle of the data set lies. This results in a median of (32,000 + 35,000) / 2 = 33,500. This value provides a more realistic statement about the middle of the data distribution compared to the mean value, as it is not distorted by the outlier income.
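As a quick check, the median of the income example can be computed with Python's standard library. The second list, in which the outlier is ten times larger, is a hypothetical variation added to show that the median does not react to the size of the extreme value.

```python
from statistics import mean, median

incomes = [30_000, 32_000, 35_000, 500_000]
# Hypothetical variation: the highest income is ten times larger.
larger_outlier = [30_000, 32_000, 35_000, 5_000_000]

median_income = median(incomes)
median_larger = median(larger_outlier)

print(median_income)         # 33500.0
print(median_larger)         # 33500.0
print(mean(larger_outlier))  # 1274250
```

The mean follows the outlier upwards, while the median stays at 33,500.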
Due to this property, the median is primarily used in applications in which the data is often skewed and outliers play a major role, such as income statistics, real estate prices, or medical data.
Mode
The mode differs from the two indicators presented so far and describes the most frequent value in a data series. It is therefore a measure of central tendency that deals exclusively with the frequency of values. This property means that the mode is the only one that can also be used for nominal data, i.e. data that is not numerical.
The mode is often used in surveys and market research, as it makes it possible to identify the most frequent answers or the most popular products. However, this also highlights the main problem with the mode: it is ambiguous when two or more values occur equally often. For example, if both “Audi” and “BMW” are mentioned equally frequently in a survey of the most popular car brands, then the data series is said to be bimodal, as exactly two categories occur most frequently. In a multimodal data series, there are more than two values that were mentioned most frequently in the survey.
Compared to the mean and the mode, the median offers a balanced way of numerically expressing the central tendency of a data set. It is particularly strong in the case of skewed data with outliers, as it is not influenced by them as long as the middle of the ordered data set remains the same. The mean, on the other hand, is more suitable for normally distributed data without the risk of outliers, and the mode can also be used for nominal data sets where the median cannot be used.
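A bimodal survey result can be illustrated with `statistics.multimode`, which is available from Python 3.8 and returns every value that shares the highest frequency. The answers below are invented for this sketch and are not taken from a real survey.

```python
from statistics import multimode

# Hypothetical survey answers; "Audi" and "BMW" are each named twice.
answers = ["Audi", "BMW", "Audi", "VW", "BMW"]

modes = multimode(answers)
print(modes)  # ['Audi', 'BMW']

# Two modes mean the data series is bimodal.
is_bimodal = len(modes) == 2
print(is_bimodal)  # True
```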
What are the Advantages and Disadvantages of the Median?
The median is a commonly used measure to determine the central tendency of a data set. In this article, we have already mentioned some of the advantages and disadvantages of using this key figure. In this section, we want to summarize the points again and also include a few new aspects.
Advantages
- Robustness against outliers: As already explained several times, the main advantage of the median is that it changes little or not at all even in the case of extreme values and outliers, and, in contrast to the mean value, is significantly more robust against these phenomena. As it relates solely to the position of the data, the level of the data and its differences only play a subordinate role.
- Better representation with skewed distributions: In the case of skewed distributions, such as income statistics or real estate prices, the median often presents a more realistic picture than the mean. Such data is often skewed to the right, which pulls the mean upwards and makes the median a better description of the center of the data.
- Easy interpretation: The median can be interpreted quickly and easily, even with large data sets, and can therefore be easily understood by a non-specialist audience. The simple calculation method makes it much less complicated, which results in greater acceptance by the audience in reports or presentations, for example.
- Use with ordinal data: Finally, unlike the mean, the median can also be used with ordinal data such as rating scales or rankings, which makes it more flexible. In market research, for example, this is a further advantage, because the data only needs to be orderable and does not have to be interval-scaled.
Disadvantages
- Loss of detail: Since the median is based only on the position of the data in the ordered series and ignores the distances between the data points, it discards information about the spread and shape of the distribution. Where these distances play an important role, such as with measured values in scientific experiments, this can rule the median out.
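The use with ordinal data from the last point can be sketched as follows. The rating scale and the responses are hypothetical, and `statistics.median_low` is used as one possible choice because it always returns a value that actually occurs in the data, which avoids averaging two categories.

```python
from statistics import median_low

# Hypothetical ordinal scale, ordered from worst to best.
scale = ["very poor", "poor", "neutral", "good", "very good"]
# Hypothetical survey responses on that scale.
responses = ["good", "poor", "very good", "good", "neutral", "good"]

# Translate each response into its rank on the scale.
ranks = [scale.index(response) for response in responses]

# median_low picks the lower of the two middle ranks for even-sized data.
median_label = scale[median_low(ranks)]
print(median_label)  # good
```

A mean of these ranks would assume equal distances between the categories, which an ordinal scale does not guarantee.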
- Sensitivity to data changes in even data sets: For even data sets, this metric is calculated as the average of the two middle data points. As a result, it can be susceptible to data changes in such a scenario, especially if the sample is particularly small.
- Limited significance with normally distributed data: With normally distributed or symmetrical data, the median hardly differs from the mean, as outliers only play a minor role. As both values are then almost identical, the median offers no advantages and may even provide less information about the distribution itself. The mean, on the other hand, together with the standard deviation, can already describe important properties of the distribution.
- Lack of applicability in statistical tests: Many statistical tests, such as the t-test or the analysis of variance, rely on the mean as the measure of central tendency. The median cannot simply be substituted in these tests; comparing medians usually requires different, typically non-parametric, procedures.
Just like other statistical indicators, the median has advantages and disadvantages that should be weighed up before use. The main advantages are that it does not react as strongly to outliers as the mean and can therefore provide a better picture of the central tendency. However, with normally distributed data, the median hardly differs from the mean, and information on the distances between the data points is lost.
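The sensitivity in small, even-sized data sets mentioned in the list above can be made concrete with a short sketch. The numbers are invented for illustration.

```python
from statistics import median

# Hypothetical sample with an even number of elements.
sample = [10, 20, 30, 40]
# Same sample, but one of the two middle values has changed from 30 to 22.
changed = [10, 20, 22, 40]

median_before = median(sample)
median_after = median(changed)

print(median_before)  # 25.0
print(median_after)   # 21.0
```

Because the median here is the average of the two middle values, a change in either of them shifts the result directly.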
How can you calculate the median in Python and Excel?
Calculating the median is an important step in many statistical analyses and can be done quickly and easily using Python or Excel. In this section, we take a closer look at the calculation using simple examples.
Calculation in Excel
In Excel, the median can be calculated easily using the “MEDIAN” function. For example, you can collect the incomes in a data set in a column and apply the function in a new cell by defining the range in which the figures are stored.
In our case, we have stored ten incomes in column A under the heading “Incomes”. Accordingly, the numerical values are stored in cells A2 to A11.
The median can now be calculated in a new cell by typing an equals sign followed by the MEDIAN function.
In the round brackets, we define the cell range in which the incomes are stored, i.e. =MEDIAN(A2:A11), and receive the final result after pressing the ENTER key.
With the help of these simple steps, the median of a data series can be calculated quickly in Excel.
Calculation in Python
In Python, you can use various libraries to calculate the median, such as NumPy or Pandas. In NumPy, you store the data in a list and then use the np.median function, to which the list is then passed.
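A minimal sketch of the NumPy approach, assuming NumPy is installed and reusing the income values from the example earlier in this article:

```python
import numpy as np

# Income values from the example above, stored in a plain Python list.
incomes = [30_000, 32_000, 35_000, 500_000]

# np.median accepts the list directly and returns the median as a float.
median_income = np.median(incomes)
print(median_income)  # 33500.0
```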
The use of NumPy can be particularly efficient with large amounts of data, as the calculation has been specially optimized. However, if the data is stored in a DataFrame, the use of Pandas is the obvious choice. This library already provides the .median() function, which can be applied directly to a DataFrame column.
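The Pandas variant could look like this, assuming Pandas is installed. The column name `income` is chosen for this sketch, and the values are again the incomes from the example earlier in this article.

```python
import pandas as pd

# Hypothetical DataFrame with a single column named "income".
df = pd.DataFrame({"income": [30_000, 32_000, 35_000, 500_000]})

# .median() is called directly on the selected column.
median_income = df["income"].median()
print(median_income)  # 33500.0
```

By default, `.median()` skips missing values, which is worth keeping in mind for incomplete data.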
With the help of these tools, the median can be calculated quickly and easily.
Key takeaways
- The median is a basic key figure that provides a statement about the central tendency of the data set.
- It determines a value that is exactly in the middle of the data series and is therefore not so strongly influenced by outliers compared to the mean value.
- In addition to the median, the mean or the mode can also be used to determine the central tendency of a data set.
- The median also has disadvantages, such as the fact that information about the distances between the data points is lost.
- Various computer programs can be used to calculate the median, such as the Python programming language or Excel.
Other Articles on the Topic of Median
This link will get you to my Deepnote App where you can find all the code that I used in this article and can run it yourself.
Niklas Lang
I have been working as a machine learning engineer and software developer since 2020 and am passionate about the world of data, algorithms and software development. In addition to my work in the field, I teach at several German universities, including the IU International University of Applied Sciences and the Baden-Württemberg Cooperative State University, in the fields of data science, mathematics and business analytics.
My goal is to present complex topics such as statistics and machine learning in a way that makes them not only understandable, but also exciting and tangible. I combine practical experience from industry with sound theoretical foundations to prepare my students in the best possible way for the challenges of the data world.