Population and Sample - simply explained!

A sample consists of individual elements drawn from the set of all objects under study (e.g. a society) from which data are collected. These data can then be used for statistical analysis.

The population is the entirety of all units under investigation. The aim of statistical analysis is to be able to make statements about this whole group.

Populations and samples are used to conduct scientific experiments and to determine whether there is a statistical relationship between several variables (correlation and causation).

Figure: The image shows several groups of people. The largest is the entire population and the smaller one is the sample. Population and Sample | Source: Author

A brief example: On the evening of the Bundestag election (the election of the German parliament), the first projection is shown punctually at 6 p.m. Since the polling stations do not close until this time, only a fraction of all votes cast can be taken into account at that point: the sample. The purpose of the projection is to make an accurate statistical statement about what the result will be for all votes cast, the population. As the evening progresses and more ballots are counted, the projection approaches the actual election result and reflects reality more and more accurately.
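The election example can be made tangible with a small simulation. The following is only a sketch: the party names and vote shares are made up, and it assumes that ballots are counted in random order, which real counts are not, since districts report at different times.

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

# Hypothetical population: 1,000,000 ballots with made-up vote shares.
true_shares = {"Party A": 0.40, "Party B": 0.35, "Party C": 0.25}
parties = list(true_shares)
ballots = random.choices(parties, weights=list(true_shares.values()), k=1_000_000)


def projected_share(counted, party):
    """Share of `party` among the ballots counted so far (the sample)."""
    return counted.count(party) / len(counted)


# Result for the full population of ballots.
actual = projected_share(ballots, "Party A")

# Each projection is based on a growing prefix of the randomly ordered ballots.
for n_counted in (100, 10_000, 1_000_000):
    estimate = projected_share(ballots[:n_counted], "Party A")
    print(f"{n_counted:>9} ballots counted: {estimate:.3f} (final result: {actual:.3f})")
```

The more ballots are included, the closer the estimate tends to lie to the final result; once all ballots are counted, sample and population coincide.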

What are the types of the population?

In statistics, a distinction is made between three types of population, based on the number of elements and on whether the population can actually be counted.

Finite population: The finite population comprises a finite number of members, which can therefore be counted with a whole number. Examples of a finite population are the workforce of a company or the total number of households in an area or country. The majority of the populations studied are finite.
Infinite population: The infinite population, on the other hand, contains an infinite number of members, or so many that it is treated as infinite. It is therefore not possible to examine the entire population. This group includes, for example, all possible coin tosses or the bacteria in a certain environment, neither of which can be enumerated in practice.
Theoretical population: The theoretical population comprises a group of people, animals or objects that are considered for a statistical study and are theoretically finite, but whose members simply cannot all be identified. An example is an attempt to make a statement about all people who have ever lived on planet Earth. In the same way, the group of all people with a certain genetic characteristic is a theoretical population, as it is simply not possible to genetically examine every human being.

Knowledge of these types of populations is essential for the selection of a suitable sampling method and, above all, for correct statistical conclusions. Without this knowledge, incorrect generalizations can be made.

Population vs. Sample Examples

Research Question	Population	Sample
How much money does a German citizen spend on food per month?	All German citizens (over 18 years)	10,000 randomly encountered supermarket visitors
How old is the average student at the University of Stuttgart?	All students enrolled at the University of Stuttgart	Survey of students visiting Stuttgart University Library on a Saturday
How long is a song on the streaming platform Spotify?	All songs uploaded to the platform at the time, excluding podcasts	100,000 randomly selected songs available in Germany

Practical examples for population and sample

4 Reasons for using samples instead of the population

Practicability: It is easier and more feasible to collect data only from the sample, rather than the entire population.
Resource efficiency: The study saves costs for the survey, for example, through less time spent by the researchers or lower logistical costs, such as travel costs.
Necessity: Depending on the research question, it may also be nearly impossible to study the entire population. For example, the U.S. only conducts a complete census every 10 years. Because residents of the United States are not required to register their place of residence, a full count is such a large undertaking that it is only carried out once a decade.
Simpler data management: Due to the smaller number of people surveyed, less data is generated overall. Thus, there are lower costs for storing and processing the data. In addition, the calculations can also be performed much more quickly and easily.

Sampling Methods

To obtain a sample of a population, two types of sampling are distinguished:

Probability sampling is characterized by the fact that each element of the population has a known, non-zero chance of being part of the sample; in simple random sampling, this chance is the same for every element. For a population of 100 people from which 10 are drawn, for example, this means that each person has a 10 in 100 (= 10%) chance of becoming part of the sample. These methods are usually costly and time-consuming.
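A minimal sketch of simple random sampling, the most basic probability method, using Python's standard library. The population of 100 numbered people and the sample size of 10 are assumptions chosen for illustration. With N people and a sample of size n drawn without replacement, each person's chance of ending up in the sample is n / N.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

population = list(range(100))  # 100 hypothetical people, labelled 0..99
sample_size = 10

# random.sample draws without replacement; every person is equally likely to be picked.
sample = random.sample(population, sample_size)

# Inclusion probability of each person under simple random sampling: n / N.
inclusion_probability = sample_size / len(population)
print(sample, inclusion_probability)

# Empirical check: how often does person 0 end up in the sample over many draws?
trials = 20_000
hits = sum(0 in random.sample(population, sample_size) for _ in range(trials))
print(hits / trials)
```

The empirical frequency in the last line should lie close to the theoretical inclusion probability.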

Non-probability sampling is the opposite. Here, the selection is not random, so some elements of the population have no chance, or an unknown chance, of becoming part of the study. An example of this would be if the University of Stuttgart wanted to make statements about all students in Germany, but only surveyed students from its own university. This saves the research team the time and expense of interviewing and studying students outside of Stuttgart, but the results may not carry over to the whole population.
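The effect of the Stuttgart example can be sketched with made-up numbers. The three universities, their average ages and the spread of 2 years are purely hypothetical; the point is only to compare a sample drawn from the whole population with one drawn from a single university.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

# Hypothetical population: student ages at three universities with different age profiles.
mean_age_by_university = {"Stuttgart": 22.0, "University B": 24.0, "University C": 26.0}
students = [
    (university, random.gauss(mean_age, 2.0))
    for university, mean_age in mean_age_by_university.items()
    for _ in range(10_000)
]

population_mean = statistics.mean(age for _, age in students)

# Probability sample: every student in the population can be drawn.
random_sample = random.sample(students, 500)
random_mean = statistics.mean(age for _, age in random_sample)

# Non-probability sample: only students from Stuttgart can be drawn.
stuttgart_only = [student for student in students if student[0] == "Stuttgart"]
convenience_sample = random.sample(stuttgart_only, 500)
convenience_mean = statistics.mean(age for _, age in convenience_sample)

print(f"population mean age:      {population_mean:.2f}")
print(f"random sample mean:       {random_mean:.2f}")
print(f"Stuttgart-only mean:      {convenience_mean:.2f}")
```

Under these assumptions, the Stuttgart-only sample misses the population mean by a wide margin, while the random sample lands close to it, even though both samples have the same size.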

In addition to this very general subdivision, more detailed sampling methods can also be found:

Stratified random sampling: Here, the population is divided into subgroups that are formed according to certain characteristics, such as age or gender. A random sample is then drawn from each of these subgroups, the size of which depends on the share of the subgroup in the population. This procedure helps ensure that the overall sample is representative of the population with respect to these characteristics.
Cluster sampling: The population is divided into clusters. These can be regional, for example, such as cities or districts. A random selection of clusters is then drawn, and all members (or a random subset) of the selected clusters are surveyed. This method can be more efficient than a random selection from the entire population if the clusters resemble one another, so that each one is a small-scale image of the population. If the clusters differ strongly from one another, on the other hand, the results are less precise than those of a random sample of the entire population.
Systematic random sampling: In this method, the members of the population are sorted according to a specific characteristic, and then every nth member is included in the sample. With a large population and a simple sorting characteristic, this can be a more efficient way to draw a random sample.
Convenience sampling: This method is used to create quick and inexpensive samples. It involves selecting people who are readily available or easy to reach. A survey of visitors to a weekly market is an example of a convenience sample. Because the selection is not random, this method can lead to serious distortions if the people reached are not representative of the population.
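Two of the methods above, stratified sampling with proportional allocation and systematic sampling, can be sketched in a few lines. The population of 1,000 people, the three age groups and the helper names `stratified_sample` and `systematic_sample` are assumptions made for this illustration, not part of any library.

```python
import random
from collections import Counter

random.seed(7)  # fixed seed so the run is reproducible

# Hypothetical population of 1,000 people, each tagged with an age group (the stratum).
population = (
    [("under 30", i) for i in range(200)]
    + [("30 to 60", i) for i in range(200, 700)]
    + [("over 60", i) for i in range(700, 1000)]
)


def stratified_sample(units, total_size):
    """Proportional allocation: each stratum contributes according to its population share."""
    strata = {}
    for stratum, unit_id in units:
        strata.setdefault(stratum, []).append((stratum, unit_id))
    sample = []
    for members in strata.values():
        # Rounding can make the total deviate slightly from total_size in general.
        n_stratum = round(total_size * len(members) / len(units))
        sample.extend(random.sample(members, n_stratum))
    return sample


def systematic_sample(units, step):
    """Pick a random start within the first `step` units, then take every `step`-th unit."""
    start = random.randrange(step)
    return units[start::step]


stratified = stratified_sample(population, 100)
print(Counter(stratum for stratum, _ in stratified))

systematic = systematic_sample(population, 10)
print(len(systematic))
```

In the stratified sample, the age groups appear in the same proportions as in the population (20%, 50% and 30% here), which is what a purely random draw only achieves on average.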

The choice of the appropriate sampling method depends on various factors, such as the research question, the characteristics of the population, the resources available, and the desired level of precision and accuracy. It is important to consider these factors carefully.

How to find the right sample size?

Before data collection and statistical analysis begin, it should be determined how large the sample should ideally be. This value depends on several factors. One of them is the size of the population itself: for small populations, the required sample grows with the population, whereas for very large populations it levels off and depends mainly on the desired precision. The sampling method also affects the required sample size. In a random sample, for example, a larger sample reduces the random sampling error, although it cannot remove bias that stems from a non-representative selection.

In addition, a certain buffer should always be planned for the size of the sample, especially for longer-term studies, as problems may arise in the course of the experiment that require members to be dropped from the sample, which reduces the sample size.
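One simple way to plan such a buffer is to divide the required sample size by the share of participants expected to remain. This is a rough sketch; the function name `recruits_needed` and the dropout rate of 25% are assumptions for illustration, and a realistic dropout rate has to come from experience with similar studies.

```python
import math


def recruits_needed(required_sample_size, expected_dropout_rate):
    """Number of participants to recruit so that the required size remains after dropout."""
    if not 0 <= expected_dropout_rate < 1:
        raise ValueError("expected_dropout_rate must be at least 0 and below 1")
    return math.ceil(required_sample_size / (1 - expected_dropout_rate))


# Example: 300 participants are needed at the end, and 25% are expected to drop out.
print(recruits_needed(300, 0.25))
```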

The desired degree of precision is another factor that influences the sample size. If a higher degree of precision is to be achieved, more members must be included in the sample. The desired confidence level, and with it the width of the confidence interval, also plays an important role here, as does the variability of the characteristic in the population: greater variability requires a larger sample size.
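How precision, confidence level and variability interact can be illustrated with a widely used textbook formula for estimating a proportion (often attributed to Cochran): n0 = z² · p · (1 − p) / e², optionally followed by a finite population correction. This is one common approach and not necessarily the one intended here; the function name and its parameters are chosen for this sketch. The term p · (1 − p) expresses the variability and is largest at p = 0.5.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin_of_error, confidence_level=0.95,
                               expected_proportion=0.5, population_size=None):
    """Cochran-style sample size n0 = z^2 * p * (1 - p) / e^2 with optional finite population correction."""
    # z is the two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    p = expected_proportion
    n0 = z ** 2 * p * (1 - p) / margin_of_error ** 2
    if population_size is not None:
        # Finite population correction: smaller populations need fewer respondents.
        n0 = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n0)


# Margin of error of 5 percentage points at 95% confidence, very large population.
print(sample_size_for_proportion(0.05))
# Higher precision (3 percentage points) requires a larger sample.
print(sample_size_for_proportion(0.03))
# The same 5-point margin for a population of only 1,000 people.
print(sample_size_for_proportion(0.05, population_size=1000))
```

The first call yields the familiar value of roughly 385 respondents. Tightening the margin of error or raising the confidence level increases the result, while a small population reduces it.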

Figure: The diagram shows the bell curve with the expected value in orange in the middle of the curve. Confidence interval for a normal distribution | Source: Author

Finally, the available resources should also be taken into account in order to determine the size of the sample. In many cases, the time and cost budget of the study limits the size of the sample.

Many different factors thus influence the size of the sample and should be taken into account. The most important ones include the size of the population, the sampling method, the desired degree of accuracy and the available budget. There are also formulas and software tools that can help calculate a suitable sample size based on these characteristics.

This is what you should take with you

A sample consists of individual elements, drawn from the set of all objects under study, from which data are collected in an investigation.
The population is the entirety of all units of study.
The use of samples is preferable to the use of the entire population for various reasons, such as practicality or resource efficiency.
Samples can be collected either by probability sampling or by non-probability sampling. The difference is that in probability sampling, every element of the population has a known, non-zero chance of appearing in the sample (in the simplest case, the same chance for everyone). In non-probability sampling, this is not the case.

The selection procedures for research units are described in more detail here.

Niklas Lang

I have been working as a machine learning engineer and software developer since 2020 and am passionate about the world of data, algorithms and software development. In addition to my work in the field, I teach at several German universities, including the IU International University of Applied Sciences and the Baden-Württemberg Cooperative State University, in the fields of data science, mathematics and business analytics.

My goal is to present complex topics such as statistics and machine learning in a way that makes them not only understandable, but also exciting and tangible. I combine practical experience from industry with sound theoretical foundations to prepare my students in the best possible way for the challenges of the data world.
