Logistic regression is a special form of regression analysis that is used when the dependent variable, i.e. the variable to be predicted, can only take on a certain number of possible values. This variable is then also referred to as being nominally or ordinally scaled. Logistic regression provides us with the probability that an observation belongs to a particular class.

### Possible research questions

In practical applications, it is very common that we do not want to predict a concrete numerical value, but simply determine which class or range a data point falls into. Here are a few classic, practical examples for which logistic regression can be used:

- **Politics**: Which of the five possible parties would a person vote for if there were an election next Sunday?
- **Medicine**: Is a person "susceptible" or "not susceptible" to a certain disease, depending on some medical parameters of the person?
- **Business**: What is the probability of buying a certain product depending on a person's age, place of residence, and profession?

### Logistic Regression Guide

In linear regression, we tried to predict a concrete value for the dependent variable instead of calculating a probability that the variable belongs to a certain class. For example, we tried to determine a student’s concrete exam grade as a function of the hours the student studied for the subject. The basis for the estimation of the model is the regression equation and, accordingly, the graph that results from it.

Example: We want to build a model that predicts the likelihood of a person to buy an e-bike, depending on their age. After interviewing a few subjects, we get the following picture:

From our small group of respondents, we can see the distribution that young people for the most part have not bought an e-bike (bottom left in the diagram) and older people in particular are buying an e-bike (top right in the diagram). Of course, there are outliers in both age strata, but the majority of respondents conform to the rule that the older you get, the more likely you are to buy an e-bike. We now want to prove this rule, which we have identified in the data, mathematically.
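To see how such a rule can be recovered from data, here is a minimal sketch that fits a logistic regression with scikit-learn. The survey data is simulated (the ages, sample size, and coefficients used to generate it are assumptions for illustration, not the article's actual respondents):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulate a hypothetical survey: respondent ages and whether each
# bought an e-bike (1) or not (0). Purchase probability rises with age.
rng = np.random.default_rng(0)
ages = rng.uniform(18, 75, size=200)
true_prob = 1 / (1 + np.exp(-(-3.0 + 0.06 * ages)))  # assumed generating rule
bought = (rng.uniform(size=200) < true_prob).astype(int)

# Fit a logistic regression of purchase on age
model = LogisticRegression()
model.fit(ages.reshape(-1, 1), bought)

# A positive age coefficient confirms the rule "older -> more likely to buy"
print("intercept:", model.intercept_[0])
print("age coefficient:", model.coef_[0][0])
```

The sign of the fitted age coefficient is what encodes the rule we read off the scatter plot.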

To do this, we have to find a function that is as close as possible to the point distribution that we see in the diagram and, in addition, only takes on values between 0 and 1. Thus, a linear function, as we have used it for the linear regression, is already out of the question, since it lies in the range between -∞ and +∞. However, there is another mathematical function that meets our requirements: the sigmoid function.

The functional equation of the sigmoid graph looks like this:

\[S(x) = \frac{1}{1+e^{-x}}\]
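A direct implementation shows the key property the article relies on, namely that the output always stays strictly between 0 and 1:

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic (sigmoid) function S(x) = 1 / (1 + e^(-x))."""
    return 1 / (1 + math.exp(-x))

print(sigmoid(0))    # 0.5 -- the midpoint of the curve
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```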

Or for our example:

\[P(\text{Purchase E-Bike}) = \frac{1}{1+e^{-(a + b_1 \cdot \text{Age})}}\]

Thus, we have a function that gives us the probability of buying an e-bike as a result and uses the age of the person as a variable. The graph would then look something like this:
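As a sketch of how this function behaves, we can plug a few ages into it. The coefficients a = 0.2 and b1 = 0.05 are taken from the worked example later in the article and are illustrative, not estimated from real data:

```python
import math

def purchase_probability(age: float, a: float, b1: float) -> float:
    """P(Purchase E-Bike) = 1 / (1 + e^-(a + b1 * age))."""
    return 1 / (1 + math.exp(-(a + b1 * age)))

a, b1 = 0.2, 0.05  # illustrative coefficients from the article's example
for age in (20, 40, 60):
    print(age, round(purchase_probability(age, a, b1), 3))
```

Because b1 is positive, the printed probabilities increase monotonically with age, exactly the S-shaped behaviour the graph shows.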

In practice, you often don’t see the notation we used. Instead, one rearranges the function so that the actual regression equation becomes clear:

\[\text{logit}(P(\text{Purchase E-Bike})) = a + b_1 \cdot \text{Age}\]
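This rearrangement works because the logit is the inverse of the sigmoid: applying it to the model's output recovers the linear term a + b1 · Age. A quick numerical check (using the illustrative coefficients 0.2 and 0.05 from the example below):

```python
import math

def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))

def logit(p: float) -> float:
    """logit(p) = ln(p / (1 - p)), the inverse of the sigmoid."""
    return math.log(p / (1 - p))

# The linear predictor for a 40-year-old, with illustrative coefficients
z = 0.2 + 0.05 * 40
print(logit(sigmoid(z)))  # recovers z (up to floating-point error)
```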

### Interpretation of the Logistic Regression

The correlations between independent and dependent variables obtained in a logistic regression are not linear and thus cannot be interpreted as easily as in a linear regression.

Nevertheless, a basic interpretation is possible. If the coefficient of the independent variable (age) is positive, then the probability given by the sigmoid function increases as the variable increases. In our case, this means that if b1 is positive, the probability of buying an e-bike rises with increasing age. The opposite also holds: if b1 is negative, the probability of an e-bike purchase decreases as age increases.

Beyond this, the coefficients of a logistic regression are difficult to interpret directly. In many cases, one therefore works with the so-called **odds**, i.e. the ratio of the probability of occurrence to the probability of non-occurrence:

\[\text{odds} = \frac{p}{1-p}\]

If you additionally take the natural logarithm of these odds, you get the so-called **logit**:

\[z = \text{logit}(p) = \ln\left(\frac{p}{1-p}\right)\]
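A small numerical example makes these two quantities concrete. With p = 0.75, the event is three times as likely to occur as not:

```python
import math

def odds(p: float) -> float:
    """Odds: probability of occurrence divided by probability of non-occurrence."""
    return p / (1 - p)

def logit(p: float) -> float:
    """Logit: natural logarithm of the odds."""
    return math.log(odds(p))

p = 0.75
print(odds(p))   # 3.0 -> the event is 3 times as likely to occur as not
print(logit(p))  # ln(3), roughly 1.0986
```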

This looks confusing. Let’s go back to our example to bring more clarity here. Assuming for our example we get the following logistic regression equation:

\[\text{logit}(P(\text{Purchase E-Bike})) = 0.2 + 0.05 \cdot \text{Age}\]

This function can be interpreted linearly: each additional year of age increases logit(p) by 0.05. By definition, logit(p) is nothing other than ln(p/(1-p)). So if ln(p/(1-p)) increases by 0.05, then p/(1-p) is multiplied by exp(0.05) (note: the natural logarithm ln and the exponential function exp cancel each other out). Thus, with every year a person gets older, the odds (not the probability!) of buying an e-bike are multiplied by exp(0.05) ≈ 1.051, i.e. they increase by about 5.1 percent.
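This per-year factor can be verified directly with the worked example's coefficients. The ratio of the odds between two consecutive ages is exactly exp(0.05), regardless of the age at which we evaluate it:

```python
import math

# Worked example from the text: logit(p) = 0.2 + 0.05 * age,
# so odds(age) = exp(0.2 + 0.05 * age)
def odds_at(age: float) -> float:
    return math.exp(0.2 + 0.05 * age)

# One extra year multiplies the odds by exp(0.05)
odds_ratio_per_year = math.exp(0.05)
print(odds_ratio_per_year)           # roughly 1.051, i.e. +5.1% odds per year
print(odds_at(41) / odds_at(40))     # same factor, independent of the age
```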

### This is what you should take with you

- Logistic regression is used when the outcome variable is categorical.
- We use the sigmoid function as the regression equation, which can only take values between 0 and 1.
- Logistic regression and its parameters are not as easy to interpret as linear regression.

### Other Articles on the Topic of Logistic Regression

- The University of Zurich has an interesting paper explaining logistic regression in detail and with examples.