Linear Regression – Basics

Regression is used to establish a mathematical relationship between two variables x and y. Both statistics and machine learning are concerned with how to explain a variable y using one or more variables x. Here are a few examples:

• What influence does the learning time (=x) have on the exam grade (=y)?
• How is the use of plant fertilizer (=x) related to the actual harvest (=y)?
• How does the crime rate in a city (=y) change depending on the number of police officers in the city (=x)?

The variable y to be explained is called the dependent variable, criterion, or regressand.

The explanatory variable x, on the other hand, is called the independent variable, predictor, or regressor.

The goal of linear regression is to formulate a mathematical relationship that quantifies the influence of x on y:

 $y = β_0 + β_1x + u$

• β0: Intercept with the y-axis, e.g. the exam grade that would be expected without studying.
• β1: Slope of the regression line, e.g. the effect that one additional hour of studying has on the exam grade.
• u: Error term, i.e. all influences that affect the exam grade but are not captured by the learning time, such as prior knowledge.
• y: Variable that we would like to predict with the help of linear regression.
• x: Variable that is used as the basis for the prediction and has an effect on y.

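Figuratively speaking, we try to find the straight line through the point cloud of data points that lies as close as possible to all points. In the common ordinary least squares approach, closeness is measured by the sum of the squared vertical distances between the points and the line, and the line with the smallest sum is chosen.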
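The idea of the best-fitting line can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the study hours and grades are made-up numbers, and the function names `fit_line` and `sum_squared_errors` are hypothetical. The sketch uses the standard closed-form least-squares solution for a single predictor, in which the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (beta0, beta1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


def sum_squared_errors(xs, ys, beta0, beta1):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up example data (assumption): hours studied and exam grades.
hours = [0, 2, 4, 6, 8, 10]
grades = [5.1, 4.5, 4.3, 3.7, 3.5, 2.9]

beta0, beta1 = fit_line(hours, grades)
print(f"intercept = {beta0:.3f}, slope = {beta1:.3f}")

# A line with a slightly different slope has a larger squared error.
best = sum_squared_errors(hours, grades, beta0, beta1)
other = sum_squared_errors(hours, grades, beta0, beta1 + 0.05)
print(best < other)
```

For these invented numbers the fitted intercept and slope come out close to the values 5.0 and -0.2 used in the example that follows.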

Example: Interpretation of the regression equation

Suppose we get the following regression equation for the exam preparation example:

 $y = 5.0 - 0.2x + u$

In this case, the y-axis intercept (β0) is 5.0, which means that students who do not study at all are expected to complete the exam with a grade of 5.0. Prior knowledge or knowledge left over from the course is not part of this value, since it is captured by the error term.

The regression weight (β1) in this case is -0.2, so every hour spent studying improves the exam grade by 0.2 (on this scale, a lower grade is a better grade). Thus, with 5 hours of study, the student’s expected grade would be 1.0 better than with no studying at all.

Overall, according to this regression, students could expect a final grade of 3.0 after 10 hours of studying. They can also read off that they should study at least 5 hours to pass, since 5 hours correspond to an expected grade of 4.0, taken here as the pass mark.
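The arithmetic behind these statements can be checked with a short Python sketch. The function names `expected_grade` and `hours_needed` are hypothetical. The pass mark of 4.0 is an assumption, inferred from the statement that 5 hours of studying are needed to pass. The error term u is ignored, so the results are expected values rather than individual predictions.

```python
# Coefficients from the example equation y = 5.0 - 0.2x (error term ignored).
BETA0 = 5.0
BETA1 = -0.2
# Assumption: 4.0 is the pass mark, inferred from the 5-hour statement.
PASS_MARK = 4.0


def expected_grade(hours):
    """Expected exam grade after the given number of study hours."""
    return BETA0 + BETA1 * hours


def hours_needed(target_grade):
    """Study hours at which the expected grade equals target_grade."""
    return (target_grade - BETA0) / BETA1


for h in (0, 5, 10):
    print(h, "hours ->", round(expected_grade(h), 2))

print("hours to reach the pass mark:", round(hours_needed(PASS_MARK), 2))
```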

Error term

So far, we have not discussed the error term u in more detail, although it is crucial for the interpretation of the regression. If we use only one or two independent variables for a regression, this is often not sufficient to capture all factors that influence the dependent variable y. Of course, it is not only the number of hours studied that determines the final exam grade. Several other factors play a role, e.g. the handling of stressful situations or the number of lectures attended.

This is not a problem in itself, since we only select the independent variables that are of interest for our evaluation. In our example, we only want to make an explicit statement about the relationship between learning and the exam grade. Therefore, we do not need to explicitly list the number of attended lectures as a variable but can leave it as one of many factors in the error term.
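A small simulation can illustrate why an omitted factor is harmless as long as it is unrelated to x. All numbers below are assumptions chosen for illustration: the coefficients mirror the exam example, `lectures` stands for the number of attended lectures, and it is generated independently of the study hours. The helper `fit_line` is a hypothetical name for a closed-form least-squares fit with one predictor.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (beta0, beta1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


rng = random.Random(42)  # fixed seed so the run is reproducible
n = 20_000

hours = [rng.uniform(0, 10) for _ in range(n)]
# Omitted factor, drawn independently of the study hours (assumption).
lectures = [rng.uniform(0, 12) for _ in range(n)]
# Assumed true model: each hour lowers the grade by 0.2,
# each attended lecture lowers it by 0.05, plus random noise.
grades = [
    5.3 - 0.2 * h - 0.05 * lec + rng.gauss(0, 0.3)
    for h, lec in zip(hours, lectures)
]

# Regress grades on hours only; lectures stays in the error term.
beta0, beta1 = fit_line(hours, grades)
print(f"estimated slope = {beta1:.3f} (true value: -0.2)")
```

Under these assumptions the estimated slope stays close to the true value of -0.2, because the omitted factor only adds noise and shifts the intercept by its average effect.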

However, it becomes critical when the independent variable “learning hours” correlates with a factor (see Correlation and Causation) that is still hidden in the error term. Then the estimated regression coefficient (β1) is biased, because it also picks up part of the effect of the hidden factor, and we make a mistake in the interpretation.

Suppose we want to determine how the level of education affects the wage per hour. To do this, we use education in years as the independent variable x and current hourly wage as the dependent variable y:

 $\text{Hourly Wage} = β_0 + β_1 \cdot \text{(Education in Years)} + u$

The error term in this example would contain factors such as seniority, number of promotions, or general intelligence. Problems arise if we use this equation to interpret β1 as the impact that an additional year of education has on hourly wages. The intelligence factor is most likely positively correlated with the education variable: a person with a higher intelligence quotient is also very likely to have a higher degree and thus to have spent more years in school or at university. If intelligence also raises wages directly, β1 attributes part of that effect to education and overstates the impact of an additional year.
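The wage example can be made concrete with a simulation. The sketch below is purely illustrative: the data-generating process, all coefficients, and the names `ability` and `fit_line` are assumptions and do not come from real wage data. Here `ability` plays the role of the hidden intelligence factor. It raises both the years of education and the hourly wage, and it is left out of the regression.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (beta0, beta1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20_000

# Hypothetical data-generating process (all numbers are assumptions).
ability = [rng.gauss(0, 1) for _ in range(n)]
# Education in years rises with ability, so the two are positively correlated.
education = [12 + 2 * a + rng.gauss(0, 1) for a in ability]
# Assumed true effect of one extra year of education on the hourly wage.
TRUE_BETA1 = 1.0
wage = [
    10 + TRUE_BETA1 * e + 3 * a + rng.gauss(0, 2)
    for e, a in zip(education, ability)
]

# Regress wage on education only; ability is hidden in the error term.
_, beta1 = fit_line(education, wage)
print(f"true effect = {TRUE_BETA1}, estimated effect = {beta1:.2f}")
```

With these assumed numbers the estimate lands at roughly 2.2 instead of 1.0. The regression credits education with part of the wage effect that actually comes from the hidden factor.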

This is what you should take with you

• Linear regression is a special case of regression analysis.
• It attempts to find a linear function that describes how the independent variable x influences the dependent variable y.
