The evaluation of a model plays a crucial role in machine learning, as it determines whether the model can be used for new applications or should be retrained. Various metrics, such as accuracy, precision, recall, or the F1-score, are used for this purpose. The F1-score combines precision and recall into a single weighted metric that gives an honest picture of the model's performance.
This article looks at the F1-score in detail, explains how it works, and shows why it is so important for evaluating a machine learning model. Before we can start with this detailed analysis, however, its two components, precision and recall, must first be understood in more detail.
How do you evaluate a classification?
In the simplest case, a classification consists of two states. Suppose we want to investigate how well corona tests reflect the infection status of a patient. In this case, the corona test serves as a classifier with exactly two states: infected or not infected.
These two classes can result in a total of four states, depending on whether the classification of the test was correct:
- True Positive: The rapid test classifies the person as infected and a subsequent PCR test confirms this result. The rapid test was therefore correct.
- False Positive: The rapid test is positive for a person, but a subsequent PCR test shows that the person is not infected, i.e. negative.
- True Negative: The rapid test is negative and the person is not infected.
- False Negative: The Corona rapid test classifies the person tested as healthy, i.e. negative, but the person is infected and should therefore have a positive rapid test.
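To make these four cases concrete, here is a minimal Python sketch using scikit-learn's `confusion_matrix`; the labels are made up purely for illustration (1 = infected/positive, 0 = not infected/negative):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical example: 1 = infected (positive), 0 = not infected (negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # result of the confirmatory PCR test
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # result of the rapid test (the classifier)

# For binary labels 0/1, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")  # TP=3, FP=1, TN=3, FN=1
```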
What is Precision?
Precision is one of the most widely used metrics when analyzing models. It measures the proportion of positive predictions in a classification that were correctly classified, i.e. the number of true positives relative to the total number of positive predictions. It therefore measures the model's ability to avoid false positive errors.
\[\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\]
In the case of spam detection, for example, precision measures the proportion of emails that actually were spam, i.e. positive, among the emails that were classified as spam by the program. This denominator is made up of the true positives, i.e. the correctly predicted positive values, and the false positives, i.e. the incorrectly classified positive values. A high precision value means that a large proportion of the emails declared as spam really were spam and only a few emails ended up in the spam folder by mistake. A low value, on the other hand, indicates a high number of incorrectly declared spam emails, which can lead to a high level of frustration for the user.
Another error also indirectly influences the precision value. The false negative error occurs in cases where the machine learning model predicts a negative result, although this does not correspond to reality. In the case of a false negative error, actual spam emails would end up in the inbox because they were not recognized and were classified as normal emails. This error indirectly influences the precision, as fewer emails can be classified as true positives. This is because more true positive cases automatically mean fewer false negative cases and vice versa.
To optimize precision, it is therefore important to reduce the number of false positives while at the same time not increasing the number of false negatives too much and keeping them at an acceptable level. In applications where both false positives and false negatives are equally bad, looking at precision is usually not enough. Therefore, other metrics, such as recall or F1 score, should also be considered to get a more complete picture of model performance.
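Building on the hypothetical labels from the confusion-matrix sketch above, precision can be calculated by hand from the counts or directly with scikit-learn's `precision_score`:

```python
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# Manual calculation: TP / (TP + FP) = 3 / (3 + 1) = 0.75
print(precision_score(y_true, y_pred))  # 0.75
```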
What is Recall?
In machine learning, recall refers to the ability of a model to correctly identify the relevant, i.e. positive, instances in a data set. It is calculated as the ratio between the correctly predicted positive instances (true positives) and the total number of actual positive instances (true positives + false negatives). This key figure therefore also measures whether the number of false negatives is kept within limits, which precision does not take into account.
\[\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\]
Recall is particularly important in applications where the focus is on identifying all positive instances and where false positives can be tolerated. In medicine, for example, it is more important to recognize all sick people, even if this means that a few healthy people may also be incorrectly classified as sick. If a sick patient is not recognized, this can have much more serious consequences than mistakenly treating a healthy person.
Recall, also known as sensitivity, takes on values between 0 and 1, where 1 stands for a model that correctly predicts all positive data points as positive. Zero, on the other hand, means that the model has not correctly identified a single positive instance in the data set.
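Analogously, recall can be computed for the same hypothetical labels with scikit-learn's `recall_score`:

```python
from sklearn.metrics import recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# Manual calculation: TP / (TP + FN) = 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))  # 0.75
```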
What is the F1-Score?
The F1-score is a popular metric for the evaluation of binary classification models and represents the harmonic mean of precision and recall. It is particularly useful if both metrics are to be weighted equally. It is calculated as twice the product of precision and recall divided by their sum:
\[\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]
This score can take values between 0 and 1, with a high value indicating good model performance with high precision and high recall. The metric is particularly useful when the data set is unbalanced and there is therefore a risk that the model is biased towards the more frequent class. In applications such as medicine or fraud detection in the financial sector, healthy people or normal transactions occur much more frequently and therefore make up the majority of the data sets. To ensure that a model which simply classifies most data points as healthy people or normal transactions is not rated too favorably, the F1-score is used to account for this imbalance.
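Using the same hypothetical labels once more, the F1-score can be derived manually from precision and recall or computed directly with scikit-learn's `f1_score`:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # 0.75
recall = recall_score(y_true, y_pred)        # 0.75

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual)                 # 0.75
print(f1_score(y_true, y_pred))  # 0.75
```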
How can you interpret the score?
The F1-score is a key figure used to measure the performance of classification models. It represents the harmonic mean of precision and recall (sensitivity). The values range from 0 to 1, with a value of 1 representing perfect precision and recall. A value of 0, on the other hand, stands for a classification that performs poorly. In general, a higher F1-score is equivalent to a better model.
However, the interpretation of the value varies from application to application, depending on how strongly precision and recall are to be weighed against each other. The F1-score weights the two key figures equally, yet in practice one of them may be slightly more important. The F1-score should therefore be interpreted specifically for the application, and no general statement can be made about what counts as a sufficiently high F1-score.
Which applications use the F1-Score?
This metric is a mixture of precision and sensitivity and is used in classification problems to determine how good the model performance is. Such classifications occur in a wide variety of applications. Among the most common are:
- Medical diagnosis: When diagnosing patients, it is important to ensure that the model identifies all sick individuals so that they can all receive appropriate treatment and hopefully get better. Good model performance is also characterized by a good balance between sensitivity and precision.
- Fraud detection: Fraud detection is used, for example, in the financial sector to distinguish normal transactions from fraudulent transactions. These data sets are often unbalanced in the two classes, so it makes sense to use the F1 score.
- Sentiment analysis: In the field of natural language processing, texts are analyzed to determine whether they were written with a positive, negative, or neutral sentiment. This makes it easy to pre-process large text files that contain product reviews, for example, to obtain an initial overview.
- Image classification: When processing images, classifications are made to decide whether certain objects or patterns are recognizable in the image.
- Spam filtering: Many email providers offer spam filtering to protect the user from fraudulent or annoying emails and move them to the spam folder. However, if important, harmless emails also end up there, this can lead to a poor user experience. This is why good classification models are needed to stand out from the competition.
In general, the F1-score is a useful metric for any classification problem where both precision and recall play a role.
What are the limitations of the F1-Score and valid alternatives?
Although the F1-score is widely used and very popular, it also has some limitations that should be considered before using it. Therefore, in this section, we take a closer look at the limitations of this metric and at the same time provide alternatives to circumvent these problems.
The main limitation of the F1-score is that it is the harmonic mean between precision and recall, so both metrics are weighted equally. In some applications, however, one of the two parameters is more important than the other. In fraud detection, for example, it may be more important to have a high recall, i.e. to detect as many cases of fraud as possible, while somewhat neglecting precision, as false positives would only lead to more investigations. For such cases, other metrics such as the F2-score or, more generally, the F-beta score can be used, which weight recall more heavily than precision.
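As a small illustration, scikit-learn provides this weighting via `fbeta_score`; the labels below are made up so that recall (0.75) is higher than precision (0.6), which is why the F2-score comes out higher than the F1-score:

```python
from sklearn.metrics import f1_score, fbeta_score

# Hypothetical labels where recall (0.75) exceeds precision (0.6)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

print(f1_score(y_true, y_pred))             # ~0.667
# beta > 1 weights recall more heavily; beta = 2 gives the F2-score
print(fbeta_score(y_true, y_pred, beta=2))  # ~0.714
```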
Another limitation is that true negatives are not taken into account. These are only taken into account indirectly via the false positives but are not actively optimized. If all cases of the confusion matrix are to be taken into account in the optimization, the Matthews correlation coefficient, for example, can be used to calculate a key figure from all cases.
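A minimal sketch for the Matthews correlation coefficient, again with made-up labels, could use scikit-learn's `matthews_corrcoef`:

```python
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

# MCC uses all four cells of the confusion matrix (TP, FP, TN, FN)
# and ranges from -1 (total disagreement) to +1 (perfect prediction)
print(matthews_corrcoef(y_true, y_pred))  # ~0.26
```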
What are alternatives for the F1-Score?
In addition to or instead of the F1-score, other metrics can also be used to evaluate classifications. Here are some of the most common metrics:
- Precision and Recall: instead of evaluating the mixture of these two metrics, they can also be used individually, depending on how much emphasis is to be placed on each metric.
- AUC-ROC: The area under the ROC curve is also often used to evaluate a model that has to decide between two classes. This evaluates the ability to correctly distinguish between the two classes.
- Log loss: The log loss or binary cross-entropy measures the difference between the predicted and actual class probability. This is particularly useful for probabilistic models and is also used as a loss function for neural networks.
- G-mean: Similar to the F1-score, the G-mean combines two key figures and forms the mean value from them. In this case, the geometric mean of sensitivity and specificity is formed.
Depending on the application and the respective model, one or more suitable evaluation metrics should be selected. These are also always dependent on the data set and its balance.
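For the probabilistic metrics mentioned above, a short sketch with made-up predicted probabilities might look as follows; `roc_auc_score` and `log_loss` both come from scikit-learn:

```python
from sklearn.metrics import roc_auc_score, log_loss

# Hypothetical true labels and predicted probabilities for the positive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.65, 0.4, 0.7, 0.3, 0.2, 0.1]

print(roc_auc_score(y_true, y_prob))  # how well positives are ranked above negatives
print(log_loss(y_true, y_prob))       # penalizes confident but wrong probabilities
```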
What are the best practices for using the F1-score?
When using the F1-score, it can be helpful to apply the following best practices:
- Depending on the use case, ensure that the F1-score is a good metric for evaluating the classifier.
- It should also be made clear what impact the respective errors, i.e. the false positives and false negatives, have on the problem and how critical they are.
- Depending on the data set, the data should be pre-processed in such a way that the class distribution is as balanced as possible. If this is not possible or would lead to a significant reduction in the data set size, this can also be omitted.
- Cross-validation can make the F1-score even more meaningful by calculating the score on several different test sets.
- The F1 score should not be the only metric used to assess performance. Depending on the application, other metrics should also be considered.
- The threshold value determines the probability at which a prediction is assigned to one class or the other. It should be chosen so that a good compromise between precision and recall is achieved (see the sketch below).
- Monitor the F1 score during training to be able to intervene at an early stage or identify optimization potential.
These best practices provide a good basis for the correct application of the F1 score and increase the probability of training a good and robust model.
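To illustrate the threshold and cross-validation points from the list above, the following sketch trains a simple logistic regression on a synthetic, imbalanced data set, monitors the F1-score via cross-validation, and then picks the decision threshold that maximizes the F1-score on a held-out set. The data set, model, and split are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced binary data set (roughly 90% negative class)
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000)

# Cross-validated F1-score on the training data
print(cross_val_score(model, X_train, y_train, cv=5, scoring="f1").mean())

# Choose the decision threshold that maximizes the F1-score on the held-out set
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]
print(best_threshold)
```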
This is what you should take with you
- The F1-score is a popular choice for evaluating the performance of classification models.
- It represents the harmonic mean of precision and recall and forms a single metric containing the two metrics.
- The interpretation of the value depends on the particular application and the weighting towards recall or precision.
- Other metrics such as the F2 score can overcome the limitations of the F1-score and should also be considered in the evaluation.
- Effective model evaluation depends crucially on the choice of an appropriate metric, so this should not be underestimated.
Other Articles on the Topic of F1-Score
In the Scikit-Learn documentation, you can find instructions on how to use the F1-score in their library.