
What is the Confusion Matrix?

A confusion matrix is a tool for assessing the quality of a classification model in Machine Learning. It shows how many predictions were assigned to the correct class and how many ended up in a wrong one.

How to judge a Classification?

In the simplest case, a classification consists of two states. Suppose we want to investigate how well Corona rapid tests reflect the infection status of a patient. In this case, the rapid test serves as a classifier with two possible outcomes: infected or not infected.

These two classes can result in a total of four states, depending on whether the classification of the test was really correct:

  • True Positive: The rapid test classifies the person as infected and a subsequent PCR test confirms this result. Thus, the rapid test was correct.
  • False Positive: The rapid test is positive for a person, but a subsequent PCR test shows that the person is actually not infected, i.e. negative.
  • True Negative: The rapid test is negative and the person is actually not infected.
  • False Negative: The Corona rapid test classifies the tested person as healthy, i.e. negative; however, the person is actually infected and should therefore have tested positive.

What is a Confusion Matrix?

The confusion matrix helps to evaluate the quality of a trained classifier objectively. Furthermore, several specific metrics can be calculated easily with its help. To build the confusion matrix, the classifier is applied to the test set of the dataset, so that every data point has both a predicted and an actual class.

The matrix is composed of the four outcome types mentioned above. In the illustration below, the rows contain the predicted classes for the test set and the columns contain the actual labels:

Figure: Structure of the Confusion Matrix

Assume that our test set for the Corona rapid tests includes 200 individuals, broken down by cell as follows:

Figure: Example confusion matrix for the Corona rapid test classifier

Specifically, the matrix results in the following values:

  • True Positive = 120: A total of 120 people were classified as infected by the rapid test and were actually carrying the virus.
  • False Positive = 20: 20 people are sick according to the rapid test but are not actually infected.
  • False Negative = 40: 40 subjects tested negative on the rapid test but are actually infected.
  • True Negative = 20: 20 people are actually not infected, and the rapid test correctly classified them as negative.

The confusion matrix is therefore often used to determine which type of error a classifier makes frequently. Our exemplary Corona rapid test is correct in 70 % of the cases ((120 + 20) / 200), which does not look bad at first glance. However, a false negative occurs in 20 % (40 / 200) of all cases, meaning that in 20 % of all cases a person is classified as healthy although they are actually sick and contagious. In the case of a viral disease, it is therefore not only the accuracy that is decisive but also the false negative rate.

Ratios like these can be read off the confusion matrix and calculated with little effort.
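These two figures can be verified with a few lines of Python. The following snippet simply reproduces the arithmetic from the cell counts of the example matrix; the variable names are purely illustrative.

# Cell counts from the Corona rapid test example above
tp, fp, fn, tn = 120, 20, 40, 20
total = tp + fp + fn + tn              # 200 test subjects

# Overall share of correct classifications
accuracy = (tp + tn) / total           # (120 + 20) / 200

# Share of all test subjects who received a false negative result
false_negative_share = fn / total      # 40 / 200

(accuracy, false_negative_share)

Out: 
(0.7, 0.2)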

What ratios can be calculated from the Confusion Matrix?

Since each use case focuses on different aspects, a number of such metrics have evolved over time. In this chapter, we briefly present the most important ones.

Sensitivity

The sensitivity, or True Positive Rate, describes the proportion of actually positive data points that were also classified as positive:

\[\text{Sensitivity} = \frac{\text{True Positive}}{\text{True Positive + False Negative}}\]

Specificity

Specificity, or True Negative Rate, measures the proportion of actually negative data points that were also classified as negative:

\[\text{Specificity} = \frac{\text{True Negative}}{\text{True Negative + False Positive}}\]

Precision

Precision is the proportion of positively classified data points that are actually positive:

\[\text{Precision} = \frac{\text{True Positive}}{\text{True Positive + False Positive}}\]

Accuracy

Accuracy is already familiar from other types of models. It describes the proportion of correctly classified data points among all classifications:

\[\text{Accuracy} = \frac{\text{True Positive + True Negative}}{\text{True Positive + True Negative + False Positive + False Negative}}\]

Error Rate

The error rate is the complement of the accuracy, i.e. the proportion of incorrect classifications:

\[\text{Error Rate} = \frac{\text{False Positive + False Negative}}{\text{True Positive + True Negative + False Positive + False Negative}}\]
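To make the formulas tangible, the following short sketch computes all five ratios for the Corona example from above (True Positive = 120, False Positive = 20, False Negative = 40, True Negative = 20). The variable names are only illustrative.

# Cell counts from the Corona rapid test example
tp, fp, fn, tn = 120, 20, 40, 20

sensitivity = tp / (tp + fn)                   # 120 / 160
specificity = tn / (tn + fp)                   # 20 / 40
precision   = tp / (tp + fp)                   # 120 / 140
accuracy    = (tp + tn) / (tp + tn + fp + fn)  # 140 / 200
error_rate  = (fp + fn) / (tp + tn + fp + fn)  # 60 / 200

(sensitivity, specificity, round(precision, 3), accuracy, error_rate)

Out: 
(0.75, 0.5, 0.857, 0.7, 0.3)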

What is the difference between accuracy and precision?

At first glance, the formulas for accuracy and precision look relatively similar, but they measure slightly different things. A good machine learning model must provide good predictions for new, unseen data after training. However, such a model is only valuable if it delivers good predictions consistently rather than only occasionally. These properties can be measured with accuracy and precision.

Accuracy measures how close the predictions are to the desired values; in the case of classification, this means how often the predicted class corresponds to the actual class. Precision, on the other hand, measures how sharp, i.e. how reproducible, the results are. Specifically, it indicates how close the results for inputs with similar characteristics lie to each other. A model is not really useful if it returns two very different predictions for practically the same input.

What is the F-Score?

The F-score, also known as the F1 score, is a widely used metric in machine learning and data analysis for evaluating the performance of classification models. It combines a model’s precision and recall, where precision is the proportion of true positives among all predicted positives and recall is the proportion of true positives among all actual positives.

The F-score is the harmonic mean of precision and recall, and it provides a single number that summarizes a model’s performance in terms of both. It ranges from 0 to 1, with a higher score indicating better performance. The F1 score is often used in situations where there is an imbalance between the number of positive and negative examples in the data, as it provides a more balanced evaluation of a model’s performance than accuracy alone.

Precision and recall are two important metrics for evaluating models and their predictions. To keep track of both at once, one looks at the so-called F-score, which puts the two metrics into relation:

\[\text{F-Score} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}\]

The use of the so-called harmonic mean, rather than the classical arithmetic mean, ensures that an extreme value of one of the two quantities pulls the score down significantly more strongly.
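Applied to the Corona example, with a precision of 120 / 140 ≈ 0.86 and a recall (sensitivity) of 120 / 160 = 0.75, the F-score works out to 0.8, as the short sketch below shows.

# Precision and recall from the Corona rapid test example
precision = 120 / (120 + 20)
recall    = 120 / (120 + 40)

# Harmonic mean of the two metrics
f_score = 2 * recall * precision / (recall + precision)
round(f_score, 3)

Out: 
0.8

If the recall dropped to 0.1 while the precision stayed the same, the arithmetic mean would still be about 0.48, whereas the F-score would fall to roughly 0.18, which illustrates how strongly the harmonic mean punishes an extreme value.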

What is Supervised Learning?

Supervised Learning algorithms use labeled datasets to learn relationships between the inputs and outputs and then use these to make the desired predictions. Ideally, the prediction and the label from the dataset are identical. The training dataset contains the inputs together with the correct outputs for them, which the model uses to learn over several iterations. The accuracy in turn indicates how often the correct output could be predicted from the given inputs. During training, a loss function measures the prediction error, and the algorithm tries to minimize it until a satisfactory result is achieved.

Supervised Learning can be divided into two broad categories:

Classification is used to assign new data objects to one or more predefined categories. The model tries to recognize patterns in the inputs that indicate the assignment to a category. An example of this is image recognition: the model can predict for a given image, for example, whether a dog is visible in it or not.

Regressions explain the relationship between inputs, called independent variables, and outputs, called dependent variables. For example, if we want to predict the sales of a company and have the marketing activity and the average price of the previous year as inputs, the regression can provide information about the influence of marketing efforts on sales.
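As a minimal illustration of the two categories, the following sketch trains a simple classifier and a simple regressor with Scikit-Learn; the chosen toy datasets and models are arbitrary and serve only as an example.

# Minimal sketch: one classifier and one regressor with Scikit-Learn
from sklearn.datasets import load_iris, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: predict a discrete class (here: the iris species)
X_clf, y_clf = load_iris(return_X_y=True)
classifier = LogisticRegression(max_iter=1000).fit(X_clf, y_clf)
print(classifier.predict(X_clf[:3]))    # discrete class labels

# Regression: predict a continuous value (here: a synthetic target)
X_reg, y_reg = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=0)
regressor = LinearRegression().fit(X_reg, y_reg)
print(regressor.predict(X_reg[:3]))     # continuous predictions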

How to create a Confusion Matrix in Python?

Once you have trained a classifier in Python, the confusion matrix can be created using the “Scikit-Learn” module. To do this, we just need two arrays: the predicted classes for the test set and the actual labels of the test set. In our case, the classification model can distinguish between the classes “cat”, “ant”, and “bird”. Accordingly, we expect a confusion matrix with three rows and three columns. Note that Scikit-Learn places the actual labels in the rows and the predicted labels in the columns.

# Import Scikit-Learn
from sklearn.metrics import confusion_matrix

# True Labels of Testset
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]

# Predicted Labels of Testset
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]


# Create Confusion Matrix
confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])

Out: 
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]], dtype=int64)

This confusion matrix can now also be used to calculate the true positives, false positives, etc. For a classification with more than two classes, doing this by hand quickly becomes complicated.

# Get Figures of a two-class Classification
tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()
(tn, fp, fn, tp)

Out: 
(0, 2, 1, 1)

# Get Figures of a multi-class Classification
confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"]).ravel()

Out: 
array([2, 0, 0, 0, 0, 1, 1, 0, 2], dtype=int64)
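For the multi-class case, the per-class figures do not have to be worked out by hand. Assuming NumPy is available, they can be derived from the matrix in a one-vs-rest fashion, keeping in mind that Scikit-Learn places the actual labels in the rows and the predictions in the columns.

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]

cm = confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])

# One-vs-rest figures per class: the diagonal holds the true positives,
# the rest of the column the false positives, the rest of the row the
# false negatives, and everything else the true negatives
tp = np.diag(cm)                 # [2, 0, 2] for ant, bird, cat
fp = cm.sum(axis=0) - tp         # [1, 0, 1]
fn = cm.sum(axis=1) - tp         # [0, 1, 1]
tn = cm.sum() - (tp + fp + fn)   # [3, 5, 2]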

This is what you should take with you

  • The confusion matrix helps in the evaluation of classification models.
  • In most cases, the confusion matrix contains four fields: True Positive, True Negative, False Positive, and False Negative.
  • The fields can be used to calculate specific metrics that help evaluate the classifier.
