The receiver operating characteristic curve, or ROC curve for short, is a popular metric in machine learning for evaluating the quality of classification models. It can be used to graphically compare different threshold values and their impact on overall performance. Specifically, it compares the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) at different thresholds. In this article, we want to take a closer look at the ROC curve and better understand its interpretation and applications.
How do you judge a Classification?
In the simplest case, a classification distinguishes between two states. Suppose we want to investigate how well COVID-19 rapid tests reflect the infection status of a patient. The rapid test then acts as a classifier with exactly two possible outputs: infected or not infected.
These two classes can result in a total of four states, depending on whether the classification of the test was correct:
- True Positive: The rapid test classifies the person as infected and a subsequent PCR test confirms this result. Thus, the rapid test was correct.
- False Positive: The rapid test is positive for a person, but a subsequent PCR test shows that the person is not infected, i.e. negative.
- True Negative: The rapid test is negative and the person is not infected.
- False Negative: The rapid test classifies the tested person as healthy, i.e. negative; however, the person is infected and should therefore have received a positive result.
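These four outcome types can be counted with a small helper function. The sketch below uses illustrative toy data (the labels and function name are not from the article), with 1 standing for "infected" and 0 for "not infected":

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels: 1 = infected, 0 = healthy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

y_true = [1, 1, 0, 0, 1, 0]  # ground truth, e.g. PCR result
y_pred = [1, 0, 0, 1, 1, 0]  # classifier output, e.g. rapid-test result
print(confusion_counts(y_true, y_pred))  # (2, 1, 2, 1)
```

From these four counts, the true positive rate (TP / (TP + FN)) and the false positive rate (FP / (FP + TN)) used throughout this article can be derived directly.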
What is the ROC Curve and how do you interpret it?
The ROC curve is a graphical representation for assessing the quality of a classifier. For this purpose, a two-dimensional diagram is created that plots the true positive rate on the y-axis and the false positive rate on the x-axis. An existing classification model is then used and various threshold values are tested. A threshold value specifies the probability at which an instance is evaluated as positive or negative. For example, a model that decides whether patients are classified as sick or healthy does not directly return two classes, but a probability, such as 0.67. The threshold value decides at which of these probabilities the patient is really classified as sick.
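Turning a predicted probability into a class label via a threshold can be sketched in a few lines (the probabilities below are made-up examples):

```python
# A model outputs probabilities; the threshold turns them into class labels.
def classify(probs, threshold):
    """Label an instance positive (1) if its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

probs = [0.67, 0.12, 0.85, 0.45, 0.33]
print(classify(probs, 0.5))  # [1, 0, 1, 0, 0]
print(classify(probs, 0.3))  # [1, 0, 1, 1, 1]
```

Note how lowering the threshold from 0.5 to 0.3 turns two more instances positive: exactly the effect the next paragraph describes.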
The trade-off is as follows: a high threshold keeps the false positive rate low, because only the most certain cases are classified as sick and few healthy people are falsely flagged. However, a high threshold also means that genuinely ill people whose probability falls below it are missed, so the number of false negatives rises and the true positive rate drops as well. Lowering the threshold has the opposite effect: more true positives are caught, but more false positives slip in too.
To create the curve, various threshold values are tested and the corresponding rates are plotted. This typically results in a concave, monotonically increasing curve from the bottom left (0, 0) to the top right (1, 1). A perfect model would pass through the top left-hand corner, the point with a true positive rate of 1 and a false positive rate of 0.
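The construction described above, computing a (FPR, TPR) pair for each candidate threshold, can be sketched from scratch. The labels, scores, and thresholds below are toy values chosen for illustration:

```python
def roc_points(y_true, scores, thresholds):
    """Return (FPR, TPR) pairs for a list of candidate thresholds."""
    pos = sum(y_true)             # number of positive instances
    neg = len(y_true) - pos       # number of negative instances
    points = []
    for thr in thresholds:
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        points.append((fp / neg, tp / pos))
    return points

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(roc_points(y_true, scores, [0.1, 0.5, 0.95]))
```

A threshold of 0.1 classifies everything as positive (top right of the curve), 0.95 classifies everything as negative (bottom left), and intermediate thresholds trace out the points in between. Libraries such as scikit-learn provide `roc_curve` for the same computation over all distinct score values.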
In addition, a diagonal line from the bottom left to the top right is drawn in the diagram; it represents a classifier that guesses completely at random, for which the true positive rate always equals the false positive rate. In reality, the curves of trained models lie somewhere between this diagonal and the top left-hand corner. The steeper the curve and the closer it runs to the upper left-hand corner, the better the model.
It is important to note that this graph does not make any statement about predictions of individual data points, but merely represents an overall impression of the model that makes it comparable with other models. Other metrics such as precision, recall or F1 score should be used in addition to get a more general picture of the model’s performance.
What is the Area under the Curve?
The ROC curve is a good graph to show the trade-off between the true positive rate and the false positive rate of a model. The curves of individual classifiers can also be compared by plotting them on a common graph. However, a single number that summarizes the graph is still needed for easy comparability.
In order to have a value for the classification performance, the area under the curve (AUC) is calculated. This area can have a value between 0 and 1, with a value of 1 representing a perfect classifier. The larger the area under the curve, the better the model. Mathematically speaking, this area indicates the probability that the model ranks a randomly selected positive instance higher than a randomly selected negative instance.
A completely random model that assigns each instance by a coin toss would have an AUC value of 0.5 (the area under the diagonal). If a model achieves a value below 0.5, it is worse than random assignment; a value above 0.5 means that the trained model is better than a random decision. A value close to 1 indicates that the classifier can already distinguish very well between positive and negative instances.
The advantage of the AUC value as a performance measure is that it is not dependent on the class distribution in the data set and the selection of the threshold value. In addition, this key figure offers a good opportunity to illustrate the performance of the model by interpreting it as a probability. For example, an AUC value of 0.7 can be interpreted to mean that the model rates a random positive instance higher than a random negative instance with a probability of 70%.
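The probabilistic interpretation can be computed directly: compare every positive instance with every negative instance and count how often the positive one receives the higher score (ties count as half). This is a from-scratch sketch on toy data, not a replacement for library functions such as scikit-learn's `roc_auc_score`:

```python
from itertools import product

def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc_pairwise(y_true, scores))  # 8 of 9 pairs correct ≈ 0.889
```

Here only one pair is ranked wrongly (the positive scored 0.6 against the negative scored 0.7), so the AUC is 8/9.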
How can you use the ROC Curve in classifications with multiple categories?
The ROC curve as we have seen it so far is defined exclusively for binary classifications. In practice, however, there are so-called multi-class problems in which a model has to learn to assign a data point to one of several classes. In order to still be able to use the ROC curve, the approach must be changed to a binary situation.
To do this, one class is singled out and defined as the positive class, while the remaining classes are combined into a single negative class. This is known as the one-versus-all (OVA) approach. A ROC curve can then be calculated for each case in which one of the classes plays the positive role, and these curves can be combined into a multi-class curve.
This multi-class curve is usually created with so-called micro- or macro-averaging. In micro-averaging, the true positive, false positive and false negative counts are summed across all classes and converted into a single curve. Because every individual prediction counts equally, classes that occur more frequently in the data set carry more weight. With macro-averaging, on the other hand, a separate curve is calculated for each class and the mean is then taken over all curves. With this approach, all classes are weighted equally regardless of their size.
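Macro-averaging with the one-versus-all approach can be sketched by computing a binary AUC per class and averaging the results. All data below is an illustrative toy example; for real use, scikit-learn's `roc_auc_score(..., multi_class="ovr")` implements the same idea:

```python
def binary_auc(y_true, scores):
    """Pairwise AUC for binary labels (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(y_true, prob_matrix, classes):
    """Macro-averaged one-versus-all AUC; prob_matrix[i][k] is the
    predicted probability of class k for sample i."""
    aucs = []
    for k, cls in enumerate(classes):
        binary = [1 if t == cls else 0 for t in y_true]   # one class vs. rest
        scores = [row[k] for row in prob_matrix]
        aucs.append(binary_auc(binary, scores))
    return sum(aucs) / len(aucs)

y_true = ["a", "b", "c", "a", "b", "c"]
probs = [[0.8, 0.1, 0.1],
         [0.2, 0.6, 0.2],
         [0.1, 0.2, 0.7],
         [0.3, 0.5, 0.2],   # an "a" instance the model leans toward "b"
         [0.3, 0.5, 0.2],
         [0.2, 0.2, 0.6]]
print(round(macro_ovr_auc(y_true, probs, ["a", "b", "c"]), 4))  # 0.9583
```

Micro-averaging would instead pool the per-instance counts across all three binarized problems before computing a single curve.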
The ROC curve is usually used for binary classifications, but can also be extended to multi-class problems using the methods described. The AUC remains a valuable measure for evaluating the overall performance of the model and can be compared with other models for the same data set in order to find a suitable model architecture.
How does it compare to other evaluation metrics?
The ROC curve is a popular graph suitable for binary classifications. However, it is only one of many ways to evaluate classification models and it should be decided which evaluation metric is most appropriate depending on the application. In certain scenarios, other metrics such as recall or F1-score may be more relevant.
In medical diagnosis, it is important to have a high hit rate and to recognize all positive instances, i.e. sick patients. To achieve this, it is also acceptable that a certain degree of precision is lost and therefore some false positive errors may be included. In such cases, the ROC curve may not be optimally suited as it weights the true positive rate and the false positive rate equally.
In many applications, balanced data sets with equal numbers of positive and negative instances are very difficult to obtain. In cases with unbalanced data, the ROC curve can be misleading and may not adequately reflect the performance of the model. In such applications, the precision-recall curve is often more suitable, as it is more informative on unbalanced data sets.
Overall, the ROC curve is a very useful and widely used evaluation metric, but it should be adapted to the application. In addition, depending on the use case, it should also be decided whether other metrics may be useful in order to obtain a more general picture of the model.
Can you use it in case of imbalanced datasets?
The ROC curve is a popular choice for evaluating binary classification models, but it can have problems with unbalanced data sets and give a false picture of model performance. In practice, datasets are often unbalanced because the positive class is usually underrepresented compared to the negative class. In medical analysis, for example, datasets usually contain more healthy patients than sick patients, or in spam detection there are often more normal emails than spam emails.
In such cases, simply looking at the accuracy of the classifier is not a good evaluation metric, as it can be deceptive. Assuming a data set contains 70 % negative instances, a model that always classifies all instances as negative already achieves an accuracy of 70 %. Although the ROC curve gives a more honest picture here, it can still be deceptive: when negative instances are abundant, even a large absolute number of false positives yields a small false positive rate, which makes the curve look overly optimistic.
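The accuracy trap from the paragraph above is easy to reproduce with toy data:

```python
# An always-negative model on a 70/30 imbalanced dataset.
y_true = [0] * 7 + [1] * 3   # 70% negative, 30% positive
y_pred = [0] * 10            # model predicts "negative" for everything
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.7
```

Despite never detecting a single positive instance, the model scores 70 % accuracy, which is why accuracy alone is a poor yardstick here.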
In applications with strongly unbalanced data sets, the precision-recall curve should therefore be used, as it concentrates exclusively on the positive class, i.e. the smaller class, and is therefore less distorted by the abundance of negative instances than the ROC curve.
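Precision and recall at a given threshold can be sketched as follows; the imbalanced toy data (2 positives, 8 negatives) and the function name are illustrative assumptions, and scikit-learn's `precision_recall_curve` computes the full curve:

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall of the positive class at one threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Imbalanced toy data: 2 positive instances, 8 negative instances.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1]
print(precision_recall(y_true, scores, 0.5))  # (0.5, 0.5)
```

Neither metric involves the true negatives, so the large negative class cannot mask poor performance on the minority class.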
This is what you should take with you
- The so-called Receiver Operating Characteristic (ROC) curve serves as a graphical representation of the performance of a binary classifier.
- For this purpose, the true positive rate and the false positive rate are plotted in a two-dimensional diagram. A curve is created by plotting different threshold values.
- The threshold values determine how high the predicted probability of the model must be for an instance to be recognized as positive.
- The shape of the curve provides information about the performance. The graph of a very good model runs close to the upper left point of the diagram.
- In addition, the area under the curve (AUC) is calculated, which is a key figure for comparing different models.
- The ROC curve is originally only defined for binary classifications, but can also be extended to multi-class problems using the one-versus-all approach.
- The AUC value can be used as an additional assessment metric for evaluating a classification model.