Which metrics are used to evaluate a binary classification model's performance?

This article explains the different metrics used to evaluate a binary classification model's performance and identifies the best metrics to do so.

Binary classification models classify each observation in a dataset into one of two categories. Once the classification task is completed, the results need to be evaluated to assess the model's performance. Based on the characteristics of the dataset, the AI & Analytics Engine suggests the most suitable metric as the “Prediction Quality” for that purpose. Many other metrics are also available in the AI & Analytics Engine to easily compare and evaluate trained binary classification models. This page introduces the binary classification metrics available in the Engine.

Binary classification metrics available in the Engine

Binary classification models typically produce a decision score (most often a prediction probability), and a threshold applied to this score determines the predicted class. Accordingly, binary classification metrics fall into two categories:

  • Metrics that depend on a threshold applied to the decision score, and

  • Metrics that are independent of a threshold.

These metrics require the positive class label to be specified.
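
As an illustration of this thresholding step, here is a minimal sketch (assuming Python with NumPy; the scores are hypothetical and this is not Engine code):

```python
import numpy as np

# Hypothetical decision scores (predicted probabilities of the positive class)
probabilities = np.array([0.10, 0.45, 0.62, 0.90])
threshold = 0.5

# Observations with a score at or above the threshold get the positive label
predicted_classes = np.where(probabilities >= threshold, "positive", "negative")
print(predicted_classes)  # ['negative' 'negative' 'positive' 'positive']
```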

Threshold-dependent metrics

These are metrics that require a threshold on the decision score to determine the predicted class label. The predicted class labels are then compared with the actual class labels to calculate the metrics.

True positive count (TP)

This is the number of observations where both the actual class and predicted class from the model are positive.

True negative count (TN)

This is the number of observations where both the actual class and predicted class from the model are negative.

False-positive count (FP)

This is the number of observations where the predicted class is positive, but the actual class is negative.

False-negative count (FN)

This is the number of observations where the predicted class is negative, but the actual class is positive.
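
The four counts above can be read off a confusion matrix. As a rough illustration (a sketch assuming Python with scikit-learn and hypothetical labels, not Engine code):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual classes (1 = positive)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # predicted classes at the chosen threshold

# With labels=[0, 1], the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tp, tn, fp, fn)  # 3 3 1 1
```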

Precision

This calculates the proportion of observations predicted as positive by the model that are actually positive.

Recall

This calculates the proportion of actual positive observations that are correctly predicted as positive by the model.

F1 score

This takes into account both the precision and recall metrics by calculating their harmonic mean.

False-positive rate

This calculates the proportion of actual negative class observations that are incorrectly predicted by the model as positive.
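
Continuing the hypothetical example above, the rate metrics can be computed as follows (a sketch assuming scikit-learn, not Engine code):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

fp, tn = 1, 3                                # counts from the confusion matrix above
false_positive_rate = fp / (fp + tn)         # FP / (FP + TN) = 1/4

print(precision, recall, f1, false_positive_rate)  # 0.75 0.75 0.75 0.25
```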

 

Specifically for the False-positive count, False-negative count, and False-positive rate, the lower the value, the better the model. For all the other metrics, a higher value indicates better model performance. Precision, Recall, F1 score, and the False-positive rate range from 0 to 1, while the counts range from 0 up to the number of observations.

Threshold-independent metrics

These metrics try to capture the model's performance across all possible decision score thresholds. Hence, their values are independent of any particular threshold. There are three such metrics available in the Engine:

1. Area Under the Receiver Operating Characteristic Curve (AUC ROC)

The ROC curve is a plot of the True positive rate (which is the same as recall) against the False positive rate. The AUC ROC metric measures the area under this curve.

AUC ROC is the best measure to use when the positive class is not a minority. In that case, it is used as a proxy for Prediction Quality on the Model Leaderboard page.
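
For illustration only, AUC ROC can be computed from decision scores as follows (a sketch assuming scikit-learn and hypothetical data, not Engine code):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # actual classes (1 = positive)
y_score = [0.1, 0.4, 0.35, 0.8]  # hypothetical decision scores

print(roc_auc_score(y_true, y_score))  # 0.75
```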

2. Average Precision score (adjusted)

The precision-recall (PR) curve is a plot of precision as a function of recall.

Similar to AUC ROC, AUC PR is the area under the PR curve. It would be the metric of choice; however, we do not use it directly because, for high enough threshold values, precision might not be defined* and there is no “area” beneath undefined precision values.

Hence, to allow comparison of the area under the curve between models that have a fully defined curve and those that do not, we define the Average Precision score (adjusted).

It is the scaled area under the PR curve: the area under all the defined points of the PR curve, multiplied by a scaling factor that corrects for the “missing area” beneath the undefined points, if any.

If the curve is fully defined, the AP (adjusted) is identical to the AUC PR.

 

Average Precision score (adjusted) is the best measure to use when the positive class is the minority. In that case, it is used as a proxy for Prediction Quality on the Model Leaderboard page.
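
For readers who want a hands-on feel, scikit-learn's standard average_precision_score summarises the PR curve in a closely related way; note that it is not the Engine's adjusted variant described above. A minimal sketch with hypothetical data:

```python
from sklearn.metrics import average_precision_score

y_true = [0, 0, 1, 1]            # actual classes (1 = positive)
y_score = [0.1, 0.4, 0.35, 0.8]  # hypothetical decision scores

# Standard (unadjusted) average precision over the PR curve
print(average_precision_score(y_true, y_score))  # ~0.83
```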

 

3. Log loss

This metric is the negative of the average logarithm of the predicted probability assigned to each observation's actual class.
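
As a rough illustration (a sketch assuming scikit-learn and hypothetical probabilities, not Engine code), log loss can be computed directly from this definition and cross-checked against scikit-learn:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])              # actual classes (1 = positive)
p_positive = np.array([0.9, 0.2, 0.7, 0.6])  # hypothetical predicted P(positive)

# Probability the model assigned to each observation's actual class
p_actual = np.where(y_true == 1, p_positive, 1 - p_positive)
manual = -np.mean(np.log(p_actual))

print(manual, log_loss(y_true, p_positive))  # both ~0.299
```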

A higher value of AUC ROC or AUC PR indicates a better model, while a lower log loss value corresponds to a better model. AUC ROC and AUC PR range from 0 to 1, while log loss values can be anywhere from 0 to infinity.

Selecting the most suitable binary classification metrics


Selecting the correct metric for a particular binary classification task is crucial. The correct metric typically depends on factors such as the problem the user is trying to solve, and the distribution of class labels.

For example, when the positive class of the binary classification task is the minority, the Average Precision score (adjusted) is considered a good metric. Medical diagnosis and fraud detection are practical examples of such scenarios. On the other hand, if the positive class is not the minority, AUC ROC is preferred.

Recall and precision by themselves may not be good choices. A model can achieve a perfect recall of 1.0 by always predicting the positive class without looking at the input, as the sketch below illustrates. Likewise, a model can achieve near-perfect precision by predicting the positive class only for the narrow range of inputs where it is almost certain, and the negative class everywhere else. The F1 score guards against both behaviours by taking both precision and recall into account.
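
A small sketch of that degenerate “always predict positive” model (assuming scikit-learn and hypothetical labels, not Engine code):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 0, 0, 0, 0, 0, 1]  # hypothetical labels where positives are rare
y_always_pos = [1] * len(y_true)   # a "model" that ignores its input entirely

print(recall_score(y_true, y_always_pos))     # 1.0  -- perfect recall
print(precision_score(y_true, y_always_pos))  # 0.25 -- poor precision
print(f1_score(y_true, y_always_pos))         # 0.4  -- F1 exposes the trade-off
```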

Threshold-dependent metrics depend on a chosen decision score threshold and are only relevant for that particular threshold; if the threshold is changed, the metric values change as well. Threshold-independent metrics, on the other hand, measure the model's performance regardless of any threshold: they evaluate how well the model “separates” the two classes in its predictions. As such, they can be better indicators of a model's performance than threshold-dependent metrics in the majority of binary classification tasks.

The AI & Analytics Engine suggests the most suitable metric for each binary classification task based on the characteristics of the dataset. The most suitable metric is shown as the “Prediction Quality” on the “Model Leaderboard” page of supervised ML models. However, users who prefer other metrics can view them under the detailed report for each model by following the instructions here.

*Precision is defined as TP / (TP + FP). For a high enough threshold, the model might not capture any true positives (TP = 0), and, since the threshold is so high, it may not produce any false positives either (FP = 0). This leads to Precision = 0 / 0, which is undefined.