Which metrics are used to evaluate a binary classification model's performance?

This article explains the different metrics used to evaluate a binary classification model's performance and identifies the best metrics to do so.

Binary classification models classify each observation in a dataset into one of two categories. Once a model is trained, its results need to be evaluated to assess its performance. Based on the characteristics of the dataset, the AI & Analytics Engine suggests the most suitable metric for this purpose and displays it as the Prediction Quality. Many other metrics are also available in the Engine to easily compare and evaluate trained binary classification models. In this article, we introduce the binary classification metrics available in the Engine.

Tip: For more information on Prediction Quality, and the metrics used to calculate it, see this article.

Binary classification metrics available in the Engine

Binary classification models typically produce a decision score (most models produce a prediction probability as the decision score), and a threshold on this score is used to determine the predicted class. As such, binary classification metrics fall into two categories:

  • Metrics that depend on a decision-score threshold, and

  • Metrics that are independent of a threshold.

These metrics require the positive class label to be specified.

Threshold-dependent metrics

These are metrics that require a threshold on the decision score to determine the predicted class labels. The predicted class labels are then compared with the actual class labels to calculate the metrics.
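To make this concrete, below is a minimal sketch in Python (with hypothetical scores; this is not the Engine's internal code) of how a threshold converts decision scores into predicted class labels:

```python
# A minimal sketch, assuming the decision scores are predicted
# probabilities of the positive class (hypothetical values).
import numpy as np

y_scores = np.array([0.12, 0.85, 0.47, 0.91, 0.33])
threshold = 0.5

# Observations scoring at or above the threshold are labelled positive (1).
y_pred = (y_scores >= threshold).astype(int)
print(y_pred)  # [0 1 0 1 0]
```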

True positive count (TP)

  • This is the number of observations where both the actual class and predicted class from the model are positive.

True negative count (TN)

  • This is the number of observations where both the actual class and predicted class from the model are negative.

False-positive count (FP)

  • This is the number of observations where the predicted class is positive but the actual class is negative.

False-negative count (FN)

  • This is the number of observations where the predicted class is negative but the actual class is positive.
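
The four counts above are simply the entries of the confusion matrix. As a sketch, they can be computed with scikit-learn (used here purely for illustration, with hypothetical labels; this is not necessarily what the Engine uses internally), taking 1 as the positive class label:

```python
# Illustrative only: hypothetical labels, with 1 as the positive class.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual class labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # predicted class labels at a chosen threshold

# With labels=[0, 1], confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tp, tn, fp, fn)  # 3 3 1 1
```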

Precision

  • This calculates the proportion of observations predicted as positive by the model that are actually positive.

Recall

  • This calculates the proportion of actual positive observations that the model correctly predicts as positive.

F1 score

  • This takes both precision and recall into account by calculating their harmonic mean.

False-positive rate

  • This calculates the proportion of actual negative class observations that are incorrectly predicted by the model as positive.
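
Continuing the hypothetical example above, the ratio metrics can be computed from the same predictions. In this sketch (not Engine code), scikit-learn provides precision, recall, and F1 score directly, while the false-positive rate is derived from the counts:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical actual labels (1 = positive class)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical predicted labels

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4 = 0.75
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/4 = 0.75
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall = 0.75

# The false-positive rate has no direct scikit-learn helper; derive it from the counts.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
fpr = fp / (fp + tn)                         # FP / (FP + TN) = 1/4 = 0.25
print(precision, recall, f1, fpr)
```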

 

For the False-positive count, False-negative count, and False-positive rate, a lower value indicates a better model. For all the other metrics, a higher value indicates better model performance. Precision, recall, F1 score, and the false-positive rate range from 0 to 1, while the four counts range from 0 up to the number of observations in the evaluated dataset.

Threshold-independent metrics

These metrics capture the model’s performance across all possible decision-score thresholds, so their values do not depend on any particular threshold. There are three such metrics available in the Engine.

Area Under the Receiver Operating Characteristic Curve (AUC ROC)

  • The ROC curve is a plot of the true positive rate (equivalent to recall) against the false positive rate across all thresholds. AUC ROC measures the area under this curve. It is the best measure to use when the positive class is not the minority, in which case it is used as a proxy for Prediction Quality on the Model Leaderboard page.
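
As a sketch (with hypothetical data; not the Engine's internal code), AUC ROC is computed from the decision scores themselves rather than from thresholded labels:

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # hypothetical actual labels
y_scores = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1]  # hypothetical predicted probabilities

print(roc_auc_score(y_true, y_scores))  # 0.9375
```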

Area under the Precision-Recall Curve (AUC PR)

  • As the name suggests, a PR curve is a plot of precision against recall across all thresholds. AUC PR measures the area under this curve. It is the best measure to use when the positive class is the minority, in which case it is used as a proxy for Prediction Quality on the Model Leaderboard page.
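
A sketch with the same hypothetical data follows. Note that scikit-learn's average_precision_score is one common way to summarise the precision-recall curve; the Engine's exact computation of AUC PR may differ slightly (for example, trapezoidal integration of the curve).

```python
from sklearn.metrics import average_precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # hypothetical actual labels
y_scores = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1]  # hypothetical predicted probabilities

print(average_precision_score(y_true, y_scores))  # 0.95
```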

Log loss

  • This metric is the negative of the average logarithm of the predicted probability assigned to each observation’s actual class.
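
A sketch with hypothetical probabilities, writing the definition out directly next to scikit-learn's log_loss (not Engine code):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 0])
p_pos = np.array([0.9, 0.2, 0.6, 0.4])  # hypothetical predicted probabilities of the positive class

# Probability the model assigned to each observation's actual class.
p_actual = np.where(y_true == 1, p_pos, 1 - p_pos)
manual = -np.mean(np.log(p_actual))

print(manual, log_loss(y_true, p_pos))  # both ~= 0.338
```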

A higher AUC ROC or AUC PR value indicates a better model, while a lower log loss value indicates a better model. AUC ROC and AUC PR range from 0 to 1, while log loss can take any value from 0 to infinity.

Selecting the most suitable binary classification metrics

Selecting the correct metric for a particular binary classification task is crucial. The correct metric typically depends on factors such as the problem the user is trying to solve and the distribution of class labels.

For example, when the positive class of the binary classification task is the minority, AUC PR is considered a good metric; medical diagnosis and fraud detection are practical examples of such scenarios. On the other hand, if the positive class is not the minority, AUC ROC is preferred.

Recall and precision by themselves may not be good choices. A model can achieve a perfect recall of 1.0 simply by predicting the positive class for every observation, without looking at the input, as the sketch below illustrates. Likewise, a model can achieve very high precision by predicting the positive class only for the few observations where it is almost 100% sure, and the negative class everywhere else. The F1 score guards against both behaviours by taking precision and recall into account together.
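
The recall pitfall is easy to demonstrate with a sketch (hypothetical data, imbalanced towards the negative class; not Engine code):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 0, 0, 0, 0, 0, 1]          # hypothetical labels, mostly negative
y_pred_always_positive = [1] * len(y_true)  # a "model" that always predicts positive

print(recall_score(y_true, y_pred_always_positive))     # 1.0 (perfect recall)
print(precision_score(y_true, y_pred_always_positive))  # 0.25 (poor precision)
```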

Threshold-dependent metrics are computed at a chosen decision-score threshold and are only relevant for that particular threshold; if the threshold changes, the metric values change as well. Threshold-independent metrics, on the other hand, measure the model’s performance regardless of the threshold: they evaluate how well the model “separates” the two classes in its predictions. As such, they can be better indicators of a model’s performance than threshold-dependent metrics in the majority of binary classification tasks.

The AI & Analytics Engine suggests the most suitable metric for each binary classification task based on the characteristics of the dataset. This metric is shown as the Prediction Quality on the Model Leaderboard page for supervised ML models.

Tip: Users who prefer other metrics can view them in the detailed report for each model, by following the instructions on evaluating trained models in this article.