This article explains the metrics used to evaluate a multiclass classification model's performance and identifies the most suitable ones.
Multiclass classification models classify each observation in a dataset into one of many categories. Once trained, these models must be evaluated for their performance. The AI & Analytics Engine suggests the most suitable metric for this purpose as Prediction Quality, and also provides many other metrics to easily compare and evaluate trained multiclass classification models. This article introduces the multiclass classification metrics available in the Engine.
Tip: For more information on prediction quality, and the metrics used for it, see this article.
How multiclass classification metrics are calculated
A multiclass classification model typically produces a decision score for each class (for most models, this is the predicted probability of the class). The class with the highest decision score is taken as the predicted class.
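As an illustration, here is a minimal sketch of how the predicted class is derived from per-class decision scores. The class names, probability values, and use of NumPy are assumptions made for the example only; they are not output from the Engine.

```python
import numpy as np

# Hypothetical decision scores (class probabilities) from a 3-class model:
# one row per observation, columns in the order [A, B, C].
classes = np.array(["A", "B", "C"])
scores = np.array([
    [0.70, 0.20, 0.10],
    [0.15, 0.60, 0.25],
    [0.30, 0.30, 0.40],
])

# The predicted class for each observation is the class with the
# highest decision score.
predicted = classes[np.argmax(scores, axis=1)]
print(predicted)  # ['A' 'B' 'C']
```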
To calculate evaluation metrics for a multiclass classification problem, the Engine first breaks it into multiple binary classification problems using the “One-vs-Rest (OVR)” strategy. In OVR, each class is treated in turn as the positive class, all other classes are combined into a single negative class, and the actual classes, predicted classes, and corresponding decision scores are recomputed accordingly. This essentially creates a binary classification output for each class, converting the multiclass output into as many binary classification outputs as there are classes.
See the example below, where the output of a multiclass classification task with 3 classes (A, B, and C) is converted into three binary classification outputs using the OVR strategy.



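To make the OVR conversion concrete, here is a minimal sketch using scikit-learn's label_binarize with made-up labels. The data and the choice of library are assumptions for illustration only and do not reflect the Engine's internal implementation.

```python
import numpy as np
from sklearn.preprocessing import label_binarize

# Hypothetical actual and predicted labels for a 3-class problem.
classes = ["A", "B", "C"]
y_true = np.array(["A", "B", "C", "A", "B", "C"])
y_pred = np.array(["A", "C", "C", "A", "A", "B"])

# One-vs-Rest: each column is a binary problem in which the given class
# is "positive" (1) and all remaining classes together are "negative" (0).
y_true_ovr = label_binarize(y_true, classes=classes)
y_pred_ovr = label_binarize(y_pred, classes=classes)

for i, cls in enumerate(classes):
    print(f"{cls} vs rest - actual: {y_true_ovr[:, i]}, predicted: {y_pred_ovr[:, i]}")
```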
After forming these binary classification outputs, the Engine calculates the following binary classification metrics for each of them:
Precision
- This calculates the proportion of observations predicted by the model as positive that are actually positive.
Recall
- This calculates the proportion of actual positive observations that are correctly predicted by the model as positive.
F1 score
- This takes both precision and recall into account by calculating their harmonic mean.
False-positive rate
- This calculates the proportion of actual negative class observations that are incorrectly predicted by the model as positive.
Area under the Receiver Operating Characteristic Curve (AUC ROC)
- An ROC curve is a plot of the true positive rate (equivalent to recall) against the false positive rate. The AUC ROC metric measures the area under this curve. AUC ROC is the preferred measure when the positive class is not the minority; hence, it is used as a proxy for Prediction Quality on the Model Leaderboard page.
Area under the Precision-Recall Curve (AUC PR)
- A PR curve is a plot of precision against recall. The AUC PR metric measures the area under this curve. AUC PR is the preferred measure when the positive class is the minority; hence, it is used as a proxy for Prediction Quality on the Model Leaderboard page.
For the false-positive rate, a lower value indicates better model performance; for all other metrics, a higher value is better. All of these metrics range from 0 to 1. The sketch below shows how they can be computed for a single class's binary output.
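For illustration, here is a minimal sketch of computing these binary metrics for one class's OVR output (e.g. "A vs rest"). The labels, scores, and use of scikit-learn are assumptions for the example and are not the Engine's own implementation.

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score,
                             confusion_matrix)

# Hypothetical "A vs rest" binary output: 1 = class A, 0 = any other class.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])
score  = np.array([0.9, 0.2, 0.6, 0.8, 0.1, 0.4, 0.3, 0.2])  # decision scores for class A

# Counts needed for the false-positive rate.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print("Precision:", precision_score(y_true, y_pred))        # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))            # TP / (TP + FN)
print("F1 score: ", f1_score(y_true, y_pred))                # harmonic mean of precision and recall
print("FPR:      ", fp / (fp + tn))                          # FP / (FP + TN)
print("AUC ROC:  ", roc_auc_score(y_true, score))            # computed from decision scores
print("AUC PR:   ", average_precision_score(y_true, score))  # estimate of the area under the PR curve
```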
Calculating a multiclass classification metric
For each metric listed above, the metric values from the binary classification outputs of all classes are combined to produce a single multiclass classification metric. There are two ways of combining them:
- Macro average - Calculates a simple average of the binary classification metric values of all classes.
- Weighted average - Calculates the average of the binary classification metric values of all classes, weighted by the number of actual instances of each class (its support).
Typically, the macro-averaged F1 score is the most suitable measure for multiclass classification. Therefore, it is used as a proxy for Prediction Quality on the Model Leaderboard page.
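The following minimal sketch contrasts the two averaging strategies using scikit-learn's f1_score on made-up labels. The data and library choice are assumptions for illustration, not the Engine's implementation.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical multiclass labels with unequal class support.
y_true = np.array(["A", "A", "A", "A", "B", "B", "C"])
y_pred = np.array(["A", "A", "B", "A", "B", "C", "C"])

# Per-class F1 scores from the OVR binary outputs.
print("Per-class F1:", f1_score(y_true, y_pred, average=None, labels=["A", "B", "C"]))

# Macro average: a simple mean of the per-class F1 scores.
print("Macro F1:    ", f1_score(y_true, y_pred, average="macro"))

# Weighted average: per-class F1 scores weighted by each class's support
# (the number of actual instances of that class).
print("Weighted F1: ", f1_score(y_true, y_pred, average="weighted"))
```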
In addition, the Engine calculates and displays the following two metrics, which can be computed directly from the multiclass output without the OVR strategy (see the sketch after this list):
- Log loss
- Accuracy score
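Here is a minimal sketch of both metrics computed directly from a multiclass output, again using scikit-learn with made-up labels and probabilities; these choices are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import log_loss, accuracy_score

# Hypothetical multiclass output: actual labels and the model's predicted
# probabilities for the classes [A, B, C].
classes = ["A", "B", "C"]
y_true = np.array(["A", "B", "C", "A"])
proba = np.array([
    [0.8, 0.1, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
    [0.4, 0.4, 0.2],
])
y_pred = np.array(classes)[np.argmax(proba, axis=1)]

# Both metrics are computed directly from the multiclass output, with no
# OVR conversion: log loss from the predicted probabilities, accuracy
# from the hard predicted labels.
print("Log loss:", log_loss(y_true, proba, labels=classes))
print("Accuracy:", accuracy_score(y_true, y_pred))
```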
Selecting the most suitable multiclass classification metrics
Identifying the correct metric to evaluate multiclass classification models is critical for inspecting model performance. The choice typically depends on the problem context, which the user knows in advance. Here are a few tips to guide this decision.
For example, unlike the binary case, where precision or recall alone may not be the best option for performance evaluation, the macro-averaged and weighted precision and recall scores can by themselves be good choices for multiclass classification.
The AI & Analytics Engine also suggests the most suitable metric for multiclass classification as the Prediction Quality on the Model Leaderboard page for supervised ML models. However, users who prefer other metrics, including log loss, accuracy score, and the macro and weighted metrics, can view them on the “Performance” page of each model.
Tip: Find out more about how to evaluate trained models in this article.