What is Prediction Quality and how is it calculated for the different ML problem types?

Prediction quality is a metric appearing in the “Summary” and “Performance” pages of supervised ML models (Regression, Binary Classification, and Multi-class Classification) on the AI & Analytics Engine.

To understand what prediction quality is and how it is computed, it is first essential to know how models are evaluated. The dataset chosen by the user to create the supervised ML app is first split into separate train and test portions. Each model under the app is trained using only the train portion, which guarantees that models do not “see” the test portion during training. The trained model is then evaluated on the test portion: the target column is first removed, and the remaining columns are fed as inputs for the model to make predictions.

This way, the outcomes in the test portion that the model is required to predict are “hidden” from the model. The predictions of the model on the test portion are then compared with the actual values in the target column which the model should have ideally predicted. This process is summarized in the diagram below:

[Diagram: machine learning model prediction quality — evaluating the model’s predictions against the hidden target values of the test portion]
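The following is a minimal sketch of this train-and-evaluate flow using scikit-learn. The synthetic dataset and the choice of estimator are illustrative placeholders, not the Engine’s actual implementation:

```python
# Illustrative sketch of the train/test evaluation flow described above.
# The synthetic data and estimator are placeholders, not the Engine's code.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Stand-in for the dataset chosen when creating the app:
# X holds the input columns, y holds the target column.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Split into train and test portions; the model never "sees" the test portion.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# At evaluation time, only the input columns of the test portion are fed to
# the model; its predictions are then compared with the hidden targets y_test.
predictions = model.predict(X_test)
```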

Prediction quality is one of the metrics computed in the above evaluation process. It measures how close the supervised ML model’s predictions are to the actual values in the target column of the test/evaluation portion. It is a percentage value from 0% to 100%, where 0% indicates that the model’s predictions are completely off and 100% indicates that the model predicts perfectly.

Even though the term Prediction Quality is used for all three problem types (Regression, Binary Classification, and Multi-class Classification), it is computed using a different underlying metric in each case.

Regression

For regression, the prediction quality is measured by the R2 score, also known as the “coefficient of determination” in the statistical community.

The range for the R2 score is 0 to 1, with 0 being the worst and 1 being the best. This score is multiplied by 100 to get a percentage value as the “Prediction Quality”.

Note: The usual method of computing the R2 score can yield a negative value when the model performs worse than a baseline that always predicts the mean of the target. In such cases, the engine automatically clamps it to 0. Hence, a value of “0” is to be understood as an “R2 score of 0 or negative”.
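As a rough sketch of how this could be computed (assumed logic, not the Engine’s code), the R2 score can be clamped at zero and converted to a percentage:

```python
# Assumed sketch: regression prediction quality as a clamped R2 score.
from sklearn.metrics import r2_score

def regression_prediction_quality(y_true, y_pred) -> float:
    """R2 score clamped at 0, expressed as a percentage."""
    r2 = r2_score(y_true, y_pred)
    return max(r2, 0.0) * 100  # an R2 of 0 or negative maps to 0%

print(regression_prediction_quality([3.0, 5.0, 7.0], [2.8, 5.3, 6.9]))  # ~98.3
```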

Tip: To learn more about the R2 score and the other regression metrics, see this article on regression evaluation metrics.

Binary Classification

Binary classification presents two cases, depending on whether or not the positive class is the minority in the test/evaluation dataset:

  • When the positive class is the minority: Prediction quality is measured by the AUC PR (area under the precision-recall curve) metric. The range for this metric is 0 to 1, with higher values meaning that the model is better. The value of this metric is multiplied by 100 to get a percentage value as the “Prediction Quality”.

  • When the positive class is not the minority: Prediction quality is measured by the AUC ROC (area under the ROC curve) metric. For any model at least as good as random guessing, this metric ranges from 0.5 to 1, with higher values meaning that the model is better. The value of this metric is multiplied by 100 to get a percentage value as the “Prediction Quality”.

The positive class is the minority when the number of rows with positive labels in the target column is less than 50% of the total number of rows.
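A minimal sketch of this metric selection (assuming 0/1 labels and predicted probabilities for the positive class; not the Engine’s actual code) could look like:

```python
# Assumed sketch: choosing the underlying metric for binary classification.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def binary_prediction_quality(y_true, y_score) -> float:
    """y_true: 0/1 labels; y_score: predicted probability of the positive class."""
    y_true = np.asarray(y_true)
    if y_true.mean() < 0.5:
        # Positive class is the minority: summarize the precision-recall curve
        # (average precision is scikit-learn's estimate of the area under it).
        auc = average_precision_score(y_true, y_score)
    else:
        # Positive class is not the minority: area under the ROC curve.
        auc = roc_auc_score(y_true, y_score)
    return auc * 100

print(binary_prediction_quality([0, 0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8, 0.7]))
```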

Tip: To learn more about the ROC curve, the precision-recall (PR) curve, the area under curve (AUC) metrics, and other binary classification metrics, see this article on binary classification evaluation metrics.

Multi-class Classification

For multi-class classification, the prediction quality is measured by the macro-averaged F1 score metric. The range for this metric is 0 to 1, with higher values meaning that the model is better. The value of this metric is multiplied by 100 to get a percentage value as the “Prediction Quality”.
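As an illustrative sketch (not the Engine’s code), this corresponds to scikit-learn’s macro-averaged F1 score expressed as a percentage:

```python
# Assumed sketch: multi-class prediction quality from the macro-averaged F1 score.
from sklearn.metrics import f1_score

def multiclass_prediction_quality(y_true, y_pred) -> float:
    """Macro-averaged F1 score expressed as a percentage."""
    return f1_score(y_true, y_pred, average="macro") * 100

print(multiclass_prediction_quality(["cat", "dog", "bird", "cat"],
                                    ["cat", "dog", "bird", "dog"]))
```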

Tip: To learn more about the macro-averaged F1 score and other multi-class evaluation metrics, see this article on multi-class evaluation metrics.