
Evaluating machine learning models' performance in the Engine

This article explains how to evaluate machine learning model performance and describes the metrics and plots available in the AI & Analytics Engine.

What is model evaluation?

Model evaluation is the process of using metrics and plots to quantify and visualize the performance of a machine learning model. The metrics and plots are typically derived by feeding the test portion's features to the trained model as prediction input, then comparing the model's predicted target values against the test portion's actual target values.

This process requires the data to be split in two before training the model (a train/test split). The training portion is used to train the model, while the test portion is held out for testing. The AI & Analytics Engine automatically creates the train/test split and generates the evaluation metrics and plots. The split can be configured during the app creation phase.

For more information on train/test split, see: What is the train/test split for classification and regression apps?
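As a minimal sketch of this process, assuming scikit-learn for illustration (the Engine performs the split, training, and evaluation automatically, and the dataset and model here are hypothetical):

```python
# Minimal sketch of the split -> train -> evaluate loop described above.
# Illustrative only; the Engine automates all of these steps.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold a test portion out before training; 80/20 is a common split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate by comparing the predicted target values for the test-portion
# features against the test portion's actual target values.
y_pred = model.predict(X_test)
print("Accuracy on held-out test portion:", accuracy_score(y_test, y_pred))
```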

Diagram illustrating the train/test split and the relationship between the predicted target column and the actual target column as input to the model evaluation

Model evaluation metrics and plots indicate how well the model makes predictions. When used as part of the ML development process, they can guide model optimization and help anticipate real-world model performance, ensuring alignment with business objectives.

Evaluating machine learning model performance in the Engine

Within the Engine, model evaluation metrics and plots are automatically generated after a model has been trained.

Basic model performance evaluation

Basic evaluation metrics are quickly accessible from the Model Leaderboard page. These include prediction quality and training time. Each model shows a prediction quality score ranging from 0% to 100%.

Models' basic evaluation metrics from the leaderboard page.

Prediction quality is calculated differently for each machine learning problem type: regression, binary classification, and multi-class classification.
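The exact formula behind the prediction quality score is not documented here; as an illustrative assumption only, one common convention is to rescale a headline metric for each problem type onto a single 0-100% range:

```python
# Illustrative assumption only: this is NOT the Engine's documented
# formula, just one common way to map problem-specific headline
# metrics onto a single 0-100% quality scale.
def prediction_quality(problem_type: str, metric_value: float) -> float:
    """Rescale a headline metric (e.g. R2 or F1) to a 0-100% score."""
    if problem_type == "regression":
        # R2 can be negative; clip at zero so the score stays in range
        return 100 * max(0.0, metric_value)
    # Classification metrics such as accuracy or F1 already lie in [0, 1]
    return 100 * metric_value

print(prediction_quality("regression", 0.82))             # 82.0
print(prediction_quality("binary classification", 0.91))  # 91.0
```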

Advanced model performance evaluation

The Engine provides several advanced evaluation metrics and plots. Which ones are displayed depends on whether the ML problem type is regression, binary classification, or multi-class classification.

To access the advanced metrics and plots, navigate to the Model Leaderboard page and select a model.

Model Leaderboard page after the model has been trained.

Within the model's “Insights” tab, the “Performance” tab contains the advanced evaluation metrics and plots. This view displays the full list of evaluation metrics and the plots relevant to the problem type.

Model Insights tab, viewing the Model Performance tab

What metrics and plots are available for regression?

Basic metrics

  • Prediction quality

  • Prediction Error

  • Percentage Error

Additional metrics

  • R2 Score

  • Explained Variance

  • RMSE (Root Mean Squared Error)

  • MAE (Mean Absolute Error)

  • MedABE (Median Absolute Error)

  • MAPE (Mean Absolute Percentage Error)

  • MSLE (Mean Squared Logarithmic Error)

For a detailed overview of regression metrics and their meaning, see: Which metrics are used to evaluate a regression model's performance?
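As a minimal sketch, assuming scikit-learn (not necessarily what the Engine uses internally), the additional metrics listed above can be computed from the actual and predicted target columns, with y_true and y_pred as hypothetical stand-ins:

```python
# Illustrative computation of the regression metrics listed above.
import numpy as np
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    mean_squared_log_error,
    median_absolute_error,
    r2_score,
)

# Hypothetical actual and predicted target columns from a test portion
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.5])

print("R2 score:          ", r2_score(y_true, y_pred))
print("Explained variance:", explained_variance_score(y_true, y_pred))
print("RMSE:              ", mean_squared_error(y_true, y_pred) ** 0.5)
print("MAE:               ", mean_absolute_error(y_true, y_pred))
print("MedABE:            ", median_absolute_error(y_true, y_pred))
print("MAPE:              ", mean_absolute_percentage_error(y_true, y_pred))
print("MSLE:              ", mean_squared_log_error(y_true, y_pred))
```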

Plots

  • Predicted vs. actual values

  • Residuals vs. predicted values

  • Residuals distribution

What metrics and plots are available for binary classification?

Metrics

  • Precision

  • Recall

  • F1 Score

  • FPR (False Positive Rate)

  • AUC-ROC (Area Under Curve - Receiver Operating Characteristic)

For a detailed overview of binary classification metrics and their meaning, see: Which metrics are used to evaluate a binary classification model's performance?

Plots

  • Precision-recall curve

  • ROC curve

  • Confusion matrix
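As a hedged sketch, again assuming scikit-learn with hypothetical y_true, y_score, and y_pred arrays, the metrics above and the curve inputs behind these plots can be computed as follows:

```python
# Illustrative computation of the binary classification metrics and
# plot inputs listed above.
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_recall_curve,
    precision_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)

# Hypothetical actual labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])
y_pred = (y_score >= 0.5).astype(int)  # class labels at a 0.5 threshold

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_true, y_score))

# FPR = FP / (FP + TN), read off the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("FPR:      ", fp / (fp + tn))

# Curve inputs behind the precision-recall and ROC plots
precision, recall, _ = precision_recall_curve(y_true, y_score)
fpr, tpr, _ = roc_curve(y_true, y_score)
```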

What metrics and plots are available for multi-class classification?

Metrics

  • Macro and weighted average of:

    • F1 Score

    • Precision

    • Recall

    • FPR (False Positive Rate)

    • AUC-ROC (Area Under Curve - Receiver Operating Characteristic)

    • Average Precision Score

  • Log-loss Score

  • Accuracy

For a detailed overview of multi-class classification metrics and their meaning, see: Which metrics are used to evaluate a multiclass classification model's performance?

Plots

  • Multi-class confusion matrix
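As with the previous sections, here is a minimal scikit-learn sketch with hypothetical labels and class probabilities showing how the macro and weighted averages, log-loss, accuracy, and multi-class confusion matrix above are commonly computed:

```python
# Illustrative computation of the multi-class metrics listed above.
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    confusion_matrix,
    f1_score,
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.preprocessing import label_binarize

# Hypothetical actual labels, predicted labels, and class probabilities
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1, 0, 1, 1])
# Each row holds the predicted probabilities for classes 0, 1, 2
y_proba = np.array([
    [0.8, 0.1, 0.1], [0.2, 0.3, 0.5], [0.1, 0.2, 0.7], [0.1, 0.1, 0.8],
    [0.2, 0.6, 0.2], [0.7, 0.2, 0.1], [0.2, 0.5, 0.3], [0.1, 0.7, 0.2],
])
y_true_bin = label_binarize(y_true, classes=[0, 1, 2])  # one column per class

for avg in ("macro", "weighted"):
    print(avg, "F1:           ", f1_score(y_true, y_pred, average=avg))
    print(avg, "precision:    ", precision_score(y_true, y_pred, average=avg))
    print(avg, "recall:       ", recall_score(y_true, y_pred, average=avg))
    print(avg, "AUC-ROC:      ", roc_auc_score(y_true, y_proba,
                                               multi_class="ovr", average=avg))
    print(avg, "avg precision:", average_precision_score(y_true_bin, y_proba,
                                                         average=avg))

print("Log-loss:", log_loss(y_true, y_proba))
print("Accuracy:", accuracy_score(y_true, y_pred))
# Per-class FPR can be read off this matrix as FP / (FP + TN) per class
print("Multi-class confusion matrix:\n", confusion_matrix(y_true, y_pred))
```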