What is the right approach for comparing and evaluating trained models?

ML models must first be trained on a training set and then evaluated on a separate test set to measure their effectiveness.

NOTE: This applies specifically to classification and regression problems.

Before building a machine learning model, we first need to separate the dataset into a training set and a test set.

The model is trained on the training set, and the test set should be “unseen” by the modelling algorithm.

Once a machine learning model has been built on the training set, it is evaluated on the unseen test set to quantify its effectiveness.

This is the standard way to measure a model’s performance, because it estimates whether the model can make accurate predictions on data beyond what it saw during training.
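
The workflow described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the synthetic data, the helper names (train_test_split, fit_line, mean_squared_error) and the 25% test fraction are all assumptions made for the example, not part of any particular tool.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(train):
    """Least-squares fit of y = a*x + b, using only the training rows."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in train)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

def mean_squared_error(model, rows):
    """Average squared prediction error of the fitted line on the given rows."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in rows) / len(rows)

# Synthetic regression data (an assumption for illustration): y = 2x + 1 plus noise.
rng = random.Random(42)
data = [(float(x), 2.0 * x + 1.0 + rng.gauss(0, 0.5)) for x in range(40)]

# 1. Separate the dataset before any model is built.
train, test = train_test_split(data)

# 2. Build the model on the training set only.
model = fit_line(train)

# 3. Quantify effectiveness on the test set the model has never seen.
test_mse = mean_squared_error(model, test)
print(f"train size={len(train)}, test size={len(test)}, test MSE={test_mse:.3f}")
```

The key design point is that fit_line never receives the test rows, so the reported error reflects predictions on data the model has not seen.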

The same applies when comparing multiple models or different versions of the same model. Their evaluation results can be compared only if:

The evaluation results are computed for the same test set, and
None of the models or model versions has “seen” any part of this test set during training.
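
The two conditions above can be illustrated with a small sketch. It assumes synthetic data and two hypothetical candidate models (a mean baseline and a least-squares line); the names fit_mean, fit_line and mse are invented for this example. The split is made once, and every candidate is trained on the same training rows and scored on the same test rows.

```python
import random

# Synthetic regression data (an assumption for illustration): y = 3x plus noise.
rng = random.Random(7)
data = [(float(x), 3.0 * x + rng.gauss(0, 1.0)) for x in range(50)]

# Split ONCE, so every candidate shares exactly the same test set.
rng.shuffle(data)
test, train = data[:10], data[10:]

def fit_mean(rows):
    """Baseline candidate: always predict the mean training target."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def fit_line(rows):
    """Second candidate: least-squares line y = a*x + b."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def mse(predict, rows):
    """Mean squared error of a prediction function on the given rows."""
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)

candidates = {"mean_baseline": fit_mean, "linear": fit_line}

# Each candidate is fitted on the training rows only and scored on the shared test rows.
scores = {name: mse(fit(train), test) for name, fit in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: test MSE = {score:.2f}")
print("best on the shared test set:", best)
```

Because both scores come from the same unseen test rows, the difference between them can be attributed to the models rather than to differences in the evaluation data.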