Train/test split is an important concept in the training and evaluation of classification and regression models. When done correctly, it gives an accurate estimate of the model’s prediction quality on future data.
Train/test split is an important step in the application of Machine Learning
The train/test split is used in classification and regression problems, where a categorical or continuous column (variable) is predicted from the other columns in the dataset. The dataset is split into two portions, called the “train” and “test” portions.
The train portion is presented to the model as labelled data at training time, so that the model can learn from its examples to make future predictions. Once the model is trained, it is asked to predict the target column for the test portion using only the values in the other columns. Comparing these predictions with the actual values, which were withheld from the model, shows how well it is likely to perform on unseen data.
A train/test split ratio defines how much of the data is used for training and how much is held back for testing. It is usually expressed as a percentage. For example, 80% of the data for training implies an 80-20 split, where the remaining 20% of the data forms the test portion.
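As a rough illustration of a split ratio, the sketch below shuffles a list of rows and cuts it at the 80% mark. The function name `train_test_split_rows`, its parameters, and the fixed seed are assumptions made for this example; this is not how the Engine performs its split internally.

```python
import random

def train_test_split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle rows and split them into (train, test) portions.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

rows = list(range(100))  # stand-in for 100 dataset rows
train, test = train_test_split_rows(rows, train_fraction=0.8)
print(len(train), len(test))  # prints "80 20"
```

Shuffling before cutting matters when the rows are stored in some meaningful order, because otherwise the test portion may not resemble the train portion.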
The train/test split on the AI & Analytics Engine
On the AI & Analytics Engine, the train/test split is performed automatically at the time of app creation for apps that predict a column, i.e., classification and regression apps. The Engine ensures that the split follows the best practices recommended in the field of Data Science and Machine Learning.
The train/test split ratio can only be configured during the app-creation step
This is done to keep the train and test portions the same for all machine learning models trained in the app. Models must be evaluated on the same test portion (also known as the hold-out or validation portion) of the data so that they can be compared with each other and ranked on a leaderboard.
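To see why a shared test portion matters, here is a minimal sketch that trains two toy regression models on the same train portion and ranks them by their error on the same test portion. The synthetic data, the two models (`fit_mean`, `fit_line`), and the choice of mean absolute error are all assumptions made for illustration; the Engine's own models and metrics are not shown here.

```python
import random

def fit_mean(train):
    """Baseline: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Least-squares straight line through the training points."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mean_absolute_error(model, rows):
    """Average absolute gap between predictions and true targets."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)

# Synthetic (feature, target) rows: a noisy straight line (assumed data).
rng = random.Random(0)
data = [(x, 2.0 * x + 1.0 + rng.uniform(-1.0, 1.0)) for x in range(50)]
rng.shuffle(data)
train, test = data[:40], data[40:]  # one fixed 80-20 split shared by all models

models = {"mean_baseline": fit_mean(train), "linear_fit": fit_line(train)}

# Every model is scored on the same test rows, so the errors are comparable.
leaderboard = sorted(
    ((name, mean_absolute_error(model, test)) for name, model in models.items()),
    key=lambda item: item[1],
)
for name, error in leaderboard:
    print(f"{name}: MAE = {error:.3f}")
```

If each model were scored on a different test portion, a difference in error could come from the rows rather than from the models, and the ranking would not be meaningful.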
💡See this page to learn how to configure the train/test split for classification and regression apps.