This article explains the typical flow from data to trained and evaluated models on the AI & Analytics Engine.
NOTE: This applies specifically to classification and regression problem types.
The flow from dataset creation to the evaluation of trained models is as follows:
In this article, we will explain the two steps highlighted in dark blue in the above illustration:
ML Preprocessing Pipeline: Converting the input data into numeric data, using only the columns from the feature set selected while creating the model.
Hyperparameter Tuning and Model Fitting: Finding the best combination of hyperparameters for the algorithm the user selected to train the model, and then fitting the final model on the training portion using the best hyperparameters found.
In the Engine, both steps are automated to save time for users. Let us explain these two steps in further detail.
ML Preprocessing Pipeline
This step is necessary because model-training algorithms typically require that all inputs be numeric, that no values be missing, and that the values in all input columns be on roughly the same scale. Real-world datasets rarely satisfy all of these requirements, and the ML preprocessing pipeline bridges this gap. In this step, categorical and text features are converted to numbers using techniques such as One-Hot Encoding and the TF-IDF transformer, missing values are imputed using mean or median values, and all numeric features are transformed so that they fall in roughly the same range.
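
To make these transformations concrete, here is a minimal sketch in plain Python of three of them: mean imputation, scaling, and one-hot encoding. It is an illustration only and not the Engine's actual implementation. The function names (impute_mean, standardize, one_hot, preprocess), the choice of standardization as the scaling method, and the toy data are all assumptions made for this example. Text features (TF-IDF) and missing categorical values are left out to keep it short.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Rescale to zero mean and unit variance (constant columns become all zeros)."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid division by zero for constant columns
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Map each category to a 0/1 indicator vector, one position per category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

def preprocess(rows, numeric_cols, categorical_cols):
    """Turn a list of dict rows into a list of all-numeric feature vectors."""
    columns = []
    for col in numeric_cols:
        filled = impute_mean([row[col] for row in rows])
        columns.append([[v] for v in standardize(filled)])
    for col in categorical_cols:
        columns.append(one_hot([row[col] for row in rows]))
    # Concatenate the per-column blocks row by row.
    return [[x for block in blocks for x in block] for blocks in zip(*columns)]

# Hypothetical toy dataset with one missing numeric value.
rows = [
    {"age": 20, "city": "Paris"},
    {"age": None, "city": "Oslo"},
    {"age": 40, "city": "Paris"},
]
features = preprocess(rows, numeric_cols=["age"], categorical_cols=["city"])
```

After this runs, every row in features is a list of floats: one scaled value for age, followed by one indicator per distinct city. The missing age is filled with the mean of the observed ages before scaling, so the second row has a scaled age of 0.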