You may have noticed that the recipe (data wrangling) action catalogue does not contain certain ML pre-processing actions.
In the Engine, actions such as Imputation, Normalization/Scaling of Numeric Features, One-Hot Encoding, and Vectorization of Text are termed “ML pre-processing actions”.
These actions are deliberately left out of the recipe action catalogue, for the following reasons:
- When a user builds a supervised (classification/regression) ML model, these actions need to run after the train/test split has been made. They are first run on the train set; the summary statistics computed from the train set are stored and then used to transform the test set. Because the test set never influences those statistics, the model can be assessed fairly on its ability to predict beyond the data it has seen.
- The optimal pipeline of pre-processing actions depends on the algorithm chosen. For example, models like Logistic Regression do not allow missing/NA values anywhere in the dataset, while models like LightGBM can handle them. For a Logistic Regression model, the missing/NA values therefore have to be handled by an ML pre-processing action, for example by imputation.
- All ML pre-processing actions are handled automatically: the necessary actions are selected for each model template. The advantages of this approach are:
  - More automation: the data is handled the right way without manual effort, which leads to higher productivity.
  - No need for redundant actions and transformations of the data.
  - Users are freed from low-level considerations that can be handled automatically (such as the need to impute missing values or standardize numeric columns) and can focus on higher-level tasks.
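
The first two reasons above can be illustrated with a minimal Python sketch. It is not the Engine's implementation: the names `fit_stats`, `impute`, `scale`, `TEMPLATE_STEPS`, and `preprocess` are hypothetical, and the choice of mean imputation followed by standardization for Logistic Regression is an assumption made only for illustration. The sketch computes summary statistics from the train set alone, reuses them on the test set, and picks the steps according to the model template.

```python
import statistics

def fit_stats(train_column):
    """Compute summary statistics from the TRAIN column only (None marks a missing value)."""
    observed = [x for x in train_column if x is not None]
    mean = statistics.fmean(observed)
    std = statistics.pstdev(observed) or 1.0  # fall back to 1.0 for a constant column
    return {"mean": mean, "std": std}

def impute(column, stats):
    """Replace missing values with the stored train mean."""
    return [stats["mean"] if x is None else x for x in column]

def scale(column, stats):
    """Standardize using the stored train mean and standard deviation."""
    return [None if x is None else (x - stats["mean"]) / stats["std"] for x in column]

# Hypothetical mapping from model template to the steps it needs (an assumption,
# not the Engine's actual configuration).
TEMPLATE_STEPS = {
    "logistic_regression": [impute, scale],  # does not allow missing values
    "lightgbm": [],                          # can handle missing values itself
}

def preprocess(column, stats, template):
    """Apply the steps selected for the given model template, in order."""
    for step in TEMPLATE_STEPS[template]:
        column = step(column, stats)
    return column

values = [1.0, None, 3.0, 5.0, None, 7.0, None, 11.0]
train, test = values[:6], values[6:]   # a fixed split, for illustration only

stats = fit_stats(train)               # statistics come from the train set only
train_lr = preprocess(train, stats, "logistic_regression")
test_lr = preprocess(test, stats, "logistic_regression")   # reuses train statistics
test_gbm = preprocess(test, stats, "lightgbm")             # left untouched
```

In this sketch the missing value in the test set is filled with the train mean rather than a mean recomputed on the test set, and the LightGBM column passes through unchanged because its template requests no steps.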