What are estimated feature importance scores in the context of feature selection?

This article describes how feature selection in the AI & Analytics Engine uses estimated feature importance scores to improve the user experience.

Machine learning models learn a mapping between a set of inputs, commonly referred to as “features”, and an output, commonly known as the “target”.

For example: A model that estimates the current value of a car.

  • The set of inputs (features) would be attributes of the car, such as its make, model, age, odometer reading, condition, etc.

  • The output (target) would be the car's value.

However, not all features are equal: some contribute more to the estimated car value than others.

For example, the number of previous owners impacts the car valuation more than the age of the car tires.
Once a model is trained, the marginal contribution of each feature to the predicted target value can be estimated. Features that contribute more are considered more important.
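One common way to estimate such contributions is permutation importance: shuffle one feature's values and measure how much the model's error grows. This is a generic illustration, not necessarily the Engine's internal method; the toy car-valuation model and feature names below are made up for the sketch.

```python
import random

random.seed(0)

# Toy "trained model": car value from (num_owners, car_age, tire_age).
# tire_age has no effect on the prediction, so its importance should be zero.
def predict(num_owners, car_age, tire_age):
    return 30000 - 2500 * num_owners - 1200 * car_age + 0 * tire_age

# Synthetic dataset: feature rows and their (noise-free) true values.
rows = [(random.randint(1, 5), random.randint(1, 15), random.randint(0, 5))
        for _ in range(200)]
y = [predict(*r) for r in rows]

def mse(preds):
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

base_error = mse([predict(*r) for r in rows])

# Permutation importance: shuffle one column, measure the error increase.
def importance(col):
    shuffled = [r[col] for r in rows]
    random.shuffle(shuffled)
    permuted = [r[:col] + (s,) + r[col + 1:] for r, s in zip(rows, shuffled)]
    return mse([predict(*r) for r in permuted]) - base_error

scores = {name: importance(i)
          for i, name in enumerate(["num_owners", "car_age", "tire_age"])}
print(scores)  # tire_age scores 0; the other two score much higher
```

Features whose shuffling barely hurts the error (like `tire_age` here) are the ones a feature-selection step can safely drop.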

For more information about feature importance, read What are feature importance values.

Feature importance in the context of app creation

When you create a regression or classification app on the AI & Analytics Engine, there is a feature-selection step where you have two options:

  1. Let the Engine automatically select the best features

  2. Manually select features from the training dataset columns

Both of these options rely on estimated feature importances for the training dataset's columns. In either option, you can specify one or more columns to be entirely excluded while the Engine analyzes your dataset to determine the most important columns. This is useful in two scenarios:

  1. Target leakage: In this scenario, you have a column that is a proxy for the target but that will not be available in a real deployment. You may have created such columns to explore and analyze your training dataset. For instance, in a model predicting rainfall, a column Total Rainfall Next Day might be a proxy for your target, but it should be removed as it will not be available in real deployment.

  2. Irrelevant columns: You may have columns, such as customer IDs, product descriptions, or URLs, that are irrelevant to your prediction problem. Including such columns can slow down the feature selection process, as they can be computationally expensive to analyze.
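In data-preparation terms, excluding such columns simply means removing them from the candidate-feature set before any importance analysis runs. A minimal sketch, with hypothetical column names loosely based on the rainfall example above:

```python
# Hypothetical training rows; "customer_id" illustrates an irrelevant column
# and "total_rainfall_next_day" illustrates a target-leakage column.
rows = [
    {"customer_id": "C001", "humidity": 0.81, "pressure": 1012,
     "total_rainfall_next_day": 5.2, "rain_tomorrow": 1},
    {"customer_id": "C002", "humidity": 0.42, "pressure": 1023,
     "total_rainfall_next_day": 0.0, "rain_tomorrow": 0},
]

EXCLUDED = {"customer_id", "total_rainfall_next_day"}
TARGET = "rain_tomorrow"

# Candidate features: every column except the target and the exclusions.
candidates = [{k: v for k, v in row.items()
               if k not in EXCLUDED and k != TARGET}
              for row in rows]
print(sorted(candidates[0]))  # → ['humidity', 'pressure']
```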

💡The Engine uses Generative AI to smartly detect and suggest columns to be excluded from consideration as candidate features.

Excluding such columns provides a way for you to ensure that only relevant features are considered.

Automatic selection of best features

In this option, the Engine selects a subset of the candidate features such that their total estimated importance reaches a percentage threshold (95% by default). This means that only the columns that cumulatively contribute 95% of the total importance will be used for training a model.
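The cumulative-threshold idea can be sketched as follows. The importance scores and feature names below are made up for illustration, and the function is a simplified stand-in for the Engine's actual selection logic:

```python
# Hypothetical importance scores, normalized to sum to 1.
importances = {
    "num_owners": 0.40, "odometer": 0.30, "car_age": 0.18,
    "condition": 0.07, "make": 0.04, "tire_age": 0.01,
}

def select_features(importances, threshold=0.95):
    """Keep the most important features until their cumulative
    importance first reaches the threshold (default 95%)."""
    selected, cumulative = [], 0.0
    for name, score in sorted(importances.items(),
                              key=lambda kv: kv[1], reverse=True):
        if cumulative >= threshold:
            break
        selected.append(name)
        cumulative += score
    return selected

print(select_features(importances))
# → ['num_owners', 'odometer', 'car_age', 'condition']
```

Here the two least important features (`make` and `tire_age`) are dropped, since the top four already account for 95% of the total importance.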

By dropping the less important columns, training time is reduced, model explainability is increased, and the expected degradation in model performance is minimal.

Manually select features from the training dataset columns

In this option, you manually select the subset of features to be used for model training. Feature importances are estimated for all columns except those you chose to exclude, and the estimates are displayed as auxiliary information to help you choose features. This provides insight into which features are likely to be the most predictive in your final trained model, and which may be redundant.