Do you find feature selection a challenging task when dealing with high-dimensional datasets?
When dealing with large datasets, one often needs to extract the most relevant features so that models train faster, predict more accurately, and are less prone to overfitting.
Also known as variable selection or attribute selection, feature selection is one of the most time-consuming steps in data preparation. It matters all the more with high-dimensional datasets, where it helps extract meaningful insights and makes models easier to interpret.
In this article, we will go through some feature selection methods implemented in Python that will help you quickly extract the most relevant features to feed into your machine learning model. Each technique will return a different set of features; train your model on each set and compare the results to identify the features that lead to the highest predictive performance.
About the Dataset
The dataset we are going to analyze today comes from the telecom industry, where we predict whether a customer will churn based on numeric and categorical input features. This is a common use case across industries.
The data set includes information about:
Customers who left – the column is called Churn
Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
Demographic info about customers – gender, age range, and if they have partners and dependents
You can access the complete Jupyter notebook here
Importing All the Required Libraries
Reading the data file
Viewing the dataset
CustomerID does not provide any useful information so we drop it from our dataset.
Our target variable Churn is a categorical variable with two classes: Yes and No. So we encode it to numeric using the map function.
After dropping the CustomerID column, we are left with 20 columns, i.e. 19 features and 1 target variable.
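A minimal sketch of these two steps, using a made-up three-row frame rather than the real file (the values are invented for illustration, and the exact column spelling in your file may differ):

```python
import pandas as pd

# Toy stand-in for the telecom data; values are invented for illustration.
df = pd.DataFrame({
    "CustomerID": ["A-001", "A-002", "A-003"],
    "tenure": [1, 34, 2],
    "Churn": ["No", "No", "Yes"],
})

# The identifier carries no predictive signal, so drop it.
df = df.drop(columns=["CustomerID"])

# Encode the two target classes as integers.
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})

print(df)
```

Any label missing from the mapping dictionary would become NaN, so it is worth checking the target for missing values after this step.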
View the data type of each column
The TotalCharges column has type object when it should be numeric. This happens when pandas finds values it cannot parse as numbers while reading the file. So we explicitly convert it to numeric.
Most machine learning models require numeric input, not text. So we now one-hot encode the categorical columns to turn them into numeric ones.
We now have 46 columns where previously we had 20, i.e. 45 features and 1 target variable.
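The conversion can be sketched as follows. The blank entry is an assumed example of a value that blocks numeric parsing, and dropping the resulting missing rows is one possible choice, not necessarily what the original notebook does:

```python
import pandas as pd

# Toy column of numbers stored as text; the blank string is an assumed
# example of an entry that prevents automatic numeric parsing.
df = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " "]})
print(df.dtypes)  # a text dtype, not a numeric one

# errors="coerce" turns unparseable entries into NaN instead of raising.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
print(df.dtypes)  # float64

# Coerced entries are now missing and need handling (dropped here).
df = df.dropna(subset=["TotalCharges"])
```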
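As a rough illustration of the encoding step, `pandas.get_dummies` expands each categorical column into one 0/1 column per category, which is how names such as Contract_Month-to-month arise. The frame below is invented toy data:

```python
import pandas as pd

# Invented toy rows with one categorical column.
df = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "Churn": [0, 0, 1],
})

# Each category becomes its own 0/1 column named <column>_<category>.
encoded = pd.get_dummies(df, columns=["Contract"], dtype=int)
print(encoded.columns.tolist())
```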
Now that we have preprocessed our data, we are ready to implement feature selection techniques.
Let’s get the ball rolling, shall we?
1. Correlation with the Target Variable
One of the quickest ways to filter the irrelevant features is to check their correlation with the target variable.
Features with the strongest positive correlation: Contract_Month-to-month, OnlineSecurity_No, TechSupport_No, InternetService_Fiber optic, PaymentMethod_Electronic check
Features with the strongest negative correlation: tenure, Contract_Two year, OnlineBackup_No internet service, TechSupport_No internet service, StreamingTV_No internet service
Features with almost zero correlation: gender_Female, gender_Male, PhoneService_Yes, OnlineBackup_Yes, DeviceProtection_Yes
Note: Features with very low correlation coefficients can usually be discarded, since they provide little useful information for the model.
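A minimal sketch of this filter on invented, already-encoded toy data. The column names are borrowed from the lists above, but the values and the 0.1 cut-off are assumptions chosen only for illustration:

```python
import pandas as pd

# Invented toy data; not taken from the churn dataset.
df = pd.DataFrame({
    "tenure": [1, 2, 3, 40, 50, 60, 5, 70],
    "InternetService_Fiber optic": [1, 1, 1, 0, 0, 0, 1, 0],
    "gender_Male": [1, 0, 1, 0, 1, 0, 0, 1],
    "Churn": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Pearson correlation of every feature with the target (target excluded),
# sorted from most negative to most positive.
corr = df.corr()["Churn"].drop("Churn").sort_values()
print(corr)

# Hypothetical cut-off: keep features whose absolute correlation exceeds it.
threshold = 0.1
selected = corr[corr.abs() > threshold].index.tolist()
print(selected)
```

Pearson correlation only measures linear association, so a feature with a near-zero coefficient can still matter to a non-linear model.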
2. Decision Trees
The bar plot gives a very good intuition as to which features contribute the most to the target variable and which features can be easily discarded.
Features with the highest score: TotalCharges, MonthlyCharges, Contract_Month-to-month, tenure, InternetService_Fiber optic
Features with very little significance: OnlineBackup_No internet service, DeviceProtection_No internet service, OnlineSecurity_No internet service, TechSupport_No internet service, InternetService_DSL, StreamingTV_No internet service, StreamingMovies_No internet service
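Importance scores of this kind can be obtained roughly as follows. This is a sketch on synthetic data from `make_classification`, not the churn dataset, and the tree settings are assumptions:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded feature matrix (not the churn data).
X_arr, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Impurity-based importances, one per feature, sorted from highest to lowest.
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# importances.plot.bar() would draw a bar plot of the scores (needs matplotlib).
```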
3. XGBoost
Features with the highest score: InternetService_Fiber optic, Contract_Month-to-month, Contract_Two year, OnlineSecurity_No, TechSupport_No
Features with very little significance: OnlineBackup_No internet service, OnlineSecurity_No internet service, TechSupport_No internet service, InternetService_No, StreamingTV_No internet service, DeviceProtection_No internet service, MultipleLines_No phone service, PhoneService_Yes, StreamingMovies_No internet service, Partner_Yes, PaperlessBilling_Yes, gender_Male, Dependents_Yes
XGBoost assigns a negligible score to more features than the decision tree does, i.e. it flags more features as irrelevant for our machine learning model. Since XGBoost is an ensemble of multiple decision trees, it is more robust, so it is a good idea to do feature selection using an ensemble method.
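The same idea with a boosted ensemble, sketched on synthetic data. The fallback to scikit-learn's `GradientBoostingClassifier` is an assumption added so the snippet still runs where `xgboost` is not installed, and the 0.01 cut-off for "negligible" is hypothetical:

```python
import pandas as pd
from sklearn.datasets import make_classification

try:
    from xgboost import XGBClassifier as Booster
except ImportError:
    # Assumed fallback, not XGBoost itself: a different boosted-tree model.
    from sklearn.ensemble import GradientBoostingClassifier as Booster

# Synthetic stand-in for the encoded feature matrix (not the churn data).
X_arr, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

model = Booster(n_estimators=50, random_state=0).fit(X, y)

# One importance score per feature, sorted from highest to lowest.
scores = pd.Series(model.feature_importances_, index=X.columns)
scores = scores.sort_values(ascending=False)
print(scores)

# Hypothetical cut-off for treating a feature as negligible.
negligible = scores[scores < 0.01].index.tolist()
print(negligible)
```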
4. Step Forward Feature Selection
Step forward feature selection starts by evaluating each feature individually and selects the one that yields the best-performing model for the chosen algorithm. What counts as "best" depends entirely on the defined evaluation criterion (AUC, prediction accuracy, RMSE, etc.). Next, every combination of the selected feature with one remaining feature is evaluated and a second feature is selected, and so on, until the predefined number of features is reached.
We use the random forest classifier as our model and the roc_auc score as the evaluation metric, and we are interested in selecting the top 5 features. Remember, if we choose a random forest classifier to identify the best features, then the final model trained on those features should also be a random forest; otherwise the selected set may not give optimal results with a different algorithm.
Let’s view the top 5 features selected by the model:
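The library call used in the original notebook is not shown here, so the sketch below uses scikit-learn's `SequentialFeatureSelector` as one way to run step forward selection. It works on synthetic data and picks 3 of 6 features to stay fast; the forest size and the fold count are assumptions:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic stand-in for the encoded feature matrix (not the churn data).
X_arr, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

model = RandomForestClassifier(n_estimators=25, random_state=0)

# Greedily add the feature that most improves cross-validated ROC AUC
# until 3 features have been chosen.
sfs = SequentialFeatureSelector(
    model, n_features_to_select=3, direction="forward", scoring="roc_auc", cv=3
)
sfs.fit(X, y)

selected = X.columns[sfs.get_support()].tolist()
print(selected)
```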
5. Step Backward Feature Selection
Step backward feature selection is closely related to step forward feature selection. It starts with the entire set of features and works backward from there, removing one feature at a time until a subset of the predefined size remains.
Let’s view the top 5 features selected by the model:
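With scikit-learn's `SequentialFeatureSelector` (again an assumed choice of library, run on synthetic data), the backward variant differs only in the `direction` argument:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic stand-in for the encoded feature matrix (not the churn data).
X_arr, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

model = RandomForestClassifier(n_estimators=25, random_state=0)

# Start from all 6 features and repeatedly remove the one whose removal
# hurts cross-validated ROC AUC the least, until 3 remain.
sbs = SequentialFeatureSelector(
    model, n_features_to_select=3, direction="backward", scoring="roc_auc", cv=3
)
sbs.fit(X, y)

kept = X.columns[sbs.get_support()].tolist()
print(kept)
```

Forward and backward selection are both greedy, so they need not end up with the same subset.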
Note: Both step forward and step backward feature selection can be computationally very expensive, since a model must be trained and evaluated for every candidate feature at every step. If you have a very high-dimensional dataset, they might not be the most feasible choice for feature selection.
Automate the Feature Selection Process
Wouldn’t it be awesome if there were a way to let the software do all the above for you? After all, remembering all the techniques, implementing them in Python, and then comparing the results to choose the optimal set of features can be daunting.
Much to your delight, there exists a solution that will take care of not just the feature selection but the entire machine learning life cycle.
Once you upload your dataset, The Engine will automatically suggest the features that should be dropped to improve the model's predictive performance. All you need to do is to click the button and commit the action so the platform can do feature selection for you. Amazing, isn’t it?
Let's see how the whole process works:
Upload the dataset
Convert columns into their respective data types
Drop the non-predictive columns
Feature selection is one of the most important steps in building your model. There is no hard and fast rule for it and there are a number of techniques that can be used to arrive at the optimal set of features that will result in a model with the best predictive accuracy.
We implemented some of the most commonly used methods in Python to extract the best features. An alternative is to use a no-code platform like The AI & Analytics Engine, which removes the hassle of coding to extract the best features.