
Automate Machine Learning: XGBoost Algorithm [With & Without Code]

Explore one of the most common and powerful classification algorithms in machine learning: gradient boosted decision trees. 

We will implement XGBoost, which has been making waves in data science and analytics and has become the go-to algorithm of many data scientists for predictive modeling. XGBoost stands for eXtreme Gradient Boosting.

In one of our earlier blogs, we implemented decision trees both with and without code, using the same dataset that is used in this blog. Decision trees form the basis for more advanced machine learning algorithms such as gradient boosted trees, so we recommend reading that article first if you are not completely sure what decision trees are and how they work.

We will first implement XGBoost in Python using the xgboost and scikit-learn libraries. Then, we will implement it on the AI & Analytics Engine, which requires absolutely no programming experience.

About the Dataset

The dataset contains numeric features such as job, employment history, housing, marital status, purpose, savings, etc., used to predict whether a customer will be able to pay back a loan.

The target variable is ‘good_bad’. It consists of two categories: good and bad. Good means the customer will be able to repay the loan, whereas bad means the customer will default. This is a binary classification problem, meaning the target variable consists of only two classes.

This is a very common use case in banks and financial institutions as they have to decide whether to give a loan to a new customer based on how likely it is that the customer will be able to repay it.

Gradient Boosted Trees: Background Knowledge

Boosting in machine learning is an ensemble technique (an ensemble is a group of items viewed as a whole rather than individually) that refers to a family of algorithms that convert weak learners into strong learners.

The main principle of boosting is to fit a sequence of weak learners to weighted versions of the data. More weight is given to examples that were misclassified in earlier rounds, i.e., the weighting coefficients are increased for misclassified data and decreased for correctly classified data.

The new models are built on the residuals of the previous models, and we keep adding models sequentially until no further improvement is seen. The predictions are then combined through a weighted majority vote.

In XGBoost, the ensembles are constructed from individual decision trees: multiple trees are trained sequentially, and each new tree learns from the mistakes of its predecessor and tries to rectify them.

Hyperparameters for XGBoost

XGBoost has a lot of hyperparameters to fine-tune, so it is not always easy to optimize an XGBoost model. There is no definite answer as to which hyperparameters you should tune or which values will give the best results.

Some of the most common hyperparameters for XGBoost are:

max_depth: determines how deep each tree can grow. The deeper the tree, the more likely the model is to overfit.

learning_rate: also called shrinkage or eta, the learning rate slows down the learning of the gradient boosted trees to prevent the model from overfitting.

gamma: the minimum loss reduction required to make a split. If a node split does not lead to a positive reduction in the loss function, the split won’t happen.

lambda: a penalty factor that constrains the leaf weights, just like in ridge regression (L2 regularization).

scale_pos_weight: comes into play when there is class imbalance in the target variable. If the classes are severely skewed, a value greater than 1 (commonly the ratio of negative to positive examples) should be used.
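To make these concrete, here is a minimal sketch of how such hyperparameters can be set on the scikit-learn-compatible XGBClassifier; the values below are placeholders, not tuned recommendations:

from xgboost import XGBClassifier

model = XGBClassifier(
    max_depth=5,          # how deep each tree can grow
    learning_rate=0.1,    # shrink each tree's contribution
    gamma=1,              # minimum loss reduction required to split
    reg_lambda=1,         # L2 penalty on leaf weights (lambda)
    scale_pos_weight=1,   # raise above 1 when the positive class is rare
)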

Python Implementation

Let’s start by implementing an XGBoost classifier on the dataset using a Jupyter Notebook.

Importing the Required Libraries

Below is a list of all the libraries required to read the data, split it into training & test sets, build and evaluate the XGBoost model, optimize the hyperparameters using grid search, and plot the ROC and precision-recall curves.

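A sketch of the imports such a workflow needs; the exact list may vary:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (confusion_matrix, classification_report,
                             ConfusionMatrixDisplay, roc_curve, auc,
                             precision_recall_curve)
from sklearn.experimental import enable_iterative_imputer  # activates IterativeImputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier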

Reading the Dataset

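Assuming the data is stored in a CSV file (the file name below is a placeholder):

df = pd.read_csv('german_credit.csv')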

Viewing the Dataset

Let’s view the first three rows of our dataset:

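With pandas this is a one-liner:

df.head(3)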

What’s the Target / Response / Dependent Variable?

Let’s have a look at our target variable ‘good_bad’:

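For example:

df['good_bad'].head()  # the two categories are 'good' and 'bad'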

What are the dimensions of our dataset?

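The shape attribute returns (rows, columns):

df.shape  # (1000, 20)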

Our dataset consists of 1000 observations and 20 columns. Hence, we have 19 features and ‘good_bad’ is our target variable.

Getting a ‘feel’ for our Data

It’s always a good idea to explore your dataset in detail before building a model. Let’s look at the data types of all our columns:

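pandas reports the type of each column:

df.dtypes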

We have 18 columns of type integer, 1 column of type float, and 1 column (our target variable ‘good_bad’) of type object, meaning categorical.

Check for Missing Values

Always begin by identifying the missing values present in your dataset as they can hinder the performance of your model.

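A per-column count of missing values:

df.isnull().sum()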

The only missing values we have are present in the ‘purpose’ column.

Check for Class Imbalance

When solving a classification problem, you should ensure that the classes in the target variable are balanced, i.e., there is a roughly equal number of examples for every class. A dataset with imbalanced classes results in models with poor predictive performance, especially for the minority class.
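A quick way to check the balance is to count the examples per class:

df['good_bad'].value_counts()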

Although the number of examples of good is slightly higher than of bad, the dataset is not severely imbalanced, so we can proceed with building our model. If, for example, we had 990 examples of good and only 10 of bad, the dataset would be highly skewed and we would need to balance the classes.

Impute Missing Values

We begin by separating the features into numeric and categorical, since the technique for imputing missing values differs between the two.

For categorical features, the most frequent value occurring in the column is computed and the missing value is replaced with that. 

For numeric features, a range of techniques exists, such as imputing the mean value or building a model on the known values to predict the missing ones. The IterativeImputer used below does exactly that.

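A minimal sketch of both strategies; which feature falls into which group depends on how the dataset encodes its columns:

# Separate the feature columns by type (the target 'good_bad' is excluded)
features = df.drop(columns='good_bad')
categorical_cols = features.select_dtypes(include='object').columns
numeric_cols = features.select_dtypes(include='number').columns

# Categorical: replace missing values with the most frequent value in the column
if len(categorical_cols) > 0:
    df[categorical_cols] = SimpleImputer(strategy='most_frequent').fit_transform(df[categorical_cols])

# Numeric: model each feature with missing values as a function of the others
df[numeric_cols] = IterativeImputer(random_state=0).fit_transform(df[numeric_cols])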

The missing values have been imputed for the ‘purpose’ column. Now we have zero missing values in our dataset.

Separating into Features & Target Variable

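For instance:

X = df.drop(columns='good_bad')                   # all 19 features
y = LabelEncoder().fit_transform(df['good_bad'])  # 'bad'/'good' mapped to 0/1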

X has all our features whereas y has our target variable. 

Note: We have converted our target variable, which was categorical, into numeric values, since XGBoost requires all inputs to be numeric. This is called label encoding.

Splitting into Training & Testing Data

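A typical split; the 70/30 ratio and the random seed are assumptions:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)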

Fitting the XGBoost Model

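Fitting with the scikit-learn-style API:

xgb_model = XGBClassifier(random_state=42)  # default hyperparameters
xgb_model.fit(X_train, y_train)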

The XGBoost model has been fit using the default values for the hyperparameters, since we did not specify any yet. We will do that in a later section.

Feature Importance & Feature Selection

Just like decision trees, XGBoost is a very powerful tool for feature selection. When you have many features and do not know which ones to include, a good approach is to use XGBoost to assess feature importance.

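One way to visualize the importances the model has learned:

importances = pd.Series(xgb_model.feature_importances_, index=X.columns)
importances.sort_values().plot(kind='barh', figsize=(8, 6))
plt.title('XGBoost feature importances')
plt.show()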

To build a model with better predictive performance, we can remove the features that are not significant.

(Comparison: you might want to compare the feature importance returned by the XGBoost model with that returned by the decision tree model here.)

Making Predictions on Test Data

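Predicting the class of each test example:

y_pred = xgb_model.predict(X_test)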

Plotting the Confusion Matrix

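One way to plot it, using scikit-learn's ConfusionMatrixDisplay:

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot()
plt.show()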

Viewing the Evaluation Metrics

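The classification report summarizes precision, recall, and F1 score per class:

print(classification_report(y_test, y_pred))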

Tuning the Model through Hyperparameter Optimization

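An illustrative search space over a few of the hyperparameters discussed earlier; the candidate values are assumptions, not recommendations:

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'gamma': [0, 1, 5],
}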

Performing Grid Search on the Hyperparameters

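Grid search exhaustively tries every combination with cross-validation:

grid_search = GridSearchCV(XGBClassifier(random_state=42),
                           param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train)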

Viewing the Best Hyperparameters

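The best combination found, and the model refit with it:

print(grid_search.best_params_)
best_model = grid_search.best_estimator_  # refit on the full training set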

Plotting the ROC Curve

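A sketch using the tuned model's predicted probabilities for the positive class:

y_scores = best_model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_scores)
plt.plot(fpr, tpr, label='AUC = %.2f' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], linestyle='--')  # chance level
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()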

Plotting the Precision-Recall Curve
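Using the same scores as for the ROC curve:

precision, recall, _ = precision_recall_curve(y_test, y_scores)
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()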

And we are done with the Python implementation. Now let’s do the same on the AI & Analytics Engine.

AI & Analytics Engine Implementation

Upload the Dataset


Give your Recipe a Name


Preview the Data



Select the Target Variable


Convert Columns into Categorical & Numerical Data Types

The platform automatically tells you which columns should be converted to categorical and which to numeric.

All you need to do is click the + button next to each suggestion, go to the recipe tab, and click the commit button.


Impute Missing Values

The platform automatically detects which columns have missing values and suggests imputing them.

As we saw earlier, the only column with missing values is ‘purpose’.


Feature Importance & Selection

In Python, we had to write code and draw a bar plot to identify the most important features.

The AI & Analytics Engine uses AI to predict which columns do not help the model learn, and suggests that the user remove these non-predictive columns.


Click on the commit button to implement the suggestions.


Click on the Finalize & End button to complete the data wrangling process.


Visualize the Data

The platform generates easy-to-understand charts and histograms for both numeric and categorical columns to help you better understand the data and its spread.


Build the XGBoost Model

Click on the + button and select New App.


Give your application a name, and since it’s a classification problem, select Prediction and choose the target variable from the dropdown menu.


Click on the + button and select Train New Model.


Select XGBoost Classifier from the list of algorithms. The platform includes all the common machine learning algorithms, and you can use them without writing a single line of code.


You can either select the Default Configuration or the Advanced Configuration, where you can tune your model by optimizing the hyperparameters.


Here you can configure the hyperparameters for the XGBoost Classifier, the same ones we optimized using grid search in Python.


Next, click on ‘Train Model’ and the platform will start training your XGBoost Classifier.

Our model has now been trained and is ready for evaluation.


Model Summary


It’s now time to see how well the model has learned from the data. The platform has built-in capabilities to generate evaluation metrics such as the confusion matrix, classification report, ROC curve & PR curve.

Confusion Matrix


Precision, Recall & F1 Score


In machine learning, you usually have to build multiple models and compare their evaluation metrics to select the model that best serves your business use case. We have implemented the decision tree classifier on the same dataset; you might be interested in drawing a comparison between the XGBoost model and the decision tree model here.

ROC Curve


Precision-Recall Curve


Wrap-Up

Implementing XGBoost or any other machine learning algorithm on the AI & Analytics Engine is intuitive, straightforward, and seamless. We believe that making data science easier for more people to actively participate in (yes, you non-coders) is an opportunity to empower everyday business users and drive better data-led decision making.

Interested in discovering what you can get out of The Engine? Free Trial
