
Automate Machine Learning: XGBoost Algorithm [With & Without Code]

XGBoost has been making waves in data science and has become the go-to algorithm for predictive modeling among many data scientists.


Explore one of the most common and powerful classification algorithms in machine learning: gradient boosted decision trees. 

We will implement XGBoost, which stands for extreme gradient boosting and is a popular choice for predictive modelling.

In one of our earlier blogs, we implemented decision trees both with and without code using the same dataset that is used in this blog. Decision trees form the basis for more advanced machine learning algorithms such as gradient boosted trees. We recommend reading that article first if you are not completely sure what decision trees are and how they work. Or, to get started with no-code machine learning, check out this article on Getting Started with Data Science with no Coding.

We will first implement XGBoost in Python with the help of the scikit-learn library. Then, we will implement it on the AI & Analytics Engine, which requires absolutely no programming experience.

About the Dataset

The dataset contains numeric features such as job, employment history, housing, marital status, purpose, savings, etc., which are used to predict whether a customer will be able to pay back a loan.

The target variable is ‘good_bad’. It consists of two categories: good and bad. Good means the customer will be able to repay the loan whereas bad means the customer will default. It is a binary classification problem meaning the target variable consists of two classes only. 

This is a very common use case in banks and financial institutions as they have to decide whether to give a loan to a new customer based on how likely it is that the customer will be able to repay it.

Gradient Boosted Trees: background knowledge

Boosting in machine learning is an ensemble technique (ensemble means a group of items or things viewed as a whole rather than individually) that refers to a family of algorithms that convert weak learners to strong learners.

The main principle of classic boosting algorithms such as AdaBoost is to fit a sequence of weak learners to weighted versions of the data. More weight is given to examples that were misclassified in earlier rounds, i.e., the weighting coefficients are increased for misclassified examples and decreased for correctly classified ones. The predictions are then combined through a weighted majority vote.

Gradient boosting follows the same sequential idea, but each new model is fit to the residuals (more precisely, the negative gradient of the loss) of the ensemble built so far. We keep adding new models sequentially until no further improvement is seen, and the final prediction is the sum of the contributions of all the models.

In XGBoost, the ensembles are constructed from individual decision trees. Hence, multiple trees are trained sequentially and each new tree learns from the mistakes of its predecessor and tries to rectify them.
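
To make the idea of learning from the mistakes of earlier models concrete, here is a minimal sketch of boosting on residuals in plain Python. It is a toy, not XGBoost itself: it assumes a one-feature regression problem with squared-error loss, uses one-split "stumps" as weak learners, and the data and the names `fit_stump`, `boost`, and `mse` are made up for illustration.

```python
# Toy gradient boosting for squared-error loss with one-split "stumps".
# Illustrative only: XGBoost adds regularisation, deeper trees, and more.

def fit_stump(xs, residuals):
    """Pick the threshold on xs that best splits the residuals into two means."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data with a step-like shape
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.0, 5.1]

one_round = boost(xs, ys, n_rounds=1)
many_rounds = boost(xs, ys, n_rounds=20)
print(mse(one_round, xs, ys), mse(many_rounds, xs, ys))
```

Each round shrinks the training error a little further, which is also why a small learning rate and a limit on the number of rounds are commonly used to keep the ensemble from overfitting.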

Hyperparameters for XGBoost

XGBoost has a lot of hyperparameters to fine-tune, so it is not always easy to optimize a model built with it. There is no definite answer as to which hyperparameters you should tune or which values will give the best results.

Some of the most common hyperparameters for XGBoost are:

max_depth: This determines how deep each tree can grow. The deeper the trees, the more likely the model is to overfit.

learning_rate: Also called shrinkage or eta, learning_rate scales down the contribution of each new tree, slowing the learning of the gradient boosted trees to help prevent the model from overfitting.

gamma: The value of gamma sets the minimum reduction in the loss required to make a split. If splitting a node does not reduce the loss by at least gamma, that split won’t happen.

lambda: An L2 penalty that puts a constraint on the leaf weights, just like in ridge regression.

scale_pos_weight: This comes into play when there is a class imbalance in the target variable. The default is 1; if the classes are severely skewed, a commonly suggested value is the number of negative examples divided by the number of positive examples.
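
As a rough illustration, the hyperparameters listed above can be collected in a dictionary keyed by the argument names of XGBoost's scikit-learn style classifier (where lambda is spelled `reg_lambda`). The class counts and the chosen values are assumptions made up for this sketch, not recommendations.

```python
# Hypothetical class counts, used only to illustrate scale_pos_weight
n_negative = 900  # examples of the negative class
n_positive = 100  # examples of the positive class

params = {
    "max_depth": 4,        # maximum depth of each tree
    "learning_rate": 0.1,  # shrinkage (eta) applied to each new tree
    "gamma": 1.0,          # minimum loss reduction required to make a split
    "reg_lambda": 1.0,     # L2 penalty on the leaf weights ("lambda")
    # ratio of negative to positive examples, a commonly suggested value
    "scale_pos_weight": n_negative / n_positive,
}
print(params)
```

A dictionary like this could then be unpacked into the model, for example `XGBClassifier(**params)`.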

Python Implementation

(You can view and download the complete Jupyter Notebook here. And the dataset 'creditscoring.csv' can be downloaded from here)

Importing the Required Libraries

We start by importing the libraries required to read the data, split it into training and test sets, build and evaluate the XGBoost model, optimize the hyperparameters using grid search, and plot the ROC and precision-recall curves.

Libraries to read the data

Reading the Dataset


Viewing the Dataset

Let’s view the first three rows of our dataset:

First three rows of the dataset
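
The reading and previewing steps might look like the sketch below. So that it runs on its own, it parses a few made-up rows from an in-memory string; the column values are invented, and with the real file the call would simply be `pd.read_csv("creditscoring.csv")`.

```python
import io

import pandas as pd

# Stand-in for creditscoring.csv: a few invented rows with the same target column
csv_text = """job,housing,savings,good_bad
2,1,1,good
1,2,4,bad
3,1,2,good
2,3,1,good
"""

# With the real file: df = pd.read_csv("creditscoring.csv")
df = pd.read_csv(io.StringIO(csv_text))

# View the first three rows
print(df.head(3))
```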

What’s the Target/ Response/ Dependent Variable?

Let’s have a look at our target variable ‘good_bad’

good_bad variable

What are the dimensions of our dataset?

dimensions of the dataset

Our dataset consists of 1000 observations and 20 columns. Hence, we have 19 features and ‘good_bad’ is our target variable.

Getting a ‘feel’ for our Data

It’s always a good idea to explore your dataset in detail before building a model. Let’s look at the data types of all our columns:

Column data types

We have 18 columns of type integer, 1 column of type float, and 1 column (our target variable ‘good_bad’) of type object, meaning it is categorical.
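
The exploration steps above (target categories, dimensions, and data types) can be sketched with pandas as follows. The miniature frame is made up so that the snippet is self-contained; on the real data the same three calls would be applied to the full DataFrame.

```python
import pandas as pd

# Invented miniature frame standing in for the credit scoring data
df = pd.DataFrame(
    {
        "job": [2, 1, 3, 2],
        "savings": [1, 4, 2, 1],
        "amount": [1169.0, 5951.0, 2096.0, 7882.0],
        "good_bad": ["good", "bad", "good", "good"],
    }
)

print(df["good_bad"].unique())  # categories of the target variable
print(df.shape)                 # (number of rows, number of columns)
print(df.dtypes)                # data type of every column
```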

Check for Missing Values

Always begin by identifying the missing values present in your dataset as they can hinder the performance of your model.


The only missing values we have are present in the ‘purpose’ column.
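
One common way to count missing values per column is `isnull().sum()`. The sketch below uses invented values, with gaps placed in a ‘purpose’ column to mirror the situation described above.

```python
import numpy as np
import pandas as pd

# Invented rows; NaN marks a missing value
df = pd.DataFrame(
    {
        "purpose": [1.0, np.nan, 3.0, np.nan],
        "amount": [1169, 5951, 2096, 7882],
        "good_bad": ["good", "bad", "good", "good"],
    }
)

# Number of missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```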

Check for Class Imbalance

When solving a classification problem, you should check whether the classes in the target variable are balanced, i.e., whether there is a roughly equal number of examples for every class. A dataset with heavily imbalanced classes often results in models that have poor predictive performance, especially for the minority class.

Although the number of examples for good is slightly higher than for bad, the dataset is not severely imbalanced, so we can proceed with building our model. If, for example, we had 990 examples for good and only 10 examples for bad, our dataset would be highly skewed and we should balance the classes.
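
A quick way to check the class balance is to count the values of the target column, as in this sketch. The labels are hypothetical; the real counts come from the ‘good_bad’ column of the dataset.

```python
import pandas as pd

# Hypothetical labels standing in for the 'good_bad' column
target = pd.Series(["good"] * 6 + ["bad"] * 4, name="good_bad")

counts = target.value_counts()                # absolute counts per class
shares = target.value_counts(normalize=True)  # share of each class
print(counts)
print(shares)
```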

Impute Missing Values

We begin by separating the features into numeric and categorical. The technique to impute missing values for numeric and categorical features is different. 

For categorical features, the most frequent value occurring in the column is computed and the missing value is replaced with that. 

For numeric features, there is a range of techniques, such as using the mean value or building a model on the known values and predicting the missing ones. The Iterative Imputer used here takes the second approach.

Impute missing values

The missing values have been imputed for the ‘purpose’ column. Now we have zero missing values in our dataset.
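
The two imputation strategies described above could be sketched with scikit-learn as follows. The column names, the values, and the split into categorical and numeric columns are assumptions for illustration; note that `IterativeImputer` is still marked experimental in scikit-learn and needs the explicit enabling import.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Invented data: 'purpose' is a numerically coded category
df = pd.DataFrame(
    {
        "purpose": [1.0, np.nan, 1.0, 3.0],
        "amount": [1000.0, 2000.0, np.nan, 4000.0],
        "duration": [10.0, 20.0, 30.0, 40.0],
    }
)
categorical_cols = ["purpose"]            # assumed split of the columns
numeric_cols = ["amount", "duration"]

# Categorical: replace missing values with the most frequent value
cat_imputer = SimpleImputer(strategy="most_frequent")
df[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])

# Numeric: predict missing values from the other numeric columns
num_imputer = IterativeImputer(random_state=0)
df[numeric_cols] = num_imputer.fit_transform(df[numeric_cols])

print(df.isnull().sum())
```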

Separating into Features & Target Variable

Separating into features and target variable

X has all our features whereas y has our target variable. 

Note: We have converted our categorical target variable into a numeric one, since XGBoost requires all inputs to be numeric. This is called label encoding.
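
A minimal sketch of this step is shown below, assuming made-up rows and an explicit mapping of bad to 0 and good to 1. This is one simple way to label encode; scikit-learn's `LabelEncoder` is a common alternative.

```python
import pandas as pd

# Invented rows standing in for the credit scoring data
df = pd.DataFrame(
    {
        "job": [2, 1, 3, 2],
        "savings": [1, 4, 2, 1],
        "good_bad": ["good", "bad", "good", "good"],
    }
)

# X holds the features, y the label-encoded target (bad -> 0, good -> 1)
X = df.drop(columns="good_bad")
y = df["good_bad"].map({"bad": 0, "good": 1})

print(X.columns.tolist())
print(y.tolist())
```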

Splitting into Training & Testing Data
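
The split can be done with scikit-learn's `train_test_split`. In this sketch the data is synthetic (generated with `make_classification` to match the 1000 rows and 19 features of the dataset), and the 70/30 ratio, the stratification, and the random seed are assumptions rather than the settings of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features X and target y
X, y = make_classification(n_samples=1000, n_features=19, random_state=42)

# Hold out 30% of the rows for testing, keeping the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```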


Fitting the XGBoost Tree Model

Fitting the XG Boost Tree Model

The XGBoost model has been fit using the default values for the hyperparameters, since we have not specified any yet. We will do that in a later section.

Feature Importance & Feature Selection

Just like decision trees, XGBoost is a useful tool for feature selection. When you have many features and do not know which ones to include, a good approach is to use XGBoost to assess feature importance.

Feature importance and feature selection

To build a model with better predictive performance, we can remove the features that are not significant.

(Comparison: You might want to compare the feature importance returned by the XGBoost model with the feature importance returned by the decision tree model here.)

Making Predictions on Test Data

Making predictions on testing

Plotting the Confusion Matrix

Plotting the confusion matrix

Viewing the Evaluation Metrics

Viewing the evaluation metrics
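
The confusion matrix and the evaluation metrics can be computed with scikit-learn as sketched here. To keep the example small, the true and predicted labels are made up; in practice `y_pred` would come from `model.predict(X_test)`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels: 1 = good, 0 = bad
y_test = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # in practice: model.predict(X_test)

# Rows are the true classes, columns the predicted classes (order: 0, 1)
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1 score for each class
print(classification_report(y_test, y_pred, target_names=["bad", "good"]))
```

With these labels the matrix contains 3 true negatives, 1 false positive, 1 false negative, and 5 true positives.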

Tuning the Model through Hyperparameter Optimization

Tuning the model through hyperparameter optimization

Performing Grid Search on the Hyperparameters

 

Performing grid search on the hyperparameters

Viewing the Best Hyperparameters

Viewing the best hyperparameters

Plotting the ROC Curve

 

Plotting the ROC curve
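
The points of the ROC curve and the area under it can be computed with scikit-learn, as in this sketch. The labels and scores are a tiny made-up example; in practice the scores would be the predicted probabilities of the positive class, e.g. `model.predict_proba(X_test)[:, 1]`.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted scores for the positive class
y_test = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)
print(fpr, tpr)
print(auc)
```

Here the area under the curve is 0.75, because three of the four positive/negative pairs are ranked correctly. The `fpr` and `tpr` arrays can be passed to a plotting library to draw the curve.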

Plotting the Precision-Recall Curve
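
The precision-recall curve can be computed in much the same way. This sketch again uses made-up labels and scores; `average_precision_score` summarises the curve in a single number.

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# Made-up true labels and predicted scores for the positive class
y_test = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Precision and recall at each threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_score)
ap = average_precision_score(y_test, y_score)
print(precision, recall)
print(ap)
```

The `recall` and `precision` arrays can be passed to a plotting library to draw the curve.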

Plotting the precision-recall curve

And we are done with the Python implementation. Now let’s do the same on the AI & Analytics Engine.

AI & Analytics Engine Implementation

Upload the Dataset

Uploading the dataset on the AI & Analytics Engine

Give your Recipe a Name

Naming your recipe on the AI & Analytics Engine

Preview the Data

Preview the data on the AI & Analytics Engine


Select the Target Variable

Selecting the target variable on the AI & Analytics Engine

Convert Columns into Categorical & Numerical Data Types

The platform will automatically tell you which columns should be converted to categorical and which ones should be converted to numeric.

All you need to do is click on the + button right next to the suggestion, go to the recipe tab and click the commit button.

Convert columns into categorical and numerical data types

 

Convert columns to numeric data type on the AI & Analytics Engine

 

Convert columns into categorical and numerical data type on the AI & Analytics Engine

Feature Importance & Selection

In Python, we had to write code and draw a bar plot to identify the most important features.

The AI & Analytics Engine uses AI to predict which columns are not significant in helping the model learn, and suggests that the user remove these non-predictive columns.

Feature importance and selection on the AI & Analytics Engine

Click on the commit button to implement the suggestions.

Committing the actions on the AI & Analytics Engine

Click on the Finalize & End button to complete the data wrangling process.

Finalize and end on the AI & Analytics Engine

Visualize the Data

The platform generates easy-to-understand charts and histograms for both the numeric and categorical columns to help you better understand the data and its spread.

Data visualisation on the AI & Analytics Engine

Build XGBoost Model

Click on the + button and click on New App

data visualisation on the AI & Analytics Engine

Give your application a name and since it’s a classification problem, select Prediction and choose the target variable from the dropdown menu.

Naming your application on the AI & Analytics Engine

Click on the + button as shown below and select Train New Model

Training a new model on the AI & Analytics Engine

Select XG Boost Classifier from the list of algorithms. As you can see, the platform offers a range of machine learning algorithms that you can use without writing a single line of code.

XG Boost classifier model on the AI & Analytics Engine

You can either select the Default configuration or the Advanced Configuration where you can tune your model by optimizing the hyperparameters.

Model configuration on the AI & Analytics Engine

The Advanced Configuration shows a list of hyperparameters for the XG Boost Classifier, the same ones that we optimized using grid search in Python.

Training the XG Boost Classifier model on the AI & Analytics Engine

Next, click on ‘Train Model’ and the platform will start training your XG Boost Classifier.

Our model has now been trained and is ready for evaluation.

Models on the AI & Analytics Engine

Model Summary

Model Summary on the AI & Analytics Engine

It’s now time to see how well the model has learned from the data. For this, the platform has inbuilt capabilities to generate evaluation metrics such as confusion matrix, classification report, ROC curve & PR curve.

Confusion Matrix

Confusion matrix on the AI & Analytics Engine

Precision, Recall & F1 Score

Precision, recall, and F1 score on the AI & Analytics Engine

In machine learning, you usually have to build multiple models and then compare their evaluation metrics to select the model that best serves your business use case. We have implemented the decision tree classifier on the same dataset. You might be interested in comparing the XGBoost model with the decision tree model here.

ROC Curve

ROC curve on the AI & Analytics Engine

Precision-Recall Curve

Precision-recall curve on the AI & Analytics Engine

Wrap-Up

Implementing XGBoost or any other machine learning algorithm on the AI & Analytics Engine is intuitive, straightforward and seamless. We believe that making data science easier for more people to actively participate in (yes, you non-coders) is an opportunity to empower everyday business users and drive better data-led decision making.
