
Machine Learning Classification Algorithm: Decision Trees


In this blog, we will implement one of the most common classification algorithms in machine learning: decision trees.

We will first implement it in Python using the scikit-learn library, which requires experience with Python as a scripting language. Then we will implement it on the AI & Analytics Engine, which requires absolutely no programming experience. The Engine is a drag-and-drop data science and machine learning platform for everyone from business users to data scientists who want to quickly and seamlessly extract insights from data (bonus: no prior coding experience required).

Don’t believe what you just read? You can get started with the platform without paying a dime.

About the dataset

The dataset contains numeric features such as job, employment history, housing, marital status, purpose, and savings, which we use to predict whether a customer will be able to repay a loan.

The target variable is ‘good_bad’. It consists of two categories: good and bad. Good means the customer will be able to repay the loan, whereas bad means the customer will default.

This is a binary classification problem, meaning the target variable has only two classes.

This is a very common use case in banks and financial institutions as they have to decide whether to give a loan to a new customer based on how likely it is that the customer will be able to repay it.

Decision Trees: Background knowledge


 

A decision tree is a supervised machine learning algorithm that breaks a dataset down into smaller and smaller subsets while, at the same time, an associated tree structure is incrementally developed. The final result is a tree with decision nodes and leaf nodes.

A decision tree can handle both categorical and numerical data. Decision trees are the building units of more advanced machine learning algorithms such as boosted decision trees and hence serve as a natural baseline for comparison with other tree-based algorithms.

Hyperparameters for decision tree

The performance of a machine learning model can be improved by tuning its hyperparameters. Hyperparameters are parameters that the user has to set in advance; they are not learned from the data during training.

Some of the most common hyperparameters for a decision tree are:

Criterion: This parameter determines how the impurity of a split will be measured. The possibilities are ‘gini’ or ‘entropy’.

Splitter: How the decision tree searches the features for a split. The default is ‘best’, meaning that for each node the algorithm considers all the features and chooses the best split. If it is set to ‘random’, a random subset of features is considered, and the split is made on the best feature within that subset. The size of the random subset is determined by the ‘max_features’ parameter.

Max_depth: This determines how deep the tree can grow. The default is None, which places no limit on depth and often results in overfitting. The max_depth parameter is one of the ways in which we can regularize the tree, or limit the way it grows, to prevent overfitting.

Min_samples_split: The minimum number of samples a node must contain before it can be split. The default is 2. This is another parameter used to regularize the decision tree.

Max_features: The number of features to consider when looking for the best split. By default, the decision tree will consider all available features to make the best split.
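In scikit-learn, these hyperparameters map directly onto arguments of the DecisionTreeClassifier constructor. A minimal sketch (the values below are arbitrary illustrations, not the tuned ones we arrive at later):

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative values only; we tune these properly later with grid search
tree = DecisionTreeClassifier(criterion='gini',       # impurity measure
                              splitter='best',        # consider all features at each node
                              max_depth=5,            # limit how deep the tree can grow
                              min_samples_split=10,   # min samples needed to split a node
                              max_features=None)      # use all features when searching splits
```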

Python Implementation

Let’s start by implementing a decision tree classifier on the dataset using Jupyter Notebook. 

(You can view and download the complete Jupyter Notebook here. And the dataset 'creditscoring.csv' can be downloaded from here)

Importing the Required Libraries

Below is a list of all the libraries required to read the data, divide it into train and test sets, build and evaluate the decision tree, optimize the hyperparameters using grid search, and plot the ROC and precision-recall curves.

Importing required libraries
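For readers following along without the notebook, the imports for the steps described in this post might look something like this (the exact module choices are an assumption based on what the walkthrough uses):

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Model, splitting, grid search, imputation, and evaluation utilities
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.metrics import (confusion_matrix, ConfusionMatrixDisplay,
                             classification_report, roc_curve, auc,
                             precision_recall_curve)
```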

Reading the Dataset

Reading the dataset
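Loading the CSV is a single pandas call; assuming the file 'creditscoring.csv' sits next to the notebook:

```python
df = pd.read_csv('creditscoring.csv')
```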

 

Viewing the Dataset

Let’s view the first three rows of our dataset:

Viewing the first 3 rows of the dataset
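Viewing the first rows is a one-liner:

```python
df.head(3)
```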

 

What’s the Target / Response / Dependent Variable?

Let’s have a look at our target variable ‘good_bad’:

Looking at a specific target variable
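A quick way to inspect the target column and its categories (assuming the DataFrame is named df):

```python
df['good_bad'].unique()   # the two categories: 'good' and 'bad'
```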

 

What are the dimensions of our dataset?

Dataset dimensions
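The dimensions come from the DataFrame’s shape attribute:

```python
df.shape   # (1000, 20), as described below
```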

Our dataset consists of 1000 observations and 20 columns. Hence, we have 19 features and ‘good_bad’ is our target variable.

Getting a ‘feel’ for our Data

It’s always a good idea to explore your dataset in detail before building a model. Let’s see the data types of all our columns.

Column data types
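The column types can be listed with:

```python
df.dtypes
```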

 

We have 18 columns of type integer, 1 column of type float, and 1 column (our target variable ‘good_bad’) of type object, meaning it is categorical.

Check for Missing Values

Always begin by identifying the missing values present in your dataset as they can hinder the performance of your model.

Checking data for missing value
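A per-column count of missing values can be produced with:

```python
df.isnull().sum()
```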

The only missing values we have are present in the ‘purpose’ column.

Check for Class Imbalance

When solving a classification problem, you should check whether the classes present in the target variable are balanced, i.e. whether there is a roughly equal number of examples for each class. A dataset with imbalanced classes results in models with poor predictive performance, especially for the minority class. In many applications the minority or extreme class is the one that matters most from a business perspective, so this issue needs to be taken care of while preparing the data.

An example would help us illustrate our point better.

Consider two customers, A and B. A has a good credit history and, if given a loan, would be able to repay it. However, our model predicts A to be a ‘bad’ customer, and so A is refused a loan. This hurts the bank, since it loses a customer who would have repaid the loan and on whom it could have made a profit.

Now let’s consider customer B. B has a bad credit history and files an application for a loan. In reality, B would not be able to repay the loan. However, the model predicts B to be a ‘good’ customer, and so the loan application is approved. This is much more damaging for the bank, since it would not be able to recover the loan.

These two examples illustrate why class balance can be so important from a business perspective.

Checking for class imbalances
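Counting the examples per class is a single call:

```python
df['good_bad'].value_counts()
```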

Although there are slightly more examples of good than of bad, the dataset is not severely imbalanced, so we can proceed with building our model. If, for example, we had 990 examples of good and only 10 of bad, the dataset would be highly skewed and we would need to balance the classes first.

Impute Missing Values

We begin by separating the features into numeric and categorical, since the techniques used to impute missing values differ between the two.

For categorical features, the most frequent value in the column is computed and missing values are replaced with it.

For numeric features, there is a range of techniques, such as filling in the mean value or building a model on the known values and predicting the missing ones. The IterativeImputer used below does exactly that: it models each feature with missing values as a function of the other features.

Imputing missing values
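A sketch of what this step could look like with scikit-learn’s imputers (the column groupings and imputer settings here are assumptions, not necessarily the exact notebook code):

```python
# Numeric feature columns vs. object (categorical) columns, excluding the target
num_cols = df.drop(columns=['good_bad']).select_dtypes(include='number').columns
cat_cols = df.drop(columns=['good_bad']).select_dtypes(include='object').columns

# Categorical features: fill missing values with the most frequent value
if len(cat_cols) > 0:
    df[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(df[cat_cols])

# Numeric features: IterativeImputer models each column with missing
# values as a function of the other columns
df[num_cols] = IterativeImputer(random_state=0).fit_transform(df[num_cols])

df.isnull().sum().sum()   # should now be 0
```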

 


The missing values have been imputed for the ‘purpose’ column. Now we have zero missing values in our dataset.

Separating into Features & Target Variable

Separating into Features and target variable
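Separating predictors from the target is a matter of dropping the target column:

```python
X = df.drop(columns=['good_bad'])
y = df['good_bad']
```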

X has all our features whereas y has our target variable.

Splitting into Training & Testing Data

Data splitting into training and testing sets
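A typical split might look like this (the test-set fraction and random seed are assumptions, not necessarily the values used in the original notebook):

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
```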

 

Fitting the Decision Tree Model

Fitting the decision tree model
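With the default hyperparameters, fitting the tree is two lines:

```python
clf = DecisionTreeClassifier(random_state=42)  # all hyperparameters left at their defaults
clf.fit(X_train, y_train)
```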

The decision tree model has been fit using the default values for the hyperparameters since we did not specify any hyperparameters yet. We will do that in the next section.

Feature Importance & Feature Selection

Decision trees are widely used for feature selection. When you have many features and do not know which ones to include, a good approach is to use a decision tree to assess feature importance, even if you end up using a completely different machine learning algorithm for your final model.

Feature importance and feature selection
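A bar plot of the fitted tree’s importances can be produced roughly as follows:

```python
importances = pd.Series(clf.feature_importances_, index=X.columns)
importances.sort_values().plot(kind='barh')
plt.xlabel('Feature importance')
plt.show()
```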

From the plot, we can see that the most important feature for predicting whether a customer will repay a loan is amount, followed by age, purpose, duration, and so on. The least important features, which contribute little to predicting the target variable, are foreign, depends, coapp, and resident.

To build a model with better predictive performance, we can remove the features that are not significant.

Making Predictions on Test Data

Making predictions on test data
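Predictions on the held-out test set:

```python
y_pred = clf.predict(X_test)
```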

 

Plotting the Confusion Matrix

Plotting the confusion matrix
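One way to compute and plot the matrix (the label ordering is an assumption):

```python
cm = confusion_matrix(y_test, y_pred, labels=['good', 'bad'])
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['good', 'bad']).plot()
plt.show()
```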

 

Viewing the Evaluation Metrics

Viewing the evaluation metrics
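scikit-learn’s classification report gives precision, recall, and F1 per class:

```python
print(classification_report(y_test, y_pred))
```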

 

Tuning the Model through Hyperparameter Optimization

Tuning the model through hyperparameter optimization

Performing Grid Search on the Hyperparameters

Performing grid search on the hyperparameters
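A grid search over the hyperparameters discussed earlier might be set up like this (the candidate values and scoring metric below are illustrative assumptions):

```python
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'max_features': [None, 'sqrt', 0.5],
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=5, scoring='f1_macro')
grid.fit(X_train, y_train)
```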

 


 

Viewing the Best Hyperparameters

Viewing the best hyperparameters
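The winning combination and its cross-validated score are stored on the fitted search object:

```python
print(grid.best_params_)
print(grid.best_score_)
```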

 

Making Predictions on Test Data with our Optimized Decision Tree Model

Making predictions on test data with the optimized model
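Predictions now come from the best estimator found by the search:

```python
best_tree = grid.best_estimator_
y_pred_tuned = best_tree.predict(X_test)
```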

 

Plotting the ROC Curve

Plotting the ROC curve
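An ROC curve needs class probabilities rather than hard predictions; treating ‘good’ as the positive label (an assumption about how the problem is framed):

```python
good_idx = list(best_tree.classes_).index('good')
y_score = best_tree.predict_proba(X_test)[:, good_idx]

fpr, tpr, _ = roc_curve(y_test, y_score, pos_label='good')
plt.plot(fpr, tpr, label=f'AUC = {auc(fpr, tpr):.2f}')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```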

 

Plotting the Precision-Recall Curve

Plotting the precision recall curve
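The precision-recall curve uses the same scores:

```python
precision, recall, _ = precision_recall_curve(y_test, y_score, pos_label='good')
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()
```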

 


Evaluation Metrics for the Optimized Decision Tree Model

Evaluation metrics for the optimized decision tree model
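And again the classification report, this time for the tuned tree:

```python
print(classification_report(y_test, y_pred_tuned))
```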

 


If you compare these new metrics with the old ones before the hyperparameters were tuned, you would see a significant improvement in precision and recall.

BAM!

And we are done with the Python implementation. That was quite a bit of code that you had to write and understand. For someone who is new to the realm of data science and machine learning, it might feel like a lot of information to digest at first glance.

The fear of not knowing how to code should not stop you from implementing data science.

We have built an all-in-one platform that takes care of the entire machine learning lifecycle. From importing a dataset to cleaning and preprocessing it, from building models to visualizing the results and finally deploying them, you won’t have to write even a SINGLE line of code. The goal is to make machine learning and data science accessible to everyone!

The AI & Analytics Engine Implementation

Upload the Dataset

Upload the dataset on the AI & Analytics Engine

 

Give your Recipe a Name

Give your recipe a name on the AI & Analytics Engine

 

Preview the Data

Preview the data on the AI & Analytics Engine

Select the Target Variable

Select the target variable on the AI & Analytics Engine

 

Convert Columns into Categorical & Numerical Datatypes

The platform will automatically tell you which columns should be converted to categorical and which ones should be converted to numeric.

All you need to do is click on the + button right next to the suggestion, go to the recipe tab and click the commit button.

Convert columns into categorical and numerical data types on the AI & Analytics Engine

 

Convert columns into numerical data types on the AI & Analytics Engine

 

Converting columns into numerical and categorical data types on the AI & Analytics Engine
 

Feature Importance & Selection

In Python, we had to write code and draw a bar plot to identify the most important features.

The AI & Analytics Engine uses AI to predict which columns are not significant in helping the model learn, and hence suggests that the user remove the non-predictive columns.

Feature importance and selection on the AI & Analytics Engine

Click on the commit button to implement the suggestions.

Commit recipe actions on the AI & Analytics Engine

Click on the Finalize & End button to complete the data wrangling process.

Finalize and end on the AI & Analytics Engine

 

Visualize the Data

The platform generates easy-to-understand charts and histograms for both the numeric and categorical columns to help you better understand the data and its spread.

Data visualisation on the AI & Analytics Engine

 

Build a Decision Tree Model

Click on the + button and click on New App.

Data visualisation on the AI & Analytics Engine

 

Give your application a name and since it’s a classification problem, select Prediction and choose the target variable from the dropdown menu.

Naming your application on the AI & Analytics Engine

 

Click on the + button as shown below and select Train New Model.

Training a new model on the AI & Analytics Engine

Select Decision Tree Classifier from the list of algorithms. As you can see, the platform includes all the machine learning algorithms that you can use without writing even a single line of code.

Choosing a model on the AI & Analytics Engine

 

You can either select the Default configuration or the Advanced Configuration where you can tune your model by optimizing the hyperparameters.

Configuring the model on the AI & Analytics Engine

 

Below is the list of hyperparameters for the Decision Tree Classifier, the same ones we optimized using grid search in Python.

Decision tree classifier on the AI & Analytics Engine

Next, click on Train Model and the platform will start training your Decision Tree Classifier.

Our model has now been trained. It took about 5 minutes since we selected the advanced configuration in order to optimize the hyperparameters.

Decision tree classifier on the AI & Analytics Engine

Model Summary

Model summary on the AI & Analytics Engine

It’s now time to see how well the model has learned from the data. For this, the platform has built-in capabilities to generate evaluation outputs such as the confusion matrix, classification report, ROC curve, and PR curve.

Confusion Matrix

Confusion matrix on the AI & Analytics Engine

 

Precision, Recall & F1 Score

Precision, recall, and F1 score on the AI & Analytics Engine

 

ROC Curve

ROC curve on the AI & Analytics Engine

 

Precision-Recall Curve

Precision recall curve on the AI & Analytics Engine

 

Wrap-Up

In this blog, we implemented the decision tree classifier first in Python using the scikit-learn library and then on the AI & Analytics Engine. With Python, you need knowledge of data cleaning, data visualization, and machine learning to implement the algorithm. In contrast, the AI & Analytics Engine does everything for you with the click of a few buttons. What can take hours in Python takes only a few minutes on the AI & Analytics Engine.

Not sure where to start with machine learning? Reach out to us with your business problem, and we’ll get in touch with how the Engine can help you specifically.

Get in touch

 
