PIEXCHANGE
 
 

Machine Learning Classification Algorithm: Decision Trees [Tutorial with & without code]

In this blog, we will implement one of the most common classification algorithms in machine learning: decision trees.

We will first implement it in Python using the scikit-learn library, which requires experience with Python as a scripting language. Then we will implement it on the AI & Analytics Engine, which requires no programming experience at all. The Engine is a drag-and-drop data science and machine learning platform for anyone, from business users to data scientists, who wants to quickly and seamlessly automate and accelerate insight extraction from data.

Don’t believe what you just read? You can get started with the platform without paying a dime.

About the Dataset

The dataset contains numerically encoded features such as job, employment history, housing, marital status, purpose, and savings, which are used to predict whether a customer will be able to repay a loan.

The target variable is ‘good_bad’. It consists of two categories: good and bad. Good means the customer will be able to repay the loan whereas bad means the customer will default.

It is a binary classification problem, meaning the target variable consists of only two classes.

This is a very common use case in banks and financial institutions as they have to decide whether to give a loan to a new customer based on how likely it is that the customer will be able to repay it.

Decision Trees: background knowledge


A decision tree is a supervised machine learning algorithm that breaks down a data set into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.

A decision tree can handle both categorical and numerical data. Decision trees are the building units of more advanced machine learning algorithms such as boosted decision trees and hence serve as a baseline model for comparison for tree-based algorithms.

Hyperparameters for Decision Tree

The performance of a machine learning model can be improved by tuning its hyperparameters. Hyperparameters are the parameters that the user has to set in advance. They are not learned from the data during training.

Some of the most common hyperparameters for a decision tree are:

criterion: this parameter determines how the impurity of a split will be measured. Common choices are ‘gini’ and ‘entropy’.

splitter: the strategy used to choose the split at each node. The default is ‘best’, meaning that for each node the algorithm evaluates the candidate features (all of them by default) and chooses the best split. If it is set to ‘random’, scikit-learn draws a random threshold for each candidate feature and chooses the best of those random splits. How many features are candidates at each node is controlled separately by the ‘max_features’ parameter.

max_depth: this determines how deep the tree can grow. The default is None, which lets the tree grow until every leaf is pure or holds fewer samples than min_samples_split, and this often results in overfitting. The max_depth parameter is one of the ways in which we can regularize the tree, or limit the way it grows, to prevent overfitting.

min_samples_split: the minimum number of samples a node must contain to be considered for splitting. The default is 2. This is another parameter that regularizes the decision tree.

max_features: the number of features to consider when looking for the best split. By default, the decision tree will consider all available features to make the best split.
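
To make the list above concrete, here is a minimal sketch of how these hyperparameters are passed to scikit-learn's DecisionTreeClassifier. The data comes from make_classification and the chosen values (max_depth=4, min_samples_split=10) are assumptions picked for illustration, not the settings used on the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 rows, 8 numeric features, 2 classes
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",        # impurity measure; "entropy" is the other common choice
    splitter="best",         # pick the best split among the candidate features
    max_depth=4,             # cap the depth to regularize the tree
    min_samples_split=10,    # a node needs at least 10 samples to be split
    max_features=None,       # None means every feature is a candidate
    random_state=0,
)
tree.fit(X, y)

# The fitted tree never exceeds the depth limit set above
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

Lowering max_depth or raising min_samples_split generally produces a smaller tree that is less likely to overfit.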

Python Implementation

Let’s start by implementing a decision tree classifier on the dataset using Jupyter Notebook.

Importing the Required Libraries

The following libraries are required to read the data, split it into training and test sets, build and evaluate the decision tree, optimize the hyperparameters using grid search, and plot the ROC and precision-recall curves.
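
A plausible set of imports for these steps is sketched below. The exact list is an assumption based on the steps named above; a plotting library such as matplotlib would also be imported in a notebook, but it is left out so the sketch depends only on pandas, NumPy, and scikit-learn.

```python
import numpy as np
import pandas as pd

# IterativeImputer is still experimental in scikit-learn and must be enabled first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_recall_curve,
    roc_curve,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
```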


Reading the Dataset


Viewing the Dataset

Let’s view the first three rows of our dataset.
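
Reading and previewing the data can be sketched as follows. The CSV text is a hypothetical stand-in with invented values and only a handful of the columns mentioned in this post; with a real file, the path to that file would be passed to pd.read_csv instead of the in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical sample rows (invented values), standing in for the real CSV file
csv_text = """duration,amount,age,purpose,good_bad
6,1169,67,3,good
48,5951,22,,bad
12,2096,49,6,good
42,7882,45,2,good
"""

# pd.read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# head(3) returns the first three rows
first_rows = df.head(3)
print(first_rows)
```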


What’s the Target/ Response/ Dependent Variable?

Let’s have a look at our target variable ‘good_bad’.


What are the dimensions of our dataset?


Our dataset consists of 1000 observations and 20 columns. Hence, we have 19 features and ‘good_bad’ is our target variable.

Getting a ‘feel’ for our Data

It’s always a good idea to explore your dataset in detail before building a model. Let’s see the data types of all our columns.


We have 18 columns of type integer, 1 column of type float, and 1 column (our target variable ‘good_bad’) of type object, meaning categorical.
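
The three checks above (target categories, dimensions, and data types) map onto short pandas calls. The sketch below uses a tiny hypothetical frame with invented values, so its shape and types will differ from the 1000 x 20 loan dataset.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame (invented values) with the same kinds of columns
df = pd.DataFrame({
    "duration": [6, 48, 12, 42],
    "amount": [1169, 5951, 2096, 7882],
    "purpose": [3.0, np.nan, 6.0, 2.0],
    "good_bad": ["good", "bad", "good", "good"],
})

print(df["good_bad"].unique())  # categories present in the target
print(df.shape)                 # (number of rows, number of columns)
print(df.dtypes)                # data type of every column
```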

Check for Missing Values

Always begin by identifying the missing values present in your dataset as they can hinder the performance of your model.


The only missing values we have are present in the ‘purpose’ column.
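
Counting missing values per column is a one-liner in pandas. The frame below is a hypothetical stand-in with invented values in which only ‘purpose’ has a gap.

```python
import numpy as np
import pandas as pd

# Hypothetical frame (invented values); only 'purpose' contains a missing entry
df = pd.DataFrame({
    "amount": [1169, 5951, 2096, 7882],
    "purpose": [3.0, np.nan, 6.0, 2.0],
    "good_bad": ["good", "bad", "good", "good"],
})

# isnull() marks missing cells, sum() counts them per column
missing_per_column = df.isnull().sum()

# Show only the columns that actually have missing values
print(missing_per_column[missing_per_column > 0])
```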

Check for Class Imbalance

When solving a classification problem, you should check that the classes in the target variable are reasonably balanced, i.e. that each class has a roughly equal number of examples. A dataset with imbalanced classes often results in models that have poor predictive performance, especially for the minority class. In many problems, the minority class is the more important one from a business perspective, so this issue needs to be taken care of while preparing the data.

An example would help us illustrate our point better.

There are two customers A and B. A has a good credit history and if given a loan would be able to repay the loan. However, our model predicts A to be a ‘bad’ customer, and hence he is refused a loan. This does hurt the bank since they lost a customer who would have been able to repay the loan and the bank could have made a profit.

Now let’s consider customer B. B has a bad credit history and files an application for a loan. In reality, B would not be able to repay the loan. However, the model predicts B to be a ‘good’ customer, and hence his loan application is approved. This is much more damaging for the bank since they would not be able to recover the loan.

These two examples illustrate why having balanced classes can be so important from a business perspective.


Although the number of examples for good is somewhat higher than for bad, it is not a severely imbalanced dataset and hence we can proceed with building our model. If, for example, we had 990 examples for good and only 10 examples for bad, our dataset would be highly skewed and we should balance the classes.
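
A quick way to quantify the balance is to count the labels and compute their shares. The labels below are hypothetical (7 good, 3 bad) and are not the counts from the loan dataset.

```python
import pandas as pd

# Hypothetical target column: 7 'good' and 3 'bad' labels
y = pd.Series(["good"] * 7 + ["bad"] * 3, name="good_bad")

counts = y.value_counts()                # absolute count per class
shares = y.value_counts(normalize=True)  # fraction of rows per class

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print("imbalance ratio:", round(imbalance_ratio, 2))
```

For the 990 versus 10 example mentioned above, the same ratio would be 99, which signals a heavily skewed dataset.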

Impute Missing Values

We begin by separating the features into numeric and categorical. The technique to impute missing values for numeric and categorical features is different.

For categorical features, the most frequent value occurring in the column is computed and the missing value is replaced with that.

For numeric features, there is a range of techniques, such as replacing missing values with the column mean or building a model on the known values and predicting the missing ones. The IterativeImputer from scikit-learn takes the model-based approach.
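
The two strategies can be sketched as follows, assuming scikit-learn's SimpleImputer for the categorical columns and IterativeImputer for the numeric ones. The frame is hypothetical (invented values, gaps placed in both ‘purpose’ and ‘housing’ so that both imputers have something to do).

```python
import numpy as np
import pandas as pd

# IterativeImputer is experimental and must be enabled before it is imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Hypothetical data with missing entries in one numeric and one text column
df = pd.DataFrame({
    "amount": [1169, 5951, 2096, 7882, 4870, 9055],
    "age": [67, 22, 49, 45, 53, 35],
    "purpose": [3.0, np.nan, 6.0, 2.0, np.nan, 6.0],
    "housing": pd.Series(["own", "own", np.nan, "rent", "own", "free"], dtype=object),
})

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

# Numeric columns: each gap is predicted from the other numeric columns
numeric_imputed = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index,
)

# Categorical columns: each gap is filled with the most frequent value
categorical_imputed = pd.DataFrame(
    SimpleImputer(strategy="most_frequent").fit_transform(df[categorical_cols]),
    columns=categorical_cols,
    index=df.index,
)

imputed = pd.concat([numeric_imputed, categorical_imputed], axis=1)
print(imputed.isnull().sum())
```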


The missing values have been imputed for the ‘purpose’ column. Now we have zero missing values in our dataset.

Separating into Features & Target Variable


X has all our features whereas y has our target variable.

Splitting into Training & Testing Data
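
Separating X and y and then splitting them can be sketched as below. The frame is hypothetical, and the 70/30 split, the stratification, and the random seed are assumptions, since the original settings are not stated in the text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame (invented values): 7 'good' and 3 'bad' customers
df = pd.DataFrame({
    "duration": [6, 48, 12, 42, 24, 36, 24, 36, 12, 30],
    "amount": [1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3059, 5234],
    "age": [67, 22, 49, 45, 53, 35, 53, 35, 61, 28],
    "good_bad": ["good", "bad", "good", "good", "bad",
                 "good", "good", "good", "good", "bad"],
})

# X holds the features, y holds the target
X = df.drop(columns="good_bad")
y = df["good_bad"]

# Hold out 30% of the rows for testing; stratify keeps the class mix similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```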


Fitting the Decision Tree Model


The decision tree model has been fit using the default values for the hyperparameters since we did not specify any hyperparameters yet. We will do that in the next section.
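
Fitting a tree with default hyperparameters looks roughly like this. The data is synthetic (make_classification, sized to mimic 1000 rows and 19 features) and is only a stand-in for the loan dataset; random_state is set so the sketch is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 1000 rows, 19 features, 2 classes
X, y = make_classification(
    n_samples=1000, n_features=19, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# No hyperparameters specified, so scikit-learn's defaults are used
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

params = clf.get_params()
print("criterion:", params["criterion"], "| max_depth:", params["max_depth"])
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))
```

Because max_depth defaults to None, the tree keeps splitting until it fits the training data, which is why the training accuracy is typically far higher than the test accuracy.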

Feature Importance & Feature Selection

Decision trees are widely used to do feature selection. When you have many features and you do not know which one to include, a good way is to use decision trees to assess feature importance even though you might be using a completely different machine learning algorithm for your model.


From the plot, we can see that the most important feature in predicting whether a customer will repay a loan is ‘amount’, followed by ‘age’, ‘purpose’, ‘duration’, and so on. The least important features, which contribute little to predicting the target variable, are ‘foreign’, ‘depends’, ‘coapp’, and ‘resident’.

To build a model with better predictive performance, we can remove the features that are not significant.
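
A fitted tree exposes its importances through the feature_importances_ attribute. The sketch below ranks them on synthetic data; the feature names and the 0.05 cut-off are hypothetical choices for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical feature names
feature_names = [f"feature_{i}" for i in range(8)]
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Importances are normalized to sum to 1; sort them from most to least important
importances = pd.Series(clf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)

# Keep only features above an arbitrary threshold (assumption: 0.05)
selected = importances[importances > 0.05].index.tolist()
print("selected:", selected)
```

The sorted Series can be drawn as a bar chart with any plotting library to obtain a figure like the one described above.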

Making Predictions on Test Data


Plotting the Confusion Matrix


Viewing the Evaluation Metrics
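
Predicting on the test set, building the confusion matrix, and printing the per-class metrics can be sketched together as follows, again on synthetic stand-in data rather than the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X, y = make_classification(
    n_samples=1000, n_features=19, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Predicted class for every row of the test set
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1 score for each class
print(classification_report(y_test, y_pred))
```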


Tuning the Model through Hyperparameter Optimization


Performing Grid Search on the Hyperparameters


Viewing the Best Hyperparameters
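
A grid search over the hyperparameters discussed earlier can be sketched with GridSearchCV. The grid values, the 5-fold cross-validation, and the F1 scoring are assumptions chosen for illustration, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X, y = make_classification(
    n_samples=1000, n_features=19, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical search grid over three of the hyperparameters described above
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 10, 20],
}

# Every combination is evaluated with 5-fold cross-validation on the training set
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="f1"
)
search.fit(X_train, y_train)

print("best hyperparameters:", search.best_params_)
print("best cross-validated F1:", search.best_score_)

# The best model, refit on the whole training set, is ready for predictions
best_tree = search.best_estimator_
```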


Making Predictions on Test Data with our Optimized Decision Tree Model


Plotting the ROC Curve


Plotting the Precision-Recall Curve
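
Both curves are built from predicted probabilities rather than predicted classes. The sketch below computes the points of each curve on synthetic data with an assumed max_depth of 5, and leaves the actual drawing to whichever plotting library you prefer.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and an assumed depth limit
X, y = make_classification(
    n_samples=1000, n_features=19, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

# Probability of the positive class for every test row
y_score = clf.predict_proba(X_test)[:, 1]

# ROC curve: false positive rate against true positive rate
fpr, tpr, _ = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

# Precision-recall curve: precision against recall
precision, recall, _ = precision_recall_curve(y_test, y_score)
avg_precision = average_precision_score(y_test, y_score)

print("ROC AUC:", round(auc, 3))
print("average precision:", round(avg_precision, 3))
```

With matplotlib, for example, plt.plot(fpr, tpr) draws the ROC curve and plt.plot(recall, precision) draws the precision-recall curve.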


Evaluation Metrics for the Optimized Decision Tree Model


If you compare these new metrics with the old ones before the hyperparameters were tuned, you would see a significant improvement in precision and recall.

BAM!

And we are done with the Python implementation. That was quite a bit of code that you had to write and understand. For someone who is new to the realm of data science and machine learning, it might feel like a lot of information to digest at first glance.

The fear of not knowing how to code should not stop you from implementing data science.

We have built an all-in-one platform that takes care of the entire machine learning lifecycle. From importing a dataset to cleaning and preprocessing it, from building models to visualizing the results and finally deploying them, you won’t have to write even a SINGLE line of code. The goal is to make machine learning and data science accessible to everyone!

AI & Analytics Engine Implementation

Upload the Dataset


Give your Recipe a Name


Preview the Data


Select the Target Variable


Convert Columns into Categorical & Numerical Datatypes

The platform will automatically tell you which columns should be converted to categorical and which ones should be converted to numeric.

All you need to do is click on the + button right next to the suggestion, go to the recipe tab and click the commit button.


Impute Missing Values

The platform automatically detects which columns have missing values and suggests imputing them.

As we saw earlier, the only column that has missing values is ‘purpose’.


Feature Importance & Selection

In Python, we had to write code and draw a bar plot to identify the most important features.

The AI & Analytics Engine uses AI to predict which columns are not significant in helping the model learn, and suggests that the user remove these non-predictive columns.


Click on the commit button to implement the suggestions.


Click on the Finalize & End button to complete the data wrangling process.


Visualize the Data

The platform generates easy-to-understand charts and histograms for both the numeric and the categorical columns to help you better understand the data and its spread.


Build a Decision Tree Model

Click on the + button and then on New App.


Give your application a name. Since it’s a classification problem, select Prediction and choose the target variable from the dropdown menu.


Click on the + button and select Train New Model.


Select Decision Tree Classifier from the list of algorithms. As you can see, the platform includes all the machine learning algorithms that you can use without writing even a single line of code.


You can either select the Default configuration or the Advanced Configuration where you can tune your model by optimizing the hyperparameters.


The Advanced Configuration lists the hyperparameters for the Decision Tree Classifier, the same ones we optimized using grid search in Python.


Next, click on Train Model and the platform will start the process of training your Decision Tree Classifier.

Our model has now been trained. It took about 5 minutes since we selected the advanced configuration in order to optimize the hyperparameters.


Model Summary


It’s now time to see how well the model has learned from the data. For this, the platform has built-in capabilities to generate evaluation metrics such as the confusion matrix, classification report, ROC curve, and PR curve.

Confusion Matrix


Precision, Recall & F1 Score


ROC Curve


Precision-Recall Curve


Wrap-Up

In this blog, we implemented the decision tree classifier first in Python using the scikit-learn library and then on the AI & Analytics Engine. With Python, you need to know about data cleaning, data visualization, and machine learning to implement the algorithm. In contrast, the AI & Analytics Engine does everything for you with the click of a few buttons. What can take hours in Python takes only a few minutes on the AI & Analytics Engine platform.


If you're after a more in-depth tutorial, you can check out our knowledge hub. Want to jump in and get started on data of your own? You can, right now.
