Let’s suppose you have a dataset to which you want to apply a machine learning algorithm. You check the data types of all the features and find that the data contains numbers as well as text.
Can you just go ahead and feed the data directly into a machine learning model? The answer is NO!
As humans, we can understand all kinds of data, be it numbers, strings, letters, or text of any kind. Machines, no matter how fast and intelligent they are, can only work with numbers. Categorical data are variables that contain label values rather than numeric values. They can be either nominal, such as gender (male, female), or ordinal, such as education level (Bachelor, Master, Ph.D.).
Hence, we need to convert textual data into numeric form so that machines can process it. This conversion is part of what is called pre-processing.
KNOCK KNOCK..... Who is it? One-Hot Encoding
A big part of pre-processing is encoding, a process where each piece of data is represented in a way that the computer can understand. While there are different ways of encoding, such as Label Encoding, we will focus on one-hot encoding. One-hot encoding is a technique for converting categorical variables into numeric variables.
The categorical variable is encoded as a vector where only one element is 'hot' or non-zero. With one-hot encoding, a categorical feature becomes an array whose size is the number of possible choices or categories for that feature.
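To make the idea of a one-hot vector concrete, here is a minimal sketch in plain Python. The category list and the `one_hot` helper are hypothetical names chosen for illustration only; they are not part of any library.

```python
# Hypothetical list of possible categories for one feature.
categories = ["red", "green", "blue"]

def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"Unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]

vector = one_hot("green", categories)
print(vector)
```

The vector has one slot per category, and exactly one slot is 'hot'.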
An Easy to Understand Example to Begin With
Who doesn’t like food? Let’s suppose we have a data frame with a Food column that consists of four food types: Chicken, Beef, Lamb, and Fish.
A machine learning model does not understand the difference between chicken and fish or between beef and lamb. Maybe we can assign an integer to every category?
We can use the pandas map function to replace each category with an integer. This is called Label Encoding.
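A minimal sketch of this label encoding step is shown below. The data frame and the specific integers in `food_to_int` are assumptions made for illustration; any arbitrary assignment would demonstrate the same idea.

```python
import pandas as pd

# Hypothetical data frame with the four food types from the example.
df = pd.DataFrame({"Food": ["Chicken", "Beef", "Lamb", "Fish", "Chicken"]})

# Assumed mapping: the integers are arbitrary labels, not a real ranking.
food_to_int = {"Chicken": 1, "Beef": 2, "Lamb": 3, "Fish": 4}

# Series.map looks up every value of the column in the dictionary.
df["Food_encoded"] = df["Food"].map(food_to_int)

print(df)
```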
Do you think there is a problem with the above approach? (Hint: YES, there is)
A machine learning model could conclude that since Fish has the highest integer value, it is the most important category, whereas Chicken is the least important. We know each food type holds the same significance, but assigning integers to categories in this way can confuse the algorithm. While label encoding is intuitive and straightforward, it has the disadvantage that the numeric values can be “misinterpreted” by algorithms as an ordering or a magnitude.
One-hot encoding resolves this issue by creating a column for every category and assigning a 1 if the category is present and 0 if it is absent.
By looking at the categories, we see that the first column corresponds to the category Beef, the second column to Chicken, the third to Fish, and the fourth to Lamb. In the first row of the encoded output, there is a 1 in the second column, which means the row refers to Chicken.
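The column layout described here can be reproduced with a short sketch using scikit-learn's `OneHotEncoder`. The data frame is a hypothetical stand-in for the Food example, and the sparse result is converted with `toarray()` so that it prints as a plain matrix.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data frame with one row per food type.
df = pd.DataFrame({"Food": ["Chicken", "Beef", "Lamb", "Fish"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(df[["Food"]]).toarray()

# The categories are stored in sorted order, one per output column.
print(encoder.categories_[0])
print(encoded)
```

Because the categories are sorted alphabetically, the row for Chicken has its 1 in the second column.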
Now that we have covered the basics, let’s apply our knowledge to a real-world data set.
Importing the Required Libraries
Reading the Dataset
The dataset has a total of 7043 rows and 20 columns.
Selecting the Categorical Columns
We have a total of 16 categorical columns in our dataset.
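One way to pick out categorical columns is to select every non-numeric column with `select_dtypes`. The sketch below uses a tiny hypothetical frame rather than the real dataset, and it assumes that every non-numeric column should be treated as categorical.

```python
import pandas as pd

# Tiny hypothetical frame standing in for the real dataset.
df = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C"],
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "PaymentMethod": ["Electronic check", "Mailed check", "Mailed check"],
})

# Keep every column that is not numeric (assumption: those are categorical).
categorical_columns = df.select_dtypes(exclude=["number"]).columns.tolist()

print(categorical_columns)
print(len(categorical_columns))
```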
Counting the Different Categories of a Categorical Column
A customer can pay for the services they use in four different ways: electronic check, mailed check, bank transfer, or credit card. Hence, there are four different categories present in the PaymentMethod column.
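Counting the categories of a column can be sketched with `value_counts` and `nunique`. The values below are hypothetical and shortened; the labels in the real dataset may be spelled differently.

```python
import pandas as pd

# Hypothetical sample of the PaymentMethod column.
payments = pd.Series(
    ["Electronic check", "Mailed check", "Bank transfer",
     "Credit card", "Electronic check", "Mailed check"],
    name="PaymentMethod",
)

# Number of rows per category.
counts = payments.value_counts()
# Number of distinct categories.
n_categories = payments.nunique()

print(counts)
print(n_categories)
```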
One-hot encoding will create a one-hot vector for each of the categories. What that means is that we will have four columns instead of one. Let’s see how this works:
Now, we convert the categorical ‘PaymentMethod’ column into numeric by applying one-hot encoding:
For each row in the dataset, we get a vector of four values where only one is a ‘1’ and the other three values are all ‘0’.
Which category does the ‘1’ refer to? The order of the categories is given by the following command:
Hence a vector [0,0,1,0] means the payment was made through an Electronic check since that is the third category as can be seen above.
Similarly, a vector [0,0,0,1] means the payment was made through a Mailed check.
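The steps above can be sketched with scikit-learn's `OneHotEncoder`: fit it on the column, read the category order from `categories_`, and map a vector back to its label with `inverse_transform`. The sample values are hypothetical and shortened, so the exact labels may differ from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of the PaymentMethod column.
df = pd.DataFrame({"PaymentMethod": [
    "Electronic check", "Mailed check", "Bank transfer",
    "Credit card", "Electronic check",
]})

encoder = OneHotEncoder()
# Dense array with one row per customer and one column per category.
encoded = encoder.fit_transform(df[["PaymentMethod"]]).toarray()

# Column order of the one-hot vectors (sorted category labels).
categories = encoder.categories_[0]
print(categories)

# Translate one-hot vectors back into their category labels.
decoded = encoder.inverse_transform(np.array([[0, 0, 1, 0], [0, 0, 0, 1]]))
print(decoded)
```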
As we stated earlier, one-hot encoding converts categorical variables into 0’s and 1’s which can be understood by a machine.
Applying One-Hot Encoding on all Categorical Columns
We first drop the customerID column since it does not provide a machine learning model with any useful information.
The resulting one-hot encoded dataframe:
As we can see above, all the categorical columns have been one-hot encoded into numeric columns which can now be fed into a machine learning model.
We had 15 columns that we had to one-hot encode and we end up with 41 columns. One-hot encoding generates one binary variable for each category. Consequently, if we have 15 columns and the total number of unique categories in all 15 columns equals 41, we would end up with 41 features.
Method 2: Dummy Variables
Another way to implement one-hot encoding in pandas is to use the pd.get_dummies() function. It is easier to use than the scikit-learn encoder shown above. It achieves the same purpose with fewer lines of code.
You just have to pass a data frame and specify the names of the columns that you want to one-hot encode.
The resulting data frame has both the numeric columns and the one-hot encoded categorical columns whose names you explicitly specified.
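A minimal sketch of `pd.get_dummies` on a hypothetical frame is shown below. Passing `dtype=int` is a choice made here so the new columns hold 0 and 1, since recent pandas versions produce boolean columns by default.

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column.
df = pd.DataFrame({
    "tenure": [1, 34, 2],
    "PaymentMethod": ["Electronic check", "Mailed check", "Electronic check"],
})

# Encode only the listed columns; the numeric column is left untouched.
dummies = pd.get_dummies(df, columns=["PaymentMethod"], dtype=int)

print(dummies)
```

The new column names are built from the original column name and the category, separated by an underscore.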
And that's how you deal with categorical data in Pandas!!!
Wish there was a Simpler Solution without any Code?
One-hot encoding can be confusing since you first have to extract the categorical variables, then transform them, and also keep track of each category. This requires an understanding of Python and the scikit-learn library. But what if we told you that there is a much simpler way to do this, one that takes only a few button clicks?
One-Hot Encoding using the AI & Analytics Engine
Once you upload your dataset, this is what the data preview looks like:
Simply click on the ADD ACTION option appearing on the right and it will display a list of all functions that you can perform to clean, wrangle and transform your data. You can type one hot encode in the search bar.
Simply select the columns that you want to one-hot encode. You can also give a name to the output column like we have given. Once you have selected all the columns, click on ADD.
The platform will do all the processing for you and return you a data frame that includes the one-hot encoded columns.
In this article, we explored a very important concept in machine learning: dealing with categorical data and translating it into numeric form so that machines can process it. The scikit-learn library in Python provides an effective one-hot encoder, which we used to convert the categorical columns in our dataset to numeric columns. The basic strategy is to turn each category value into a new column and assign it a 1 or a 0 depending on whether the value is present or absent. We then did the exact same thing on the AI & Analytics Engine, which does not require you to know any Python programming. It’s a no-code data science platform that lets data scientists automate their tasks and focus on delivering business value.