People tend to think of data scientists as these ultra-smart, superhuman beings who are always applying some out of the box machine learning algorithms to build sophisticated models that make things like self-driving cars possible.
You may be surprised to learn that data scientists could be spending only 20- 30% of their time doing that. 70-80% of the time might be spent on data cleaning and wrangling.
Yes, it may sound like an exaggeration but that is the reality. Your model is only as good as the data it receives. Hence, it is extremely important for data scientists to invest a huge portion of their time in cleaning and tidying up the data and formatting it to get the best possible results.
Data Cleansing- A Formal Definition
It’s always a nice idea to begin by defining the key terms that an article aims to address. So here’s a very simple and sweet way to understand what data cleansing means in the context of data science analytics:
“The process of identifying incorrect, irrelevant, incomplete parts of a dataset and then taking measures to ‘clean’ it so it is accurate, relevant, and consistent, and uniform which will help the model learn and predict better.”
The Steps & Techniques Involved in Cleaning Data
So what should you do to figure out if a dataset you are working on needs cleansing? There are a few standard data cleaning steps and techniques that you can follow to identify what needs to be done to tidy up your data and prepare it before you start building your model.
I will be using the following data to illustrate my point and formulate a data cleansing strategy:
1.Check for Missing Data
You should consider yourself extremely lucky if you get a dataset in real-life with no missing values. Since data collection is a cumbersome process that involves a high degree of human error, there will always be data missing in your dataset.
We have missing values in rows 1,2 and 3.
How to address the problem of missing data, you ask?
You can either drop the rows completely that have missing values but then there is a cost associated with that action. You may end up losing vital information that could have helped the model to learn better.
If your dataset has only a few missing observations, then this choice could be a reasonable approach but if your data has a lot of missing values, then removing them all would leave you with very little data to train your model.
The other option is to impute the missing values. This can be broken down into whether the observation is numeric or categorical. For numeric features, you can calculate the mean and the median value and use that for the missing values.
In our example, Age is a numeric feature with a missing value.
For categorical features, you find out the most frequently occurring value and use that to fill in the missing values.
Sex, City & Qualification are categorical variables with missing values in our example dataset
2.Check for Outliers
An outlier is a data point that is significantly different from the other data points.
We have outliers in the Age & Height columns.
Natalie has an age of 130 which when compared with other values appears quite strange. And we know that it is very difficult for a person to live that long to reach the age of 130.
Similarly, Cecilio has a height of 8.4 whereas all the others have it between the range of 5.1-5.6. Again, this appears to be an outlier as well.
However, in real-life problems, it can become difficult to detect outliers and you usually need some statistical or visualization technique to detect them, depending on whether the feature is numeric or categorical.
3.Check for Duplicate Data
Real-life datasets are usually built by merging data from various sources. When this happens, there is a high chance of an observation getting repeated.
Row 0 and Row 5 contain exactly the same information hence they are duplicates. Hence, we should keep only one of the two rows and drop the duplicate.
If an observation has more than one occurrence in a dataset, it is a duplicate and it should be removed so we are only feeding useful information to train our model. This will help to improve the overall efficiency of our model and result in better predictions.
4.Check for Inconsistent Data
There can be other issues with the data such as inconsistent usage of Capitalization, wrongly assigning a categorical column to be numeric, or vice versa, wrongly formatted addresses, etc. Thus, you need to standardize the data to make it consistent and uniform, or else the model.
Messy Data will Mess Up your Head
Data cleaning can be tedious and mundane. As a data scientist, you will usually be dealing with datasets with thousands of rows and hundreds of features. This would require patience and resilience. The above data has only six rows and just by looking at it might give you an unpleasant feeling.
Is there A WAY somebody could do this Data Cleansing for YOU? The answer is YESSS!!!
Automate the Data Cleaning Process
The AI & Analytics Engine lets you do data science without coding. Yes, you heard it right. Now you can get started with data science without requiring any programming skills. Awesome, isn’t it?
Our goal is to help data scientists focus on delivering results by automating the extremely important yet mundane tasks such as data cleaning.
I’ll show you how the platform does all the data cleansing for you as you do just a few clicks, sit back, and relax.
I have used a dataset that contains 1000 rows and 81 columns with a lot of missing values, inconsistent data types. The goal is to predict the price of a house given a bunch of features. It’s a regression problem since the target variable is continuous but we won’t go into the details of that here since my purpose is to illustrate how you can do data cleansing effortlessly using this platform.
Here is a preview of the data after uploading it on the platform:
Initially, all the columns are of the type “Text”.
On the right side, there is a Suggestions box. Our first step is to select the target column which in this case is the Sale Price.
Now, the FUN part..
The data cleansing tool in the platform automatically detects which columns need to be categorical and which should be numerical. It displays the suggestions and all you have to do is click on those suggestions.
Now, the columns have been successfully converted into the appropriate data types.
Impute the Missing Values
The data cleansing tool in the platform will automatically detect which columns have missing values and will recommend you to impute the missing values for those columns.
The missing values for the column LotFrontage have been displayed below:
Remove Irrelevant Data
The data cleansing tool in the platform automatically detects which features are important for the model and which features are redundant. In data science terminology, it’s called feature selection/ feature extraction/.
Our Cleansed Dataset & Wrap-Up
We started with a dataset that had incorrect column data types, a lot of missing values, and a bunch of features, most of which were irrelevant in making predictions. And with just a few clicks, we end up with a dataset with the correct data types, imputing missing values, and removal of irrelevant features.
While this would have taken a lot of time if you were to do this using R or Python, the AI & Analytics Engine Smart Data Preparation Feature does it for you in a matter of seconds.
Liked what you saw?