Data Cleaning

Why We Convert or Cast Data Types


A data type is one of the fundamental concepts that any computer science student would learn at the beginning of their computer programming education.

To define it simply, a data type is an attribute that tells the compiler or interpreter how the programmer intends to use the data. Each data type defines what the data means, what operations may be performed on it, and how it is stored.
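To make this concrete, here is a minimal Python sketch (the variable and function names are invented for illustration) showing that the same operator behaves differently, or is not defined at all, depending on the data type of its operands:

```python
# The same "+" operator means different things for different data types.
as_numbers = 1 + 2        # integers: arithmetic addition
as_text = "1" + "2"       # strings: concatenation


def try_mixed_add():
    """Adding an int to a str is not defined, so Python raises TypeError."""
    try:
        return 1 + "2"
    except TypeError as exc:
        return f"TypeError: {exc}"


print(as_numbers)         # 3
print(as_text)            # 12
print(try_mixed_add())    # a message starting with "TypeError:"
```

The values look alike on screen, but the type decides which operations are valid.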

There are hundreds, if not thousands, of programming languages in existence, and new ones continue to appear. Almost every one of them has an explicit notion of a data type, although the terminology may differ between languages. Common data types include integers, floating-point numbers, characters, strings, and Booleans.

Well, why does any of this matter to you? Great question!

When you import data into the AI and Analytics Engine (check out Tutorial 2/4 from the AI and Analytics Engine tutorial series for more information on how to import data), the data type of each feature (column) is inferred automatically. This means each feature is constrained to certain operations depending on how the Engine stores it. If an incorrectly inferred type is not addressed, it may cause issues down the line when we attempt some feature engineering. Therefore, casting data to specific types is an important step in the data cleaning and data wrangling process.
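The Engine's own inference logic isn't shown in this article, but the same idea can be sketched with the pandas library, which also guesses a type for each column on import. The CSV contents and column names below are hypothetical, made up purely for this illustration:

```python
import io

import pandas as pd

# Hypothetical CSV contents, invented for this illustration.
csv_text = """customer_id,gender,signup_date,spend
101,1,2021-03-01,25.50
102,0,2021-03-04,13.00
103,1,2021-03-09,47.25
"""

# pandas infers a dtype for every column while reading the file.
df = pd.read_csv(io.StringIO(csv_text))
inferred = {column: str(dtype) for column, dtype in df.dtypes.items()}
print(inferred)
```

Notice that the gender column is inferred as an integer and the date column is not parsed as a datetime unless we ask for it, so both need attention before any feature engineering.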

Say, for example, we have a feature named gender, where 1 denotes male and 0 denotes female. The Engine would read this feature as numeric, but the values are really category labels. Going a step further, if we had a list of features and the goal was to predict a person's gender from them, the Engine would interpret this as a regression-style supervised learning problem (because the gender feature is stored as a numeric data type) when it is really a classification problem. This doesn't make sense: if our regression model outputs 0.5, does that mean the instance is half-man/half-woman?
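Here is a rough pandas sketch of the gender example (this is pandas, not the Engine's API, and the data is invented). While the column is numeric, averaging it is allowed even though the result is meaningless; once it is cast to a categorical type, that operation is no longer defined:

```python
import pandas as pd

# Hypothetical data: 1 denotes male and 0 denotes female.
gender = pd.Series([1, 0, 0, 1, 1], name="gender")

# As a numeric column, arithmetic is permitted but meaningless here.
numeric_mean = gender.mean()

# Cast to categorical and give the codes readable labels.
gender_cat = gender.astype("category").cat.rename_categories(
    {0: "female", 1: "male"}
)

# A categorical column does not support arithmetic summaries.
try:
    gender_cat.mean()
    mean_allowed = True
except TypeError:
    mean_allowed = False

print(numeric_mean)
print(gender_cat.cat.categories.tolist())  # ['female', 'male']
print(mean_allowed)                        # False
```

With the column stored as a category, a modelling tool has the information it needs to treat the prediction task as classification rather than regression.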

Cast & Convert 

Structured Query Language (SQL) is a domain-specific language used in programming and designed for managing data held in a relational database management system, or for stream processing in a relational data stream management system [Source: Wikipedia]. In SQL, CAST and CONVERT are both used to change data from one data type to another. Seasoned SQL programmers would know that there are some slight differences between these functions, but discussing them is beyond the scope of this article and, to be honest, not very important in my opinion.
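For a quick feel of what CAST does, here is a small sketch using Python's built-in sqlite3 module. The table and column names are hypothetical. Only CAST is shown because SQLite, used here purely for convenience, does not provide a CONVERT function:

```python
import sqlite3

# In-memory database with a hypothetical table that stores ages as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (age_text TEXT)")
conn.executemany("INSERT INTO people VALUES (?)", [("31",), ("45",)])

# CAST changes the text values into integers inside the query.
rows = conn.execute(
    "SELECT age_text, CAST(age_text AS INTEGER) FROM people ORDER BY rowid"
).fetchall()
conn.close()

print(rows)  # [('31', 31), ('45', 45)]
```

Each row comes back with the original text value alongside its integer counterpart.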

The AI and Analytics Engine provides us with some very useful functions to convert our data to different data types: 

  • Cast columns to categorical type
  • Cast categorical columns to text type 
  • Cast columns to datetime type
  • Cast columns to numeric type 
  • Convert numeric column to datetime

The names of these functions are self-explanatory, which makes them easy to follow regardless of one's technical background.
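If you prefer to see the five conversions as code, the sketch below shows rough pandas equivalents. To be clear, these are pandas calls, not the Engine's functions, and the column names and values are invented for illustration:

```python
import pandas as pd

# Hypothetical data in which every column starts out with a less useful type.
df = pd.DataFrame({
    "grade": ["a", "b", "a"],
    "joined": ["2021-03-01", "2021-03-04", "2021-03-09"],
    "spend": ["25.5", "13", "47.25"],
    "epoch_seconds": [1614556800, 1614816000, 1615248000],
})

# Cast a column to categorical type.
df["grade"] = df["grade"].astype("category")

# Cast a categorical column to text type.
grade_text = df["grade"].astype(str)

# Cast a text column to datetime type.
df["joined"] = pd.to_datetime(df["joined"])

# Cast a text column to numeric type.
df["spend"] = pd.to_numeric(df["spend"])

# Convert a numeric column (seconds since the Unix epoch) to datetime.
df["event_time"] = pd.to_datetime(df["epoch_seconds"], unit="s")

print(df.dtypes)
```

In this made-up dataset, the epoch values were chosen to match the dates in the joined column, so the last two conversions produce the same timestamps.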

Learn more about data cleaning here! Or if you want to know how to speed up your current process of attaining clean data, then this article is perfect for you. 

Wrap Up 

In this article, we learned that a data type tells the compiler or interpreter how a practitioner wants to use the data. We also learned about the CONVERT and CAST functions, and why it's important to cast our features to the correct data type so that the functions we apply later can process our data correctly.

Be sure to book a demo to find out how the team at PI.EXCHANGE can help you get your project over the line. 
