What is a Machine Learning-ready Dataset?

An ML-ready dataset is one that an ML algorithm can consume directly. Raw data becomes ML-ready through data preparation, which transforms it into a form the algorithm can work with.

The goal of the data wrangling (or data preparation) step is to bring a tabular dataset into an ML-ready state. A dataset is considered ML-ready if it satisfies the following conditions:

  • For supervised models, there must be a target column (e.g. whether or not a customer defaulted on a home loan). Each row must contain the target value that the model is meant to predict.

  • There is a set of columns, called features, that serve as the model inputs; in supervised learning they are used to predict the target. Examples of features include monthly income, age, and occupation.
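The two conditions above can be sketched in pandas. This is a minimal illustration, not a prescribed workflow: the `loans` table, its values, and the column name `defaulted` are hypothetical, chosen to mirror the home-loan example.

```python
import pandas as pd

# Hypothetical loan data: three feature columns plus one target column.
loans = pd.DataFrame({
    "monthly_income": [4200, 3100, 5800],
    "age": [35, 28, 47],
    "occupation": ["engineer", "teacher", "nurse"],
    "defaulted": [0, 1, 0],  # target: 1 = defaulted on the home loan
})

target_col = "defaulted"  # assumed name of the target column

# Condition 1: every row must carry a value for the target.
if loans[target_col].isna().any():
    raise ValueError("Some rows are missing the target value")

# Condition 2: the remaining columns are the features used to predict it.
X = loans.drop(columns=[target_col])
y = loans[target_col]
```

Here `X` holds the features and `y` the target, which is the split most supervised learning libraries expect.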

Additionally, an ML-ready dataset should follow the tidy data principles, which state that:

  1. Each variable must have its own column.

  2. Each observation must have its own row.

  3. Each value must have its own cell.
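As a small illustration of the third principle, the sketch below splits a cell that packs two values into separate columns. The `untidy` table, its contents, and the "/" separator are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical untidy table: one cell holds two values (age and occupation).
untidy = pd.DataFrame({
    "customer_id": [1, 2],
    "age_occupation": ["34/engineer", "51/teacher"],
})

# Split the combined cell so that each value gets its own cell
# and each variable gets its own column.
parts = untidy["age_occupation"].str.split("/", expand=True)
tidy = untidy.drop(columns="age_occupation").assign(
    age=parts[0].astype(int),
    occupation=parts[1],
)
```

After the split, `tidy` has one column per variable (`customer_id`, `age`, `occupation`) and still one row per customer.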

Dataset Example:

Suppose we have a dataset containing several months' worth of transactions from a cohort of customers, with each transaction stored in its own row. Using this dataset, we want to predict whether a customer's transactions will exceed a total of $5000 in the following month.

Not ML-Ready:

  • The data corresponding to each customer is spread out over several rows.

  • At prediction time, the model cannot use the raw transactions directly. Each customer's transaction history must first be aggregated into features (such as the average spend on different days of the week or on different types of items) before a prediction can be made.

  • The target column does not exist yet; it has to be created by aggregating each customer's transactions over the following month.

ML-Ready:

  • The dataset is organized by customer, with each row containing information on one unique customer.

  • The dataset contains the target column and input columns such as age, gender, the average spend on different days of the week, average spend on different types of items, frequency of spend, etc.
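The move from the transaction-level table to the customer-level table can be sketched with a pandas group-by. This is one possible approach under stated assumptions: the `transactions` data, the column names, and the helper `build_ml_ready` are hypothetical, and only a few simple aggregate features are computed.

```python
import pandas as pd

# Hypothetical transaction log: one row per transaction (not ML-ready).
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-03-03",
        "2023-01-20", "2023-02-15", "2023-03-12",
    ]),
    "amount": [1200.0, 800.0, 6000.0, 300.0, 450.0, 700.0],
})

def build_ml_ready(tx, target_month, threshold=5000.0):
    """Aggregate transactions into one row per customer (hypothetical helper)."""
    month = tx["date"].dt.to_period("M")
    target_period = pd.Period(target_month, freq="M")

    # Features may only use transactions from before the month being predicted.
    history = tx[month < target_period]
    future = tx[month == target_period]

    features = history.groupby("customer_id")["amount"].agg(
        avg_spend="mean", total_spend="sum", n_transactions="count"
    )

    # Target: did the customer's total spend in the target month exceed the threshold?
    future_total = future.groupby("customer_id")["amount"].sum()
    features["target"] = (
        future_total.reindex(features.index, fill_value=0.0) > threshold
    )
    return features.reset_index()

ml_ready = build_ml_ready(transactions, "2023-03")
```

Each row of `ml_ready` now describes one customer, with aggregated features and a boolean target. Keeping the feature window strictly before the target month avoids leaking information from the period being predicted.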