What is a time-based split?

This article outlines the concept of a time-based split.

A time-based split is a method for splitting ML-ready data into train and test sets. It differs from a random split because it uses time-index information to generate the splits as two consecutive (back-to-back) time periods. The earlier period is used for training, while the later period is used for testing.
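The idea can be sketched in a few lines of plain Python. This is only an illustration, not the Engine's implementation: the helper time_based_split, the cutoff date, and the churned column are hypothetical, and the rows are made-up sample data. Only the snapshot_time column name is borrowed from the example later in this article.

```python
from datetime import datetime

def time_based_split(rows, time_key, cutoff):
    """Split rows into (train, test) around a cutoff timestamp.

    Rows strictly before the cutoff go to train; rows at or after it go to test.
    """
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test

# Hypothetical sample data: one snapshot per week.
rows = [
    {"snapshot_time": datetime(2023, 1, 2), "churned": 0},
    {"snapshot_time": datetime(2023, 1, 9), "churned": 0},
    {"snapshot_time": datetime(2023, 1, 16), "churned": 1},
    {"snapshot_time": datetime(2023, 1, 23), "churned": 0},
]

train, test = time_based_split(rows, "snapshot_time", datetime(2023, 1, 16))
print(len(train), "training rows,", len(test), "test rows")
```

Every training row is earlier than every test row, which is the property that distinguishes this from a random split.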

Background

The main purpose of using a time-based train-test split is to simulate the real-world scenario in which you train a model on past data (e.g., customer behavior) and then use it to make predictions on future, unseen data on an ongoing, scheduled basis. This reflects the typical use case for time-based prediction problems, where the goal is to predict future values based on past observations. A real-world example is identifying potential customer churn on a fortnightly basis.

There are a few key reasons why a time-based train-test split is necessary in such scenarios:

Temporal dependency: In time-sensitive data, observations are ordered chronologically, and there is often temporal dependency present. The current value may depend on past values, or exhibit certain patterns over time. By using a time-based split, you ensure that your model is trained on past data, and evaluated on future data, imitating the temporal relationship in the real world.
Avoiding data leakage: Data leakage occurs when information from the future is mistakenly used during model training, leading to overly optimistic performance estimates. If you randomly shuffle the data or use a traditional random train-test split, you may introduce data leakage, because training rows can come from later in time than test rows and the model can learn patterns from the future that it shouldn't have access to during training.
Assessing generalization: The goal of machine learning is to build models that generalize well to unseen data. By using a time-based split, you can assess how well your model performs on future data that it has not seen during training. This evaluation gives you a more accurate estimate of the model's performance in real-world scenarios.
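The leakage point above can be made concrete with a small check. The sketch below is illustrative only: leaks_future is a hypothetical helper, the ten weekly snapshots are invented, and the 70/30 ratio is an arbitrary assumption. It compares a shuffled split with a chronological one and reports whether any training row is as late as, or later than, the earliest test row.

```python
import random
from datetime import datetime, timedelta

def leaks_future(train, test, time_key):
    """Return True if any training row is at or after the earliest test row."""
    return max(r[time_key] for r in train) >= min(r[time_key] for r in test)

# Ten hypothetical weekly snapshots.
start = datetime(2023, 1, 2)
rows = [{"snapshot_time": start + timedelta(weeks=i)} for i in range(10)]

# Random split: shuffle, then take 70% of the rows for training.
shuffled = rows[:]
random.Random(0).shuffle(shuffled)
random_train, random_test = shuffled[:7], shuffled[7:]

# Time-based split: the chronologically first 70% of the rows for training.
ordered = sorted(rows, key=lambda r: r["snapshot_time"])
time_train, time_test = ordered[:7], ordered[7:]

print("random split leaks future data:",
      leaks_future(random_train, random_test, "snapshot_time"))
print("time-based split leaks future data:",
      leaks_future(time_train, time_test, "snapshot_time"))
```

A shuffled split will usually mix later rows into the training set, whereas the chronological split cannot, by construction.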

Example train-test split on the Engine

The screenshot below shows how a typical time-based split works on the Engine. Given a column of datetime values (timestamps) in your dataset (snapshot_time in the example), you reserve the last X periods of data (in this case, 4 weeks) for evaluation and train the model on the rest of the data.
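Outside the Engine, the same "reserve the last X periods" rule could be approximated as follows. This is a sketch under assumptions, not a description of how the Engine computes the split: split_last_periods is a hypothetical helper, the periods are measured back from the latest timestamp in the data, and the boundary handling (rows exactly at the cutoff go to train) is a choice made here for illustration.

```python
from datetime import datetime, timedelta

def split_last_periods(rows, time_key, n_periods, period=timedelta(weeks=1)):
    """Reserve the last n_periods of data for testing; train on the rest.

    The cutoff is measured back from the latest timestamp in the data.
    Rows at or before the cutoff go to train, later rows go to test.
    """
    latest = max(r[time_key] for r in rows)
    cutoff = latest - n_periods * period
    train = [r for r in rows if r[time_key] <= cutoff]
    test = [r for r in rows if r[time_key] > cutoff]
    return train, test, cutoff

# Ten hypothetical weekly snapshots in a snapshot_time column.
start = datetime(2023, 1, 2)
rows = [{"snapshot_time": start + timedelta(weeks=i)} for i in range(10)]

# Hold out the last 4 weeks for evaluation, as in the example above.
train, test, cutoff = split_last_periods(rows, "snapshot_time", n_periods=4)
print("cutoff:", cutoff.date(), "| train rows:", len(train), "| test rows:", len(test))
```

With one snapshot per week, holding out 4 weeks leaves the four most recent snapshots for evaluation and the six earlier ones for training.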

Example definition of a time-based train-test split