Data Preparation

A User-Friendly and Improved Way to Prepare Time-Series Data


Do you often find preparing time-series data from raw sources complex and unmanageable? See how the latest release of our Engine can make this process accessible for you.

Time-series analysis and prediction often require data that are regularly spaced in time. Many real-world datasets for which time-series analysis is desired, however, consist of records of events that occur at arbitrary and irregular intervals. Common examples are datasets of financial transactions, order deliveries, or guest check-ins at a hotel. Such datasets of irregular events therefore need to be processed into regularly spaced time series. The technique experienced data scientists typically use for this combines two steps, 1) resampling and 2) aggregation, as shown below:

Resampling with Aggregation

Resampling converts the time index to a chosen interval unit and groups the dataset into an ordered sequence of time bins. Aggregation collects all the events that fall within each bin and typically reduces them to a single numeric value. The downside is that this process can get quite complex and unmanageable for datasets with many variables.
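As a rough sketch of these two steps outside the Engine, here is how resampling and aggregation might look in pandas on a small made-up event log (the column names and data are illustrative, not from the actual dataset):

```python
import pandas as pd

# Hypothetical event log with irregular timestamps (illustrative data)
events = pd.DataFrame(
    {
        "Time": pd.to_datetime(
            ["2024-01-01 00:03:10", "2024-01-01 00:17:45",
             "2024-01-01 00:17:45", "2024-01-01 01:41:02"]
        ),
        "URL": ["/a", "/b", "/b", "/a"],
    }
)

# Step 1: resampling -- group events into fixed 1-hour bins on the time index.
# Step 2: aggregation -- collapse each bin's events into one value,
# here the number of distinct URLs per hour.
hourly = events.set_index("Time").resample("1h")["URL"].nunique()
print(hourly)
```

With several columns, each needing its own interval choice and aggregation function, this hand-written approach quickly grows into the complex pipeline the article describes.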

Here you will see how the AI & Analytics Engine can make this process simple for you with its AI-guided data wrangling feature. In the latest release, the Engine introduces a new recipe action built specifically for preparing a time series from irregular data. It aims to make this complex task user-friendly while still offering the full set of functionalities. With this, you can accomplish many time-series data-preparation tasks of high complexity in a single step.

Data

To illustrate the Engine’s new time-series preparation action, let us upload the Online Judge Server Log dataset from Kaggle to the Engine and start a recipe. After some initial processing, it looks as shown below:

Processed dataset

Suppose that our objective is to explore the number of distinct URLs every hour. To achieve this, we need to deal with the following three issues:

  • There are some rows with duplicates in the time index but different values in the other columns (see Rows 6 and 7);

  • The time-index column is unnecessarily fine-grained (with precision in seconds);

  • The time-index column is not equally spaced. As seen below, the time intervals (indicated by Time_interval) vary widely (where Time_next is obtained by shifting Time forward by one step).

Time interval variations
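The Time_next and Time_interval columns described above can be reproduced with a simple shift; here is a pandas sketch on made-up timestamps (the column names follow the text, the values are illustrative):

```python
import pandas as pd

# Illustrative irregular timestamps (made-up values)
df = pd.DataFrame(
    {"Time": pd.to_datetime(
        ["2024-01-01 00:00:05", "2024-01-01 00:00:07",
         "2024-01-01 00:12:30", "2024-01-01 02:45:00"]
    )}
)

# Time_next: each row paired with the timestamp of the following event;
# Time_interval: the gap to the next event, which varies widely here
# (2 seconds, then about 12 minutes, then over 2 hours).
df["Time_next"] = df["Time"].shift(-1)
df["Time_interval"] = df["Time_next"] - df["Time"]
print(df)
```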

Resampling Time Series

To resolve these issues, we need to reset the time-index column with an equally-spaced interval at an appropriate precision level. To achieve this, the Engine offers the new Resampling Data into a Regular Time Series action in the recipe-editor catalogue.

We can choose the aggregation function as “Approximate Count Distinct” (Note: counting the distinct elements is approximated using HyperLogLog++, a variant of HyperLogLog, for computational efficiency on large datasets) and simply provide the information required by the UI as follows:

Approximate count distinct - Aggregation function

In general, we can conveniently set the time interval to any value, say 15 minutes. We can also select a separate aggregation function for each target column, from a comprehensive list of functions (read this article for more information on the data aggregation functions provided by the Engine).
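To make the idea of per-column aggregations at a chosen interval concrete, here is a rough pandas equivalent (the column names and values are made up, and pandas' exact nunique merely stands in for the Engine's HyperLogLog++-based approximate distinct count):

```python
import pandas as pd

# Made-up server-log-style events; column names are illustrative
events = pd.DataFrame(
    {
        "Time": pd.to_datetime(
            ["2024-01-01 00:02:00", "2024-01-01 00:09:00",
             "2024-01-01 00:20:00", "2024-01-01 00:22:00"]
        ),
        "URL": ["/home", "/login", "/home", "/about"],
        "bytes": [512, 128, 512, 2048],
    }
)

# A 15-minute interval with a separate aggregation per target column:
# distinct-URL count for one column, a sum for another.
result = (
    events.set_index("Time")
    .resample("15min")
    .agg({"URL": "nunique", "bytes": "sum"})
)
print(result)
```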

Once the action is added, the preview on the left panel is automatically updated, along with a short summary of the list of queued actions. As you can see from the preview, the dataset is now a regular time series of 1-hour intervals.

Processed dataset with time-series of 1-hour intervals

Once the recipe is finalized, we can use the processed dataset for a broad range of analytical and modelling methods. One such method is STL decomposition, which splits a time series into seasonal, trend, and remainder components. For datasets processed into a regular time series, the Engine computes this decomposition automatically and shows it on the dataset’s page. For this dataset, we can clearly see a visualisation of trend and seasonality as shown below:

Data visualisation of trend and seasonality

Wrap-up

As seen above, the Engine offers a convenient way to prepare time-series data with a single recipe action. It also supports a wide range of configurations, through a flexible choice of sampling interval and aggregation functions. With this, we can easily prepare our time-series data in whatever way fits a particular application, which is much simpler than the complex sequence of steps typically required. With the processed time series, we can then proceed with most analytical methods and forecasting models.


Explore the Engine and its functionalities with a free trial!
