Data Preparation

Introducing Aggregation Functions: Identify Patterns in Grouped Data


Ever wanted a feature to aggregate values on one or multiple columns in a large dataset, to be fed into a machine learning algorithm? Or summarizing values within a certain window? To solve this problem, we have introduced convenient and easy-to-use actions based on aggregation functions.

Many real-world datasets contain categories that data can be grouped into. In such cases, data aggregation operations are an important aspect of data wrangling and can make analyzing large datasets and modeling them using machine learning models more accurate. Without them, both the model and the user can misinterpret the data. However, writing code to perform these actions requires specific coding skills and can be very exhausting. 

Analyzing grouped data using the AI & Analytics Engine

How do you feel about having a platform that performs these operations in a few clicks? The AI & Analytics Engine includes three data wrangling actions for such operations. 

  1. Aggregate Columns within groups

  2. Reshape dataset into a pivot table

  3. Compute window functions

To facilitate each of these actions,  a rich set of aggregation functions is also available.

Step by step guide

We will use the dataset here that contains daily item sales data of 60 shops, to demonstrate the functionality of these actions.

First, we will upload the dataset to the Engine and do some pre-processing (including creating a new column total_sales) to make the dataset usable.

After pre-processing, the dataset looks like this:

Pre-processed dataset with total sales

Example 1: Aggregate Columns within groups

First, Let’s calculate the daily total sales for each shop using the “Aggregate Columns within groups” action. 

To do so, we select the action from the action list.

Aggregate Columns within Groups Action

Then we simply use date and shop_id as “group by” columns and total_sales as the aggregate column with “Sum (Numeric)” as the aggregation function and add them to the recipe.

Aggregate columns action

The Engine does all the calculations and returns the aggregated dataset with “group by” columns and an aggregated column: total_sales_per_shop. All unwanted columns in the original dataset like data_block_num, item_id, item_price, item_cnt_day are dropped.

Aggregate sales dataset

It is also possible to merge the new columns to the original dataset instead of dropping other columns with the Merge Aggregation Back” switch.

In general, we can add multiple “group by” columns and multiple “aggregate” columns with the same or different aggregation functions.

Example 2: Reshape dataset into a pivot table

As a second example, we are interested in calculating the total sales per shop in a different manner, where the aggregated value per shop is shown in a different column for each shop. We can do that by applying the “Reshape dataset into a pivot table” action.

Reshape dataset into a pivot table

Simply use date as the group by column, shop_id as pivot column, and total_sales as an aggregate column with Sum (Numeric) as the aggregation function.

Reshape dataset into a pivot table action

The output will then look like this:

Reshape dataset into a pivot table - output

The output contains a column group total_sales_shop (of 60 columns) where each column in the group shows the total sales for a particular shop.

Example 3: Compute window functions

In this example, we are interested in the total for the past 10 days sales per shop on each day. We use the “Compute window functions” action for that. 

Compute window functions action

The figure below is a simple illustration of a window frame where the window function calculations occur.

Window function calculations

We will use the dataset produced by the “Aggregate Columns within groups” action (not the original dataset) for this example. 

In the “Compute window functions” action UI, we set shop_id as the partition by column as we are calculating the total sales for each shop. the date is set as the order by column in which we will calculate the window. We aggregate total_sales_per_shop using sum.

We further need to set the optional parameter to specify the window. We set “Window start” and “Windows end” to “Position relative to current row” with “Start position” and “End position” based on the number of rows. Both start and end positions are inclusive of the given row.

Compute window functions action

The resulting data frame will now have a new column with total sales for each shop for the past 10 days.

Compute window functions - output

To support these three actions, the Engine provides a wide range of numeric, text, date-time, boolean, JSON Object, and JSON array aggregation functions found here.

 

Wrap-up

In this article, we demonstrated the usefulness of operations that calculate windowed and grouped values. We also showed how these calculations could be performed using a non-coding platform, the AI & Analytics Engine, which required no prior coding skills, empowering technical and non-technical people to make efficient data-driven decisions.

 

Subscribe to get updates on the latest articles and learn how to get the most out of the Engine! 

Subscribe

 

Similar posts