Data Preparation

Dealing with Complex and Nested Data


Often, datasets contain complex data types, such as nested JSON arrays. These values can be difficult to wrangle. This article presents some of the tools in the AI & Analytics Engine for working with them in a simple manner.

When datasets contain nested data, such as nested JSON arrays, it becomes more complex working with them “as is”. For example, when performing Exploratory Data Analysis (EDA), it may become difficult to examine required features that are nested deep within a JSON. Another example could be if we wanted to apply some predictive modeling over the data. To do so, we would first have to unnest these complex features and “flatten” our table in order to proceed with the modeling.

Due to the nature of the data, these tasks may require complex data wrangling actions. We will demonstrate here that handling nested JSON arrays and JSON objects is actually pretty simple and straightforward when using the AI & Analytics Engine's data wrangling feature. 

To show that, we’ll use the TMDB 5000 Movies  Dataset from Kaggle and the Engine data preparation feature.

Movie dataset analysis with AI & Analytics EngineAs can be seen, many columns such as genres, keywords, etc. actually contain an array of JSON objects. These columns will require some specialized handling.

For an easier way to handle and process text columns on the AI & Analytics Engine, check out this article!

 

What is the problem with complex or nested data?

Nested data adds complexity to the data wrangling process of a dataset. The reason is, that it is not easy to consume in its original form, in order to accomplish the data science tasks required for solving business problems.

For instance, let’s assume we want to create a predictive model. Toward that end, we would like to extract as much information as possible from our dataset. Thus, if our data contains nested JSON columns, we usually must unpack them in order to be able to use them.

Sometimes, it might even just be complicated to grasp the data at hand when you have complex JSON values within a single “cell” of a column in a dataset.

For example, a cell can contain the following value: (Modified example from here)

JSON nested data

This is for a single cell. Imagine if all cells in a specific column contain this data structure. At this point, it becomes very difficult to keep track of the data.

Thus, businesses need an efficient way to handle datasets that contain these types of nested data structures within cell values.

There are many approaches on how to deal with it. A few are:

  • Some platforms may allow the writing of custom code for unpacking. However, coding is cumbersome and excludes non-technical users.

  • Other platforms allow unpacking with a specialized UI during the data ingestion phase (only for JSON files). This method limits the other possible actions we might like to take, such as calculating the JSON array length or concatenating JSON columns.

Due to the above reasons (and more), the JSON manipulation in the AI & Analytics Engine occurs in the data wrangling phase and allows the user to wrangle the nested columns using the UI alone. (Bonus: No code writing necessary.)

JSON processing actions in the AI & Analytics Engine

To enable users to easily manipulate complex and nested data, the AI & Analytics Engine supplies numerous actions that handle JSON/nested JSON columns in its recipe editor. These include (among others):

  • Extracting individual attributes from JSON object columns.

  • Unpacking (exploding) arrays in a JSON array into multiple rows.

  • Manipulating/querying JSON arrays using formula actions such as: JSON_ARRAY_CONCAT, JSON_ARRAY_CONTAINS, JSON_ARRAY_LENGTH etc.

These actions allow us to extract the useful attributes from JSON objects, manipulate JSON columns into an ML “friendly” structure, get insights into the nested JSON data, etc.

A complex dataset processing workflow

As explained, one of the most common workflows is wrangling a dataset in order to create a predictive model. We will showcase how this can be done using the AI & Analytics Engine on the TMDB 5000 Movies dataset, which is a complex dataset containing nested values.

In our engine, it is performed using just a few actions:

1. Cast the required nested columns into JSON arrays.

2. Unpack these columns. (Results in JSON objects)

3. Cast the unpacked values to JSON objects.

4. Parse the JSON object columns. (Specifying the attributes to extract from the values in the JSON object column).

JSON Array Columns and Casting

JSON Array Columns - Unpacking and Casting

Parsed JSON values

The end result is that we have fully unnested all nested columns using just the UI alone. Recalling the intro, we stated that the unnesting actions could also be useful for other purposes, such as EDA.

For example, one might be interested in processing only movies that have at least some number of keywords in their description or compare the average ratings of movies in the comedy genre that have a spoken_language of English, compared to movies of the same genre in other languages, etc.

Using the JSON actions specifically, and the rest of the wrangling actions in the AI & Analytics Engine in general, the capacity for EDA with this dataset is basically endless.

Wrapping up

Complex datasets are common. They can be encountered across many business domains, such as marketing (Google Analytics), the motion picture industry (the example dataset in this article), and many more. Efficient, simple, and “codeless” JSON manipulation actions for wrangling these types of datasets are essential.
We have demonstrated parts of these capabilities using the AI & Analytics Engine.

Together with other wrangling actions in the platform and other product features, any user can enjoy an end-to-end workflow, from ingestion of data to deployments of predictive models, even for complex datasets, in a streamlined fashion and with ease.

Subscribe to get weekly updates on the Engine, and Machine Learning-related thought pieces!

Get in touch

 

 

Similar posts