Data Preparation

Join and Lookup: Working with Multiple Datasets on the Engine


Working with multiple datasets is a must in any serious real-world machine learning project. In particular, joining multiple tables together or looking up information from other tables are bread and butter tasks in data science.

Introduction

In the article, you will learn how to do joins and lookups on the AI & Analytics Engine platform.

A key advantage of using the Engine’s data wrangling ability is that you can easily construct the actions by choosing the columns you want with autocompletion and drop down; so you won't have to remember the column names and switch between data view and action view to look up the names of the columns. Also, because everything is validated, you will not make mistakes like trying to join columns of different types (e.g., a common mistake is to try and join a text column with a numeric column, and getting a potentially head-scratching error message). The Engine protects you from committing such errors by doing the validations for you and providing easy-to-understand error messages for common mistakes.

Here we will demonstrate the Lookup and Join Actions

Datasets Used

To illustrate the Lookup and Join actions: we will use these datasets:

  • A Band Members dataset containing famous bands and their members; and

A Band Members dataset containing famous bands and their members

  • A Band Member Instruments dataset containing the band member names and their primary instrument.

A Band Member Instruments dataset containing the band member names and their primary instrument

Lookup Action

Looking up information in another table is one of the most common data operations. Consider the Band Members and Instruments datasets mentioned above; for each band member, we wish to look up the instrument they play, each member only plays one instrument, so it’s a Lookup action.

Here is how the lookup works:

  1. Select the Lookup Action in the Add Action action catalogue.

    Lookup Action in the Add Action action catalogue.
  2. Select the table you wish to look up information from, then click Done. Datasets to join action

  3. Choose the columns to add from the other dataset. In this, I wish to bring across the plays column only.Columns to add from other datasets

  4. Sometimes, the column you want to add has a column name that clashes with an existing column in your data. In that case, you may want to use the prefix input boxes to disambiguate by giving the columns new column prefixes. At times, it’s perfectly ok to leave them empty. In this example, since no columns clash, you can safely leave them empty.prefix input boxes to disambiguate by giving the columns new column prefixes

  5. Sometimes, the lookup key may appear multiple times in the lookup dataset, i.e., there are multiple rows containing the same lookup key. In that case, the lookup action will only bring across one of those rows (this contrasts with the Join action covered below). So which row does it choose to bring across? You need to use the following options to order the dataset and to select either the first row or the last row according to that order.Column to order dataset by

     

Lookup Aggregation Action

Sometimes you may want to summarize the lookup dataset in some way before doing a Lookup. The Lookup Aggregation action can achieve this.

Lookup aggregation action on the AI & Analytics Enginee

The benefit of using lookup is that the user need not process the secondary dataset separately. All the aggregations can be performed in one action; see Introducing Aggregation Functions: Identifying Patterns in Grouped Data if you want to learn more about data aggregation functions and what’s available on the Engine.

You can choose from the provided aggregation functions when you choose to use the Lookup Aggregation action.

Lookup aggregation actions options

Join Action

The behaviour of the various types of Join actions should be familiar to anyone who has used SQL before. The four major join types are covered.

4 major join type actions

And the options are quite straightforward.

Join type action options

Again, the prefix options are available, if the column name clashes between the primary and secondary datasets.

Wrap-up:

In Summary, the AI & Analytics Engine has you covered for common operations like Joining and Lookup:

Action

 

What does it do?

 

Things to remember

 

Lookup

Lookup and return up to one row of values from the lookup dataset

Lookup can only return one row, so in the case where more than one row is available in the lookup from the dataset, the user has to choose whether to retain the first or last row ordered by some column chosen by the user

Lookup aggregate

Aggregate the lookup dataset before performing a lookup

Aggregation is performed in one action; no need to prepare the secondary dataset separately.

Join

Left Join, Right Join, Inner Join, etc

The four main joins: Left, Right, Inner, and Outer, are covered.

 

Subscribe for more articles on how to use the Engine in your Machine Learning projects!

Subscribe

Similar posts