Ever wanted a feature to aggregate values on one or multiple columns in a large dataset to be fed into an ML algorithm? With easy-to-use actions...
Join and Lookup: Working with Multiple Datasets on the Engine
Working with multiple datasets is a must in any serious real-world machine learning project. In particular, joining multiple tables together or looking up information from other tables are bread and butter tasks in data science.
In the article, you will learn how to do joins and lookups on the AI & Analytics Engine platform.
A key advantage of using the Engine’s data wrangling ability is that you can easily construct the actions by choosing the columns you want with autocompletion and drop down; so you won't have to remember the column names and switch between data view and action view to look up the names of the columns. Also, because everything is validated, you will not make mistakes like trying to join columns of different types (e.g., a common mistake is to try and join a text column with a numeric column, and getting a potentially head-scratching error message). The Engine protects you from committing such errors by doing the validations for you and providing easy-to-understand error messages for common mistakes.
Here we will demonstrate the Lookup and Join Actions
To illustrate the Lookup and Join actions: we will use these datasets:
A Band Members dataset containing famous bands and their members; and
A Band Member Instruments dataset containing the band member names and their primary instrument.
Looking up information in another table is one of the most common data operations. Consider the Band Members and Instruments datasets mentioned above; for each band member, we wish to look up the instrument they play, each member only plays one instrument, so it’s a Lookup action.
Here is how the lookup works:
Select the Lookup Action in the Add Action action catalogue.
Select the table you wish to look up information from, then click Done.
Choose the columns to add from the other dataset. In this, I wish to bring across the plays column only.
Sometimes, the column you want to add has a column name that clashes with an existing column in your data. In that case, you may want to use the prefix input boxes to disambiguate by giving the columns new column prefixes. At times, it’s perfectly ok to leave them empty. In this example, since no columns clash, you can safely leave them empty.
Sometimes, the lookup key may appear multiple times in the lookup dataset, i.e., there are multiple rows containing the same lookup key. In that case, the lookup action will only bring across one of those rows (this contrasts with the Join action covered below). So which row does it choose to bring across? You need to use the following options to order the dataset and to select either the first row or the last row according to that order.
Lookup Aggregation Action
Sometimes you may want to summarize the lookup dataset in some way before doing a Lookup. The Lookup Aggregation action can achieve this.
The benefit of using lookup is that the user need not process the secondary dataset separately. All the aggregations can be performed in one action; see Introducing Aggregation Functions: Identifying Patterns in Grouped Data if you want to learn more about data aggregation functions and what’s available on the Engine.
You can choose from the provided aggregation functions when you choose to use the Lookup Aggregation action.
The behaviour of the various types of Join actions should be familiar to anyone who has used SQL before. The four major join types are covered.
And the options are quite straightforward.
Again, the prefix options are available, if the column name clashes between the primary and secondary datasets.
In Summary, the AI & Analytics Engine has you covered for common operations like Joining and Lookup:
What does it do?
Things to remember
Lookup and return up to one row of values from the lookup dataset
Lookup can only return one row, so in the case where more than one row is available in the lookup from the dataset, the user has to choose whether to retain the first or last row ordered by some column chosen by the user
Aggregate the lookup dataset before performing a lookup
Aggregation is performed in one action; no need to prepare the secondary dataset separately.
Left Join, Right Join, Inner Join, etc
The four main joins: Left, Right, Inner, and Outer, are covered.
Subscribe for more articles on how to use the Engine in your Machine Learning projects!