Data Preparation

Processing Text Columns Made Easy: Splitting & Pattern Extraction

Do you spend hours preparing tabular data by splitting and extracting patterns from text columns? Five new recipe actions introduced in the new version of the AI & Analytics Engine now enable you to perform such tasks easily and efficiently.

Real-world tabular datasets often contain structured data embedded and hidden in text columns. Preparing this kind of data requires a combination of steps, such as splitting the text into multiple columns, rows, or arrays, extracting patterns, and capturing parts of it into multiple columns. This is often one of the most tiring and mundane tasks in data preparation. Data analysts and scientists solve such problems by writing custom code each time, spending hours in trial and error.

As the Data Science team at PI.EXCHANGE, one of the tasks we are entrusted with when building features for the AI & Analytics Engine, is to add new agents and actions into the Smart Data Preparation feature. This accelerates data preparation for our clients in a user-friendly, efficient, repeatable, and traceable way. In our recent release, we have added five new recipe actions for solving the aforementioned problem, allowing users to easily extract structured data from text columns, by splitting and pattern matching.

In this article, we will show you how you can use these new actions in your recipe. We will illustrate their use with typical problems occurring in real datasets. 


Multiple attributes stored in a single column

Individual attributes are often joined together with a delimiter, into a complex attribute, such as the “properties” column in a retail dataset, containing the type, the colour, and the size of each item:

Complex Attributes

One can extract these attributes into separate “item_type”, “colour”, and “size” columns using the "Split by delimiter into multiple columns" action, using the following simple steps:

  1. Search for the action named “Split…” in the “Add Action” panel of the recipe editor, and click on the action “Split by delimiter into multiple columns”. This takes you to the action editor.

  2. In the Action Editor, specify the following details:

    a. Choose the input column as the "Properties" column

    b. Specify the delimiter as the “_” character

    c. Add three output columns by clicking the “ADD+” button twice, and enter the names for the three output columns as “item_type”, “colour”, and “size”.

  3. Click “Add” to add the action to the recipe queue and to see a preview of the output.

Split by Delimiter into Multiple Columns Action

Split by Delimiter Action - Before and After

Additionally, this action also comes with optional parameters for greater flexibility:

Split by Delimiter - Optional Parameters

  1. Users can also split by a pattern expressed in the form of a regular expression (perl style).

  2. If the number of splits is more than the specified number of output columns, the rest of the splits are included in the final column as a single string that is not split, as seen in the example below. If the user doesn’t want this and wants to ignore the rest of the text after the split corresponding to the last column, they can do so by turning off the switch “Include tail of string in final split”. 

Split by Delimiter as REGEX - Before and After

Single column containing multiple tags per row

Certain entities can have attributes that contain multiple values. For example, the tags or keywords of each video in a dataset of popular YouTube videos:

Dataset of Popular Youtube Videos

One might be interested in only the videos containing a certain number of keywords from a certain domain.

There is no set number of tags for the whole column, as the number of tags can be different for two different videos. Hence, we cannot use the “Split by delimiter into multiple columns” action. Instead, the platform provides the “Split to array” action to solve this. On the action editor, we will use a regular expression pattern this time, so that any spaces surrounding the pipe character ("|"), will be discarded in the result. The pattern we are looking for is \s*\|\s*, which stands for <any-spaces><pipe-character><any-spaces>. Note that this action replaces the original column with a single column, whose data type is indicated as JSONArray:

Split to Array by Delimiter Action

Split to Array by Delimiter - Before and After

We can then proceed further by applying other recipe actions. For example, one may be interested in adding a column indicating the number of keywords in each video. To do this, add the action “Create a new column with a formula”, and enter the formula “JSON_ARRAY_LENGTH(tags)”. Name the output as “tags_count”:

Create New Column with Formula - Action and Result

Using other actions, one can also process the result of splitting the "tags" column in other interesting ways:

  1. Add a column showing which of the top 20 most frequent keywords figure in each video.

  2. Discard rows that have none of the top 20 most frequent keywords.

Read how you can apply actions to multiple columns at once on the AI & Analytics Engine!

Extracting special tags from a text column

Sometimes, splitting by delimiter to get the tags is not appropriate. For example, we may want to look at the description of each video and extract the hashtags in them. For such cases, we have the “Extract all matches of a regular expression” action, where the user can specify an input column and a pattern, to produce an output column containing the parts of the text that match the pattern, as shown below. The pattern we are after here is #\w+ which stands for <'#' character><one-or-more-alphanumeric-or-underscore>:

Extracting Special Tags from a Text Column

Extract Tags from Text Column - Input and Output

Similar to the previous section, one can then further process the JSONArray column resulting from the action.

Capturing parts of a pattern match into different columns

Some datasets contain text columns with useful data embedded within them and exhibit a certain pattern. One would need to match these patterns and extract parts of the match. Consider this example:

Example column of duration in years and months

We have a column where the values are of the form “X yrs Y mo”, where X and Y are the year and month values respectively. To extract “years” and “months” as separate columns, we can now use the “Extract capture groups” action. We specify a regex pattern with two capturing groups, enclosed in parentheses, corresponding to the year and month columns, and add two output columns (Note: If the number of output columns is mismatched with the number of capturing groups in the pattern, a validation error will be raised to the user as a dialogue window, and they will have to edit the action appropriately):

Extract Capture Groups Action

We will then have separated out “years” and “months” into separate columns, as follows:

Extract Capture Groups - Result

The regex pattern used in the above example is (\d+)\s*yrs\s*(\d+)\s*mo , which stands for <one-or-more-digits><any-spaces><literal value "yrs"><any-spaces><one-or-more-digits><any-spaces><literal value "mo">. The parentheses around the 2 <one-or-more-digits> part, identifies the groups we want to capture.

For this action, the data type of the output columns will always be Text , but one can apply the “Cast columns to numeric type” to the resulting columns to make them numeric.

Separating entities from a single row into multiple rows

Consider a dataset of restaurants, where the location of the restaurant is given along with the cuisines included in its menu:

Dataset of Restaurants

We would like to transform this into a table where for each city, we know how many restaurants cater to each type of cuisine.

This requires a reshaping of the data after splitting the "cuisine" column by the delimiter pattern “<spaces><comma><spaces>”, followed by a group-by and pivot operation.

The “split by delimiter and unpack to rows” action brings the ability to perform the first step. We will have to set the “Treat delimiter as regex pattern” optional parameter to “Yes”, and enter the regular expression \s*,\s* corresponding to the <spaces><comma><spaces> pattern. The output “position” column is not important for this use case, but it is useful in case we want to keep only the first N cuisines. The action configuration and the result are shown below. Note that each row produces multiple rows, one for each cuisine the restaurant serves:

Split by Delimiter and Unpack into Rows Action

Split by Delimiter and Unpack into Rows - Before and After

We can then use the “Reshape dataset into a pivot table” action, grouping by "city" and pivoting on the "cuisines" column, restricting cuisines to a few important ones. The full recipe and the result is seen below:

Reshape Dataset into a Pivot Table Action

Wrapping up

We have seen how the new recipe actions enable our users to perform multiple kinds of data preparation steps frequently needed in the real world. As can be seen from the above examples, all of these complex and time-consuming ways of preparing unstructured data can be achieved with a few clicks, in a repeatable and maintainable manner. It can also be shared with others in the team who are not familiar with coding styles and software tools typically employed by data scientists.

Not sure where to start with machine learning? Reach out to us with your business problem, and we’ll get in touch with how the Engine can help you specifically.

Get in touch

Similar posts