Blowing out project timelines and adding a little grey to your hair along the way. It is no surprise, then, that Amazon Web Services (AWS) recently announced the release of their new data-preparation tool, AWS Glue DataBrew. We wanted to provide a comparative look at this new option in the market and identify the similarities and differences with the AI & Analytics Engine’s Smart Data Preparation feature.
So if you are looking for tools to help claw back time from data preparation, read on...
First, a little background on both options.
AWS’s new data-preparation tool belongs to the category of no-code, easy-to-use visual data-preparation engines. It is primarily intended as a tool for cleaning, normalizing, and profiling data, as well as automating recipe jobs. It can be used as a standalone tool for data preparation, but can also be integrated with AWS’s ecosystem of tools and services, such as S3 or other AWS data lakes and databases, for storage, import, and export of unprepared/prepared data.
We decided to try out the platform to develop this article. Below is a screenshot of the AWS Glue DataBrew graphical user interface (GUI).
Smart Data Preparation (The AI & Analytics Engine)
Within the AI & Analytics Engine, Smart Data Preparation is a fully interactive feature that lets users prepare their data at scale in a flexible manner.
The premise is that the user is guided by smart recommendations during the recipe-creation process. It offers a variety of “actions” (data-transformation steps) through an action catalogue, giving users the ability to fully customize and edit their data-preparation recipes. It covers the four stages of data preparation commonly required in analytics and machine learning tasks:
- Feature engineering
Rather than a standalone tool, the Smart Data Preparation feature is a tightly integrated functionality within the end-to-end user journey on the AI & Analytics Engine platform. It sits between the data import and app creation phases of the journey in a unified graphical user interface (GUI).
Walkthrough of Smart Data Preparation Through to Model Selection & Deployment
We have detailed the simple steps to prepare a model, starting with the Smart Data Preparation feature. For a more in-depth walkthrough, take a look at the recorded demo below.
1. Import Data
2. Smart Data Preparation
   - Examine recommended actions and analysis
   - Customize and commit the actions
   - Finalize the recipe
3. Inspect Statistical Profile of Finalized Dataset
4. Create app (Target variable selection, train/test split)
5. Select features
6. Select models to train
7. Deploy trained models
Key Benefits: It’s what you DON’T need to do
Smart Data Preparation on the AI & Analytics Engine provides ease of use through the above seamless interface. In particular, there is:
- NO need to manually generate access tokens and turn on/off secure access of data between different tools
- NO need to manually set up S3 buckets or migrate all of your data into a single ecosystem such as AWS
- NO need to write scripts to manually orchestrate different components of the end-to-end pipeline
The main similarity between the two tools is that they both cater to the need for an easy-to-use no-code interactive data preparation tool. Let's take a closer look:
Data Ingestion and Outcome
Both tools are targeted at tabular data files in the following formats: CSV, JSON (lines), and Parquet.
While DataBrew allows the import of data mainly from AWS cloud sources such as S3, Redshift, and RDS, Smart Data Preparation (the AI & Analytics Engine) allows a diverse set of options for importing data, such as HTTP (URLs) and SQL and NoSQL databases. Users can still import data from cloud storage services such as S3 or GCS (Google Cloud Storage) by generating a pre-signed URL for their dataset and using the HTTP option.
The outcome of the recipe-building process is a re-runnable recipe that can be re-used to transform a larger dataset of the same input schema.
Interactive Recipe Building and Recommender
Both AWS Glue DataBrew and the Smart Data Preparation feature have an interactive recipe-building user interface. There are many similarities between the two:
- Quick preview of the dataset being prepared
- The list of actions (steps) selected so far in the recipe
- The ability to edit or delete a step in the recipe
Validation of Recipe
Whenever the list of actions in a recipe is modified, a validation check is applied to ensure that the recipe is legitimate. Among many checks, it includes the following:
- Input columns are available in the schema and are of the correct data types
- The combination of parameters is valid
- Names of new columns output by an action are not colliding with existing column names
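To illustrate those checks (this is a hypothetical sketch, not the actual implementation of either platform), a minimal recipe validator might look like this:

```python
# Hypothetical sketch of recipe validation. The schema is a dict of
# column name -> type; the recipe is an ordered list of action dicts.
def validate_recipe(schema, recipe):
    errors = []
    columns = dict(schema)  # evolving view of the schema as actions run
    for action in recipe:
        # Check 1 & 2: input columns exist and have an accepted type.
        for col in action["inputs"]:
            if col not in columns:
                errors.append(f"{action['name']}: input column '{col}' not in schema")
            elif columns[col] not in action["accepted_types"]:
                errors.append(f"{action['name']}: column '{col}' has wrong type")
        # Check 3: new output columns must not collide with existing names.
        for new_col, new_type in action.get("outputs", {}).items():
            if new_col in columns:
                errors.append(f"{action['name']}: output '{new_col}' collides with an existing column")
            else:
                columns[new_col] = new_type
    return errors

schema = {"age": "numeric", "name": "text"}
recipe = [
    {"name": "log_transform", "inputs": ["age"], "accepted_types": {"numeric"},
     "outputs": {"age_log": "numeric"}},
    {"name": "uppercase", "inputs": ["city"], "accepted_types": {"text"}},  # bad input
]
errors = validate_recipe(schema, recipe)
print(errors)  # flags the missing 'city' column
```

Note that the validator threads the schema through the recipe, so a column created by an earlier action is a valid input to a later one.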
Diversity of the action catalogue
DataBrew’s official “Recipe actions reference” documentation lists about 170 actions in its catalogue. By comparison, our Smart Data Preparation feature supports 85 actions plus 81 formula functions.
The coverage areas of the actions also differ between the two platforms, as shown in the table below:
Column groups as the output of complex actions
Some actions, such as “pivot” or “extract PCA components”, result in a “column group” rather than a single column. A column group serves as a placeholder in the schema wherein one or more component columns can be generated by a recipe action. This enables:
- Better understandability, since the user is aware that the individual components within the group are generated by a single action.
- Universal validity of the recipe for all future batches of data, since running actions of the aforementioned types on different batches of data can lead to a different number of individual components. For example, if a “pivot” action is run on a different batch of the data, the number of columns produced in the output can differ.
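A quick pandas sketch shows why: pivoting two batches with the same input schema can yield different output columns, depending on the values present in each batch (the data here is made up for illustration).

```python
import pandas as pd

# Two batches with the SAME schema but different 'product' values.
batch1 = pd.DataFrame({"id": [1, 1, 2], "product": ["A", "B", "A"], "qty": [3, 1, 2]})
batch2 = pd.DataFrame({"id": [1, 2], "product": ["A", "C"], "qty": [5, 4]})

# Pivot each batch: one output column per distinct product value.
wide1 = batch1.pivot_table(index="id", columns="product", values="qty")
wide2 = batch2.pivot_table(index="id", columns="product", values="qty")

print(list(wide1.columns))  # ['A', 'B']
print(list(wide2.columns))  # ['A', 'C']
```

A recipe that referenced the pivoted columns by fixed name would break on the second batch; a column group referencing “the output of the pivot action” stays valid.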
Advanced column selectors
Our Smart Data Preparation offers the ability for users to apply a transformation to multiple columns with a single action. To aid this, our graphical user interface and API provide the following modes for including or excluding input columns, column groups, or both in the selection:
- By Name: explicitly list the column names
- By Type: select columns by their schema type
- By Pattern: select columns whose names match a regex pattern
The user is also allowed to combine multiple such criteria with the and/or operator. This enables full flexibility to let the user specify complex selection criteria such as “columns matching the name pattern ‘x_.*’, excluding non-numeric columns.”
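A minimal sketch of such a selector follows; the criteria format and function names are hypothetical, not the Engine’s actual API, but they reproduce the example above.

```python
import re

def select_columns(schema, include=None, exclude=None):
    """Hypothetical column selector. schema maps column names to types;
    each criterion is ('name', names), ('type', types) or ('pattern', regex).
    A column is selected if it matches ANY include criterion (or) and
    NO exclude criterion (and-not)."""
    def matches(col, col_type, crit):
        kind, value = crit
        if kind == "name":
            return col in value
        if kind == "type":
            return col_type in value
        if kind == "pattern":
            return re.fullmatch(value, col) is not None
        raise ValueError(f"unknown criterion: {kind}")

    selected = []
    for col, col_type in schema.items():
        if any(matches(col, col_type, c) for c in (include or [])) and \
           not any(matches(col, col_type, c) for c in (exclude or [])):
            selected.append(col)
    return selected

schema = {"x_height": "numeric", "x_label": "text", "y": "numeric"}
# "columns matching the name pattern 'x_.*', excluding non-numeric columns"
selected = select_columns(schema,
                          include=[("pattern", r"x_.*")],
                          exclude=[("type", {"text"})])
print(selected)  # ['x_height']
```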
Queuing and Committing actions to a recipe
In the AI & Analytics Engine, whenever an action is added, it is “queued” to be committed to the recipe. As in DataBrew, the queued actions are run on a fixed-size sample (the first 5k rows) of the full dataset and the result is displayed as a preview. This serves as visual feedback, allowing users to adjust the configuration of their actions until the preview matches the desired result.
Our platform also provides additional functionality called “committing” the queued actions to a recipe before continuing to edit it further. Committing signals the platform to run the recipe actions on the full dataset (rather than on the sample) and then show a refreshed sample preview. This results in a more accurate data preview than one obtained by sampling the raw data first and then applying the recipe actions to the sample.
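The difference can be sketched in pandas with a scaling action whose result depends on dataset-wide statistics; the function names and sample handling below are illustrative, not the Engine’s implementation.

```python
import pandas as pd

SAMPLE_SIZE = 5000  # preview runs on the first 5k rows

def apply_actions(df, actions):
    for action in actions:
        df = action(df)
    return df

def preview_queued(df, actions):
    """Fast preview: actions run on a head sample of the raw data."""
    return apply_actions(df.head(SAMPLE_SIZE), actions)

def commit(df, actions):
    """Commit: actions run on the FULL dataset; the preview is then
    re-sampled from the processed result, so it is more accurate."""
    full = apply_actions(df, actions)
    return full, full.head(SAMPLE_SIZE)

df = pd.DataFrame({"x": range(10_000)})
scale = [lambda d: d.assign(x=d["x"] / d["x"].max())]  # max-scaling action

quick = preview_queued(df, scale)   # max computed from the sample only
full, accurate = commit(df, scale)  # max computed from the whole dataset

# Row 4999 looks like the maximum (1.0) in the quick preview, but after
# committing it is only about halfway up the true range.
print(quick["x"].iloc[4999], full["x"].iloc[4999])
```

Any action driven by aggregate statistics (scaling, outlier clipping, rare-category grouping) can mislead in a sample-first preview in exactly this way.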
Committing actions to a recipe also allows the platform to run intelligent algorithms over the fully processed data to provide users with good recommendations for the next set of actions in their recipe.
In DataBrew, recommendations are generated on a per-column basis and are not available by default. The user needs to click on a particular column and request recommendations for that column.
The AI & Analytics Engine’s Smart Data Preparation also provides recommendations of the next set of actions likely to be helpful to the user.
The key differences are that:
- These recommendations are generated automatically the first time the user starts a recipe and after every commit.
- The recommendations are based on a sample of the entire dataset rather than a single column. This makes it quite attractive for users who want to detect, for example, input columns that have too little correlation with the target variable and need to be dropped.
- Every recommendation is accompanied by reasons why these actions are useful, showing charts and summary statistics to help the user understand and scrutinize their data.
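A dataset-wide check of the kind described above, flagging weakly correlated input columns, might look like this sketch (the threshold and data are made up for illustration):

```python
import pandas as pd

# Toy dataset: 'signal' tracks the target perfectly, 'noise' does not.
df = pd.DataFrame({
    "signal": [1, 2, 3, 4, 5, 6],
    "noise":  [3, 0, 3, 3, 0, 3],
    "target": [2, 4, 6, 8, 10, 12],
})

# Absolute Pearson correlation of each input column with the target.
corr = df.corr()["target"].drop("target").abs()

# Recommend dropping columns below an (arbitrary) threshold.
to_drop = corr[corr < 0.2].index.tolist()
print(to_drop)  # ['noise']
```

A column-at-a-time recommender cannot make this kind of suggestion, since it needs the target column and the candidate columns together.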
The pricing structure for both options is very different; keep in mind that you purchase DataBrew as a standalone tool and the AI & Analytics Engine as an end-to-end toolchain.
AWS Glue DataBrew: pricing is calculated at an hourly rate, billed per second, with additional costs based on tasks and region. It is highly contextual to how you would use the platform, so it is best to check it out here: https://aws.amazon.com/glue/pricing/. You can also use their calculator.
"For the AWS Glue Data Catalog, you pay a simple monthly fee for storing and accessing the metadata. The first million objects stored are free, and the first million accesses are free. If you provision a development endpoint to interactively develop your ETL code, you pay an hourly rate, billed per second."
The AI & Analytics Engine: There are four subscription tiers to cater for individual data users all the way through to enterprise options. A free trial of the AI & Analytics Engine is currently available for 12 weeks. For more information, you can check out PI.EXCHANGE’s AI & Analytics Engine pricing. Prices start from USD $129 per month.
If you are after a tool to hasten the data-preparation stage of the data science process, both options will assist in this endeavour. However, there are differences to consider that may mean one option fits your needs better than the other. The key differences are:
- Utility of an integrated toolchain: Smart Data Preparation is an integrated feature. This means that, unlike DataBrew, you can seamlessly prepare, build, and deploy. This benefits those who have prepared data for downstream ML purposes and want to jump straight into the next step.
- Diverse options for data import: Smart Data Preparation has a diverse set of options for importing data, while DataBrew mainly allows the import of data from AWS cloud sources such as S3, Redshift, and RDS. If you store your data with AWS, this is not an issue.
- Similar action count, different action coverage: DataBrew has slightly more actions at about 170, whilst Smart Data Preparation supports 85 actions plus 81 formula functions. So, understanding the type of actions useful to you, given your data and the task at hand, is key.
- Advanced column selectors: Smart Data Preparation offers the ability to apply a transformation to multiple columns with a single action. DataBrew does not, which can be an issue when working with large datasets.
- Advanced recommendations: Whilst DataBrew requires users to click on a particular column and request recommended actions, Smart Data Preparation provides recommendations of the next set of actions likely to be helpful to the user. These recommendations are generated automatically when the user starts a recipe and after every commit. The recommendations are based on a sample of the entire dataset rather than a single column (as in DataBrew). Every recommendation is accompanied by reasons why these actions are useful, showing charts and summary statistics. The benefit is that you get a deeper understanding of your data, along with a greater understanding of why the platform has made its recommendations.
If you would like to trial the AI & Analytics Engine for yourself and better understand how the Smart Data Preparation tool can help speed up your data preparation, PI.EXCHANGE offers a free trial of the AI & Analytics Engine.