
How to build a supervised machine-learning pipeline

This article is an end-to-end walkthrough of the “build from scratch” approach to building and using a supervised machine-learning pipeline. We will be using a banking use case as an example.

Watch the walkthrough video:



The AI & Analytics Engine provides two fundamental ways for users to build their ML solutions and use them:

Templates primarily cater to business users who are not necessarily familiar with ML and data-science concepts. They build entire ML pipelines requiring only business-specific inputs. On the other hand, the “build from scratch” option provides analysts and data/ML practitioners with easy-to-use no-code tools to define, compose, and run custom ML pipelines to suit various business use cases. It enables such users to focus on domain knowledge and feature engineering rather than coding and engineering tasks.

In this article, we will consider a specific use case to walk through the process of building and using such an ML pipeline.

The use case

For the purpose of this walkthrough, let’s consider a scenario where you are a data analyst/scientist in a banking institution, and your objective is to use data to optimize the loan-approval process. The main challenge is to distinguish between loan applicants who are likely to repay their loans on time and those who are prone to default or delay their payments, by building a model that learns from the historical transactions of loan applicants and their final outcomes. This is a binary classification problem, where the features are derived from the loan applications, the customer demographics, and their historical transactions, with the target variable being the loan status (“no problem” or “debt”). To expedite project delivery, the preferred approach is to utilize no-code tools that eliminate the coding effort and the experimentation time on various methods, and also handle model deployment in production.


The data used in this article are obtained from the Czech bank dataset, which you can download here.

The input data for this use case consists of the following tables:

  • loans: this table contains information about the loan applications, such as loan ID, customer ID, loan amount, loan duration, and loan status. The loan status is a categorical column with four possible values: A (finished, no problem), B (finished, debt), C (running, no problem), and D (running, debt). This column will be transformed into a binary column with two values: “no problem” and “debt”, which will be used as the target column for the classification problem.

  • clients: this table contains customer records containing the following attributes: gender, date of birth, and location.

  • transactions: this table contains historical records of customer transactions. Each record contains the customer ID, transaction date, transaction amount, transaction type and balance.
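To make the target definition concrete, here is a minimal pandas sketch of the binarization of the loan status described above. The miniature table and its column names are illustrative, not the exact schema of the dataset:

```python
import pandas as pd

# Hypothetical miniature of the loans table; real column names may differ.
loans = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "status": ["A", "B", "C", "D"],
})

# A and C mean no repayment problems (finished or running);
# B and D indicate debt. Collapse the four codes into two labels.
status_map = {"A": "no problem", "B": "debt", "C": "no problem", "D": "debt"}
loans["status"] = loans["status"].map(status_map)
```

The Engine performs an equivalent transformation via a recipe action, without any code.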

Preview of the loans table

Preview of the clients table

Preview of the transactions table

Getting started with the App Builder

The first step is to log in to the Engine, and ensure that you are in the right project in which you want to create the app:

Step 1: Log in to the Engine, then ensure you are in the right project

Step 2: Choose “Build from scratch”

Next, we are prompted to choose between two options. Here, we need to choose “Regression or classification”, as the task we have for our banking use case is classification:

Step 3: Choose “Regression or classification”

Then we need to give the app an appropriate name to proceed. Here, we will leave it as the default app name, as we are able to change the name any time later from the “Settings” tab of the app.

Step 4: Configure the app name

Once we confirm the app name, we are taken to the “App builder” view. It contains the steps to prepare data, define what to predict (the target column and the problem type), select the features and then build models.

Adding datasets

In the first step, we add the datasets relevant to the use case and generate a single dataset containing the feature columns and the target column, using appropriate data-preparation steps. We can either add existing datasets in the same project that we previously imported, or import new ones. Since we do not have any existing datasets, let us proceed with the “Import” option:

Choose “Import new dataset” to add datasets to the App Builder

Under the import options that show up next, we choose the “File upload” option and then choose the file that we want to upload, and proceed. Let us first choose the “clients” dataset for upload:
Choose the “File Upload” option

Choose the clients.csv file

We can go to the next step, where we can see the preview and the schema after a very short wait:

Preview and configure step

The schema that the Engine has inferred looks satisfactory, especially since it has correctly detected the type of “date_of_birth” as “datetime” and all other columns except “gender” as numeric.

Upon confirming the creation of the dataset, we are taken to the app builder where we can see the added dataset:

Dataset added to the “Prepare data” step of the App Builder

We can examine the column stats and visualizations with the “View Analysis” icon or remove the dataset from this app (but keep it in the project) with the cross icon:

Quick access to data analysis results (stats and visualizations) within the app builder

Removing an added dataset. The dataset is still retained in the project for other apps

Let us proceed to add the other two datasets: loans.csv and transactions.csv. Once all three datasets are added, we can see them back in the app builder’s “Prepare data” stage.

Preparing the training dataset

The next step is to prepare the datasets. We aim to end up with a single tabular dataset that combines the loan information, aggregate stats from the customers' transaction history, and their demographic information, giving us an ML-ready dataset to proceed with. If in doubt, we can use the “View guide” link in the header:

“View guide” links are provided on the Engine where necessary

This pops up a few short slides visually explaining how we are supposed to prepare data:

Slide within “View guide”: Explaining the required structure of the training data

Slide within “View guide”: Explaining how to prepare a dataset

Slide within “View guide”: Explaining how to use the prepared dataset as training input

The Engine provides such guides where necessary to familiarize users with the steps needed to be taken.

Continuing with our use case, let us first prepare the loans and transactions datasets.

Preparing the transactions dataset

Let us first look at the transactions dataset. To prepare a dataset, we first click on it and then choose the option to “Prepare data”.

Select the dataset, then click on “Prepare data”

As you can see from the three options, we can either:

  • Create a new recipe

  • Copy and modify an existing recipe

  • Apply an existing recipe as-is without modification

Let us choose the first option, and proceed with the recipe name as “Prepare transactions”:

Creating a new recipe. We can also copy and modify an existing recipe or apply an existing recipe “as is” to the selected dataset

We are then taken to the recipe editor. After a short wait, we can start adding actions to prepare data. On the recipe editor, we can see a preview of the dataset after the added actions have been applied. We haven’t added any, so we are just seeing the preview of the original data here. To add actions, we use the right-side panel:

The recipe editor: Actions can be added using the “Add actions” button on the right panel

The list of actions needed to prepare a dataset differs from dataset to dataset and use case to use case, and is to be decided by us based on our understanding of the dataset, the domain, and the business problem.

For the transactions dataset, all we need is a translation of the non-English (Czech language) labels in the type column into English-language labels.

To do this, we first search for the necessary action using the search bar under “All actions”:

Use the search bar to find the necessary action

The “Map to Text” option is appropriate in this situation. We then choose the desired action to be added and configure it properly. Here, we just need to apply the following translations and add the action:

  • PRIJEM -> credit

  • VYDAJ -> withdrawal

  • VYBER -> withdrawal_cash
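Conceptually, the “Map to Text” action performs a simple value replacement. A rough pandas equivalent is shown below (toy data; the real dataset has more columns):

```python
import pandas as pd

# Toy sample of the transactions "type" column (illustrative).
transactions = pd.DataFrame({"type": ["PRIJEM", "VYDAJ", "VYBER", "PRIJEM"]})

# The same Czech-to-English translations configured in the action.
translation = {
    "PRIJEM": "credit",
    "VYDAJ": "withdrawal",
    "VYBER": "withdrawal_cash",
}
transactions["type"] = transactions["type"].map(translation)
```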

Once the configuration is completed, we can add the action:

Configuring the Map to Text action: 1. Choose the column(s) to which the mapping is to be applied, 2. Provide the values to be mapped, 3. Add the action

We can then see the added action in the recipe. Notice that the preview has updated by running the action on the sample, as we can now see the translated labels:

Values in the type column translated after the action is applied

Accessing the “Map to Text” action

Once an action is added to a recipe, if we need to change our mind, we have the “Edit” and “Delete” options:

Modifying a recipe by editing or deleting an action

Having ensured we have added the actions we want to the recipe, let us finish it by clicking “Run recipe” and providing a name for the output dataset:

The recipe to be run on the transactions dataset

Configuring the name of the new dataset created by the recipe run

Once we confirm the name of the output dataset, we can see the recipe and the output dataset being generated in the “Prepare data” stage of the app builder:

A recipe in “Running” status with the output dataset yet to be generated

We do not need to wait here; we can proceed to the next dataset, which is “loans”.

Preparing the loans dataset

We follow the same process to create an appropriate recipe. Here is a summary of the steps:

The recipe for the “loans” dataset

We finish the recipe by issuing “Run recipe”. After some time, we can see both the recipes have finished running and have produced the prepared datasets as outputs:

The recipes for preparing the transactions and loans have finished running and the prepared datasets are available

Aggregating and joining to generate the final dataset

Next, we want to generate features to predict the loan status. To do so, we need aggregate stats of transactions up to and including the loan application date. To that end, we should join the transactions dataset with the loans dataset and filter out transactions that happened later than the loan date before computing the aggregations, to avoid data leakage from the future into the present. We can do that by selecting the “prepared transactions” dataset and then selecting the “Merge with another dataset” option this time:

Using the “Merge with another dataset” option

This process also creates a recipe, but conveniently adds the action to merge the two datasets on our behalf. We are shown different ways to merge with a dataset: joins (such as left, outer, and inner join), looking up values from another dataset, and aggregate lookups. Let us go with the “Join” option:

Options within “Merge with another dataset”. There are several ways to merge two datasets, like joining

Select the dataset to join with, which is the “prepared loans” dataset, then click “create recipe”:

Selecting the dataset to merge with

We are then taken to the recipe-editor view with the join action open. We can modify and confirm the join parameters inferred by the Engine here. Let us do a right join, since we are only interested in the transactions from accounts that have a loan. Further, from the loans dataset, we can keep all columns except for the account id, as the transaction dataset already has an account id:

A selected merging action is added to a new editable recipe. We can customize the action further

Save changes to add the action, then we can see columns added from the loans dataset as well:

Result of the join action. We can see the columns added from the loans dataset
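For intuition, a right join like this can be sketched in pandas as follows. The tiny frames and column names (e.g. "account_id", "loan_amount") are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical fragments of the prepared datasets.
transactions = pd.DataFrame({
    "account_id": [1, 1, 2, 3],
    "amount": [100.0, 50.0, 75.0, 20.0],
})
loans = pd.DataFrame({
    "account_id": [1, 2],
    "loan_amount": [5000.0, 3000.0],
})

# Right join: keep only transactions from accounts that have a loan.
# Account 3 has no loan, so its transactions are dropped.
merged = transactions.merge(loans, on="account_id", how="right")
```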

Next, we want to remove the entries where the transaction date is past the loan date, so that we do not leak data from the future into the present. To do so, we choose the “filter” action from the “All actions” catalogue, and enter the formula "date < loan_date" to keep only the rows satisfying this criterion.

Accessing the “filter rows” action

The (boolean) criterion we want to filter by: Use only the transaction history that happened before the loan agreement date
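The same anti-leakage filter can be pictured as a one-line boolean mask in pandas (column names assumed from the join above):

```python
import pandas as pd

# Assumed columns: "date" (transaction date) and "loan_date".
df = pd.DataFrame({
    "date": pd.to_datetime(["1995-01-10", "1995-06-01", "1996-02-01"]),
    "loan_date": pd.to_datetime(["1995-12-01"] * 3),
})

# Keep only transactions strictly before the loan agreement date,
# so no post-application information leaks into the features.
df = df[df["date"] < df["loan_date"]]
```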

Then for each loan, we will compute the following aggregate stats:

  • Number of transactions of each type

  • Minimum, maximum, mean, median, and standard deviation of:

    • amounts for each transaction type

    • interval between transactions of each type

  • Days since the last transaction of each type

To facilitate finding the average interval between transactions of each type, we first want to compute the "lag" function of the date column, partitioned by "account_id" and "type", ordered by the "date" column. This can be accomplished with the “Shift rows…” action:

Accessing the “Shift rows” action to get the previous transaction of the same type

Let’s name the output column produced by this action "previous_date_of_same_transaction_type".

Configuring the “Shift rows” action: We want to partition by both the "account_id" and "type" columns to compute the lagged date that contains the date of the previous transaction of the same type from the same customer as the current transaction

We add this action, and then compute the number of days elapsed between "date" and "previous_date_of_same_transaction_type" using the “compute time elapsed” action, naming the output as "days_since_last_transaction_of_same_type". Note that we need to specify the optional “time unit” parameter as “days” instead of leaving it to be the default setting of “seconds”:

Accessing the “Compute time elapsed” action to compute the difference in days between two dates

Configuring the “Compute time elapsed” action: We want to subtract the date of the previous transaction of the same type from the date of the current transaction, to get the intervals in days between successive transactions of the same type
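The “Shift rows” and “Compute time elapsed” pair corresponds to a windowed lag followed by a date difference. A minimal pandas sketch, with a toy frame and the same (assumed) column names:

```python
import pandas as pd

# Assumed schema for the joined transactions.
df = pd.DataFrame({
    "account_id": [1, 1, 1, 2],
    "type": ["credit", "credit", "withdrawal", "credit"],
    "date": pd.to_datetime(["1995-01-01", "1995-01-11", "1995-01-05", "1995-02-01"]),
})

# "Shift rows" ~ a lag window function: the previous date within each
# (account_id, type) partition, ordered by date.
df = df.sort_values(["account_id", "type", "date"])
df["previous_date_of_same_transaction_type"] = (
    df.groupby(["account_id", "type"])["date"].shift(1)
)

# "Compute time elapsed" in days between successive transactions
# of the same type (NaN for the first transaction of each type).
df["days_since_last_transaction_of_same_type"] = (
    df["date"] - df["previous_date_of_same_transaction_type"]
).dt.days
```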

Then, to compute multiple stats for each transaction type, we can make use of the “Reshape dataset into a pivot table” action, specifying account_id for the group-by column, "type" for the pivot column, and "credit", "withdrawal", and "withdrawal_cash" for the pivot values. We can add multiple aggregates as follows:

Using the “Pivot” action to compute aggregate features: We pivot by the transaction’s type to get stats of the amount and interval in days columns under the specified transaction types (pivot values) separately

Upon saving changes, we can see the pivot action is added and the aggregations are generated for each pivot value (transaction type): "credit", "withdrawal", and "withdrawal_cash":

Result of the pivot action: Aggregate features can be seen in the preview
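The pivot action can be approximated with pandas `pivot_table`. The sketch below is simplified to aggregate only the amount column (the Engine also aggregates the interval column); data and flattened column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "account_id": [1, 1, 1, 2],
    "type": ["credit", "withdrawal", "credit", "credit"],
    "amount": [100.0, 40.0, 60.0, 30.0],
})

# One row per account, one set of aggregate columns per transaction type.
pivot = pd.pivot_table(
    df,
    index="account_id",
    columns="type",
    values="amount",
    aggfunc=["count", "min", "max", "mean"],
)

# Flatten the (aggfunc, type) column MultiIndex into single names.
pivot.columns = [f"{agg}_amount_{t}" for agg, t in pivot.columns]
pivot = pivot.reset_index()
```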

After this we add a few more actions:

Completing the final recipe with the required actions: Getting the features from the loans dataset back, then getting features from the clients dataset, computing recency and age features

Here is the reasoning behind these actions. After adding the pivot action to compute aggregates by transaction type, the columns from the “loans” dataset were lost, but we get them back by joining again using the same join keys. We then join with the “clients” data to get the demographic features "date_of_birth", "age" and "district_id".

We then extract more features from the datetime columns:

  • Days since last transaction of each type at loan-application day: using the “Compute time elapsed” action

  • Age of the person: using the “Compute time elapsed” action, followed by division by 365 to convert days to years

As a final step, we drop all columns of "DateTime" type (as we have already extracted features from them), the column "account_id" (as it is not a feature column), and the intermediate column storing the age in days instead of years.

We can now run this recipe to produce the final training dataset:

Output dataset of the recipe joining and aggregating from three sources

We need one more important feature from the transactions dataset: The minimum balance at the time of loan application. To get this feature, we start again from the prepared transactions dataset. We will be joining with both the prepared loans dataset and the dataset output by the latest recipe we worked on.

Here is a summary of the recipe:

Recipe to generate the “minimum balance” feature and the input dataset it is created from


We do a right join with the prepared loans dataset and filter out transactions whose dates are later than the loan application date. We then compute the minimum of the balance column grouped by the account_id column, and finally join with the features that we generated before, using the "account_id" column as the key.
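The group-by-minimum plus join described above can be sketched as follows. The toy balances and the "min_balance" / "n_credit" column names are assumptions:

```python
import pandas as pd

# Toy pre-loan transaction balances per account.
tx = pd.DataFrame({
    "account_id": [1, 1, 2, 2],
    "balance": [500.0, 120.0, 900.0, 700.0],
})

# Minimum balance observed before the loan application, per account.
min_balance = (
    tx.groupby("account_id", as_index=False)["balance"]
      .min()
      .rename(columns={"balance": "min_balance"})
)

# Join this feature back onto the previously generated feature table.
features = pd.DataFrame({"account_id": [1, 2], "n_credit": [3, 5]})
features = features.merge(min_balance, on="account_id", how="left")
```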

Selecting the training dataset

Let us next define what to predict. To proceed to that step, we need to select our final dataset and set it as the training dataset:

Selecting the training dataset

We can then click “Save and continue” to proceed to the next step, to define what to predict:

Step 1 is completed & proceeding to Step 2

Selecting the target column and problem type

In the next step, we first need to select the target column. Here, it is the column named “status”:

The target column needs to be specified

Once the column is chosen, the Engine automatically selects “binary classification” with “debt” as the positive label and disables “Regression”, as we are predicting two classes with labels “debt” and “no problem”. We can also configure the train/test split. Let us leave it as the default of 80% train (and 20% test) this time:

For the specified target column, the prediction type and the train/test split configuration can be chosen
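The Engine handles the split internally; conceptually it is an 80/20 random split of the prepared dataset, roughly like this pandas sketch (toy data, illustrative column names):

```python
import pandas as pd

# Hypothetical ML-ready table with the binary target column.
data = pd.DataFrame({
    "feature": range(10),
    "status": ["no problem"] * 8 + ["debt"] * 2,
})

# 80% of rows for training, the remaining 20% held out for testing.
train = data.sample(frac=0.8, random_state=42)
test = data.drop(train.index)
```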

We hit “save and continue” again to proceed to the next step to select features.

Selecting features

Here, we can either:

  1. Let the Engine automatically select the best set of features, keeping the top-ranked features whose total importance reaches a desired percentage threshold

  2. Manually select the features, using the feature-importance estimates as a guide

Selecting features: Automatic and manual options

Initially, the feature-importance estimates will not be available as the Engine needs to compute them in the background. If we are confident, we have the option to manually select all of them, but for this use case walkthrough, we will wait instead.

The Engine takes a bit of time to compute feature importances

Once the estimation is done, we can sort features by importance and select the top-ranked ones that reach a certain total percentage threshold:

Features ordered by decreasing importance: Available when the Engine finishes analyzing data
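Threshold-based selection amounts to walking down the ranked list until the cumulative importance reaches the target. A small sketch with made-up importance scores and feature names:

```python
import pandas as pd

# Hypothetical importance estimates (illustrative values and names).
importances = pd.Series({
    "min_balance": 0.30,
    "mean_amount_credit": 0.25,
    "days_since_last_credit": 0.20,
    "age": 0.15,
    "district_id": 0.10,
})

# Keep top-ranked features until cumulative importance reaches 80%.
ranked = importances.sort_values(ascending=False)
n_keep = int((ranked.cumsum() < 0.80).sum()) + 1
selected = ranked.index[:n_keep].tolist()
```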

Here, we are also required to select the “feature types”. In this case, all our features are Numeric; we do not have any nominal (categorical) or descriptive (text) features in this use case. We can now proceed to the next step.

Selecting algorithms to train ML models

In the final step of the app builder, we specify how the Engine should train the ML models. We could let the Engine choose the best algorithms based on estimates that satisfy certain desired criteria, or we can choose them manually. We opt for the former, letting the Engine choose 3 algorithms with an estimated prediction quality of at least 60% and an estimated training time of less than 2 hours:

Letting the Engine select algorithms. We can specify selection criteria to optimize the outcome

When we click “Save and continue”, the app starts processing datasets and training models.

Finishing the process with the App summary page

We are taken to the App’s summary page, showing the progress status of various steps:

Summary page of an app in “processing” state

When the app finishes processing and models are trained, the top model’s performance will appear on the right side, along with the prediction quality:

Summary page of an app in “Ready” state. We can train more models, or make predictions when one or more models have finished training

Model Insights – Evaluation and Explainability

We can see 72.58% prediction quality, which is great given that the ratio of "debt" to "no_problem" cases is very small in the dataset. Upon clicking the model name in this section, we are taken to the model details page, which shows further information about the model and what we can do with it:

The model details page

When we click on “View details” under performance, we can see the detailed evaluation report, including the metrics and charts:

Evaluation metrics and charts for the trained model

In the “Feature Importance” tab, we can generate feature importances, so that we understand which features impact the model most:

Generating feature importances

Feature-importance generation is under way

Result: The generated feature importance scores for our model

This will also generate the Prediction Explanations for a sample of the predictions generated on the test portion, available in the next tab:

Sample prediction explanations generated for our model

Making predictions with trained models

We are satisfied with the model’s performance and the insights that explain how the model makes predictions. The next step is to make predictions whenever we have new data, and there are two ways to make predictions:

  • One-off prediction

  • Scheduled periodic predictions

We first select “Make predictions” from the app summary page, and choose to make either one-off or scheduled predictions.

Making predictions: Access from the summary page of the app

Either option will take us to the prediction set-up page, where we can see the recommended model selected for us already, but we have the option to use a different model if we want to:

Selecting the model to be used to make predictions. The recommended model is chosen based on the highest prediction quality

In the next step, we reach the prediction pipeline builder. Placeholders for the input datasets appear first, each with a link to the corresponding dataset used in training. Here, we can either add a dataset directly or configure a data source (such as a database connection) to ingest the latest data from a table on a periodic basis, if using the “Schedule periodic predictions” option:

The prediction pipeline to be configured: Placeholders for input datasets and recipes are shown, following the same structure created in Step 1 of the app builder

Once an input dataset is added, the corresponding recipe(s) will be populated as its child node(s), if one or more recipes were applied at the time of training to the corresponding training dataset:

Recipes are populated once an input dataset is added

We can also remove these recipes and add modifiable copies of the original recipe, if we want to make changes to the training recipe. This might be necessary, for example, to remove recipe actions that were added at the time of building the app to generate/modify the target column:

We can remove a recipe and replace it with another one or create a new one to run during prediction

Applying a different recipe for prediction

Options to apply a different recipe for prediction

Once all the input datasets and recipes have been added, and all recipe and model inputs are validated, the configuration will be shown as “completed” and we can proceed to the next step:

Completed prediction-pipeline configuration: Input datasets and recipes

Completed prediction-pipeline configuration: Prediction dataset and model

The prediction output will appear as a tabular dataset with the prediction columns appended to the features used to make predictions (the output of the final recipe before the model). In the next step, we can configure the names of these prediction-output columns. Since the use case is binary classification, we can also choose the threshold applied to the predicted probability to get the binary label. Here is an example configuration:

Configuring the names of the prediction-output columns
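Thresholding the probability score to obtain the binary label works as in this small sketch (probability values and column names are illustrative):

```python
import pandas as pd

# Model probability scores for the positive class "debt" (made-up values).
predictions = pd.DataFrame({"probability_debt": [0.10, 0.45, 0.80]})

# Apply the configured decision threshold to obtain the binary label.
THRESHOLD = 0.5
predictions["predicted_status"] = predictions["probability_debt"].apply(
    lambda p: "debt" if p >= THRESHOLD else "no problem"
)
```

Lowering the threshold below 0.5 would flag more applicants as "debt", trading precision for recall; the right setting depends on the cost of missed defaults versus rejected good applicants.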

After this, we can optionally define output destinations for the predictions. The same prediction can be sent to different output destinations simultaneously. This can be either to a project within the organization with the current project being the default, or a database table connection:

Optionally, we can export predictions generated into different destinations

This step is optional; predictions will still be available under the app for download/export later even if no output destinations are defined at this point.

If using scheduled predictions, in the next step we can configure when predictions should be run:

Defining a schedule to run predictions

Once all these steps are completed, either:

  • A prediction run is immediately started, if making a one-off prediction, or

  • A prediction run will start at the scheduled date/time on a periodic basis

Once a prediction run starts, it takes a few minutes to generate the final predictions, depending on the size of the datasets and the complexity of the data-preparation pipeline. We can see them in the app summary page with the status indicators:

Predictions that have run will appear in the summary page of the app

Once predictions are ready, we can preview them, download as a CSV file, or export them to the same/another project or to a database table:

Prediction run finished: Results can be previewed, downloaded (as a file) or exported

Prediction output preview: The last two columns here contain the prediction outputs, i.e., the predicted label for the given threshold and the probability score produced by the model


We have built an application and generated predictions on new data for a loan-status prediction use case in the banking industry. The following is a summary of the steps we have taken:

  • Prepare an ML-ready training dataset

  • Define the target column for prediction, choose features, and apply the best algorithms to train our models

  • Re-use the recipes used to generate the training dataset at prediction time, to transform new data coming in the original schema into an ML-ready prediction dataset

  • Specify additional configurations such as choosing between different types of problems (multi-class vs. binary classification) and the positive class label

The advantages offered by the “Build from scratch” function of the Engine include:

  • Integration of data preparation from multiple datasets into the training and prediction flow.

  • No coding required, allowing business analysts, data analysts, and data scientists to focus on:

    • Framing a problem for the business in terms of data,

    • Domain knowledge and domain-specific problem solving skills

  • Guidance along the way - for example, in recommending the best problem type and in providing feature-importance estimates to guide selection of features

The loan-status prediction problem from the banking industry used in this walkthrough is only one illustration of the “Build from scratch” function: the same workflow can be readily applied to other use cases and industries, highlighting its versatility beyond this specific scenario.