This article is an end-to-end walk through of the “build from scratch” approach to build and use a supervised machine-learning pipeline. We will be using a banking use case as an example.
Watch the walkthrough video:
The AI & Analytics Engine provides two fundamental ways for users to build their ML solutions and use them:
Templates primarily cater to business users who are not necessarily familiar with ML and data-science concepts. They build entire ML pipelines requiring only business-specific inputs. On the other hand, the “build from scratch” option provides analysts and data/ML practitioners with easy-to-use no-code tools to define, compose, and run custom ML pipelines to suit various business use cases. It enables such users to focus on domain knowledge and feature engineering rather than coding and engineering tasks.
In this article, we will consider a specific use case to walk through the process of building and using such a ML pipeline.
The use case
For the purpose of this walkthrough, let’s consider a scenario where your role is a data analyst/scientist in a banking institution, and your objective is to use data to optimize the loan-approval process. The main challenge is to distinguish between loan applicants who are likely to repay their loans on time, and those who are prone to default or delay their payments, by building a model that learns from historical transactions of loan applicants and their final outcomes. This is a binary classification problem, where the features are derived from the loan applications, the customer demographics and their historical transactions, with the target variable being the loan status (“no problem” or “debt”). To expedite the project delivery, the preferred approach is to utilize no-code tools that can eliminate the coding effort, the experimentation time on various methods and also handle the model deployment in production.
The data used in this article are obtained from the Czech bank dataset, which you can download here.
The input data for this use case consists of the following tables:
loans: this table contains information about the loan applications, such as loan ID, customer ID, loan amount, loan duration, and loan status. The loan status is a categorical column with four possible values: A (finished, no problem), B (finished, debt), C (running, no problem), and D (running, debt). This column will be transformed into a binary column with two values: “no problem” and “debt”, which will be used as the target column for the classification problem.
clients: this table contains customer records containing the following attributes: gender, date of birth, and location.
transactions: this table of historical records of customer transactions. Each record contains the customer ID, transaction date, transaction amount, transaction type and balance.
Preview of the loans table
Preview of the clients table
Preview of the transactions table
Getting started with the App Builder
The first step is to login to the Engine, and ensure that you are in the right project in which you want to create the app:
Step 1: Log in to the Engine, then ensure you are in the right project
Step 2: Choose “Build from scratch”
Next, we are prompted to choose two options. Here, we need to choose “Regression or classification” as the task we have for our banking use case is classification:
Step 3: Choose “Regression or classification”
Then we need to provide the app an appropriate name to proceed. Here, we will leave it as the default app name, as we are able to change the name an time later from the “Settings” tab of the App.
Step 4: Configure the app name
Once we confirm the app name, we are taken to the “App builder” view. It contains the steps to prepare data, define what to predict (the target column and the problem type), select the features and then build models.
In the first step, we add the datasets relevant to the use case and generate a single dataset containing the feature columns and the target column, using appropriate data-preparation steps. We can either add existing datasets in the same project that we previously imported, or import them new. Since we do not have any existing datasets, let us proceed with the “Import” option:
Choose “Import new dataset” to add datasets to the App Builder
Under the import options that show up next, we choose the “File upload” option and then choose the file that we want to upload, and proceed. Let us first choose the “clients” dataset for upload:
We can go to the next step, where we can see the preview and the schema after a very short wait:
The schema that the Engine has inferred looks satisfactory, especially since it has correctly detected the type of “date_of_birth” as “datetime” and all other columns except “gender” as numeric.
Upon confirming the creation of the dataset, we are taken to the app builder where we can see the added dataset:
Dataset added to “Prepare Data” step of the App Builder
We can examine the column stats and visualizations with the “View Analysis” icon or remove the dataset from this app (but keep it in the project) with the cross icon:
Quick access to data analysis results (stats and visualizations) within the app builder
Removing an added dataset. The dataset is still retained in the project for other apps
Let us proceed to add the other two datasets: loans.csv and transactions.csv. Once all three datasets are added, we can see them back in the app builder’s “Prepare data” stage.
Preparing the training dataset
The next step is to prepare datasets. We aim to have a single dataset at the end that contains all the information about the loans, stats from the customers' transaction history, and their demographic information in a single tabular-form dataset, so we have a ML-ready dataset to proceed with. If in doubt, we can use the “View guide” link next in the header:
“View guide” links are provided on the Engine where necessary
This pops up a few short slides visually explaining how we are supposed to prepare data:
Slide within “View guide”: Explaining the required structure of the training data
Slide within “View guide”: Explaining how to prepare a dataset
Slide within “View guide”: Explaining how to use the prepared dataset as training input
The Engine provides such guides where necessary to familiarize users with the steps needed to be taken.
Continuing with our use case, let us first prepare the loans and transactions datasets.
Preparing the transactions dataset
Let us first look at the transactions dataset. To prepare a dataset, we first click on it and then choose the option to “Prepare data”.
Select the dataset, then click on “prepare data”
As you can see from the three options, we can either:
Create a new recipe
Copy and modify an existing recipe
Apply an existing recipe as-is without modification
Let us choose the first option, and proceed with the recipe name as “Prepare transactions”:
Creating a new recipe. We can also copy and modify an existing recipe or apply an existing recipe “as is” to the selected dataset
We are then taken to the recipe editor. After a short wait, we can start adding actions to prepare data. On the recipe editor, we can see a preview of the dataset after the added actions have been applied. We haven’t added any, so we are just seeing the preview of the original data here. To add actions, we use the right-side panel:
The recipe editor: Actions can be added using the “Add actions” button on the right panel
The list of actions needed to prepare a dataset differs from dataset to dataset and use case to use case, and is to be decided by us based on our understanding of the dataset, the domain, and the business problem.
For the transactions dataset, all we need is a translation of the non-English (Czech language) labels in the type column into English-language labels.
To do this, we first search for the necessary action using the search bar under “All actions”:
Use the search bar to find the necessary action
The “Map to Text” option is appropriate in this situation. We then choose the desired action to be added and configure it properly. Here, we just need to apply the following translations and add the action:
PRIJEM -> credit
VYDAJ -> withdrawal
VYBER -> withdrawal_cash
Once the configuration is completed, we can add the action:
Configuring the Map to Text action: 1. Choose columns(s) to which the mapping is to be applied , 2. Provide the values to be mapped, 3. Add the action
We can then see the added action in the recipe. Notice that the preview has updated by running the action on the sample, as we can now see the translated labels:
Values in the type column translated after the action is app
Accessing the “Map to Text” action
Once an action is added to a recipe, if we need to change our mind, we have the “Edit” and “Delete” options:
Modifying a recipe by editing or deleting an action
Having ensured we have added the actions we want to the recipe, let us finish it by clicking “Run recipe” and provide a name for the output dataset:
The recipe to be run on the transactions dataset
Configuring the name of the new dataset created by the recipe run
Once we confirm the name of the output dataset, we can see the recipe and the output dataset being generated in the “Prepare data” stage of the app builder:
A recipe in “Running” status with the output dataset yet to be generated
We do not need to wait here, we can proceed to the next dataset, which is “loans”.
Preparing the loans dataset
We follow the same process to create an appropriate recipe. Here is a summary of the steps:
The recipe for the “loans” dataset
We finish the recipe by issuing “Run recipe”. After some time, we can see both the recipes have finished running and have produced the prepared datasets as outputs:
The recipes for preparing the transactions and loans have finished running and the prepared datasets are available
Aggregating and joining to generate the final dataset
Next, we want to generate features to predict the loan status. To do so, we need aggregate stats of transactions over the past up to and until the loan application dates. In turn, we should join the transactions dataset with the loans dataset and filter out transactions that happened later than the loan date, before computing the aggregation, to avoid data leakage from the future into the present. We can do that by selecting the “prepared transactions” dataset and then selecting the “merge with another dataset” option this time:A recipe in “Running” status with the output dataset yet to be generated
This process also creates a recipe, but conveniently adds the action to merge the two datasets on our behalf. We are shown different ways to merge with a dataset: Joins such as left join, outer join, and inner join, lookup values from another dataset, and aggregate-lookup. Let us go with the “join” option:
Options within “Merge with another dataset”. There are several ways to merge two datasets, like joining
Select the dataset to join with, which is the “prepared loans” dataset, then click “create recipe”:
Selecting the dataset to merge with
We are then taken to the recipe-editor view with the join action open. We can modify and confirm the join parameters inferred by the Engine here. Let us do a right join, since we are only interested in the transactions from accounts that have a loan. Further, from the loans dataset, we can keep all columns except for the account id, as the transaction dataset already has an account id:
A selected merging action is added to a new editable recipe. We can customize the action further
Save changes to add the action, then we can see columns added from the loans dataset as well:
Result of the join action. We can see the columns added from the loans dataset
Next, we want to remove the entries where the transaction date is past the loan date, so that we do not leak data from the future into the past. To do so, we choose the “filter” action from the “All actions” catalogue, and enter the formula "date < loan_date" to keep only the rows satisfying this criterion.
Accessing the “filter rows” action
The (boolean) criterion we want to filter by: Use only the transaction history that happened before the loan agreement date
Then for each loan, we will compute the following aggregate stats:
Number of transactions of each type
Minimum, maximum, mean, median, and standard deviation of:
amounts for each transaction type
interval between transactions of each type
Days since the last transaction of each type
To facilitate finding the average interval between transactions of each type, we first want to compute the "lag" function of the date column, partitioned by "account_id" and "type", ordered by the "date"column. This can be accomplished with the “Shift rows…” action:
Accessing the “Shift rows” action to get the previous transaction of the same type
Let’s name the output column produced by this action as "previous_date_of_same_transaction_type"
Configuring the “Shift rows” action: We want to partition by both the "account_id" and "type" columns to compute the lagged date that contains the date of the previous transaction of the same type from the same customer as the current transaction
We add this action, and then compute the number of days elapsed between "date" and "previous_date_of_same_transaction_type" using the “compute time elapsed” action, naming the output as "days_since_last_transaction_of_same_type". Note that we need to specify the optional “time unit” parameter as “days” instead of leaving it to be the default setting of “seconds”:
Accessing the “Compute time elapsed” action to compute the difference in days between two dates
Configuring the “Compute time elapsed” action: We want to subtract the date of the previous transaction of the same type from the date of the current transaction, to get the intervals in days between successive transactions of the same type
Then, to compute multiple stats for each transaction type, we can make use of the “Reshape dataset into a pivot table” action, specifying account_id for the group-by column, "type" for the pivot column, and "credit", "withdrawal", and "withdrawal_cash" for the pivot values. We can add multiple aggregates as follows:
Using the “Pivot” action to compute aggregate features: We pivot by the transaction’s type to get stats of the amount and interval in days columns under the specified transaction types (pivot values) separately
Upon saving changes, we can see the pivot action is added and the aggregations are generated for each pivot value (transaction type): "credit", "withdrawal", and "withdrawal_cash":
Result of the pivot action: Aggregate features can be seen in the preview
After this we add a few more actions:
Completing the final recipe with the required actions: Getting the features from the loans dataset back, then getting features from the clients dataset, computing recency and age features
Here are the reasoning behind these actions these actions. After adding the pivot action to compute aggregates by transaction type, the columns from the “loans” dataset were lost, but we get them back by joining again using the same join keys. We then join with the “clients” data to get the demographic features "date_of_birth", "age" and "district_id".
We then extract more features from the datetime columns:
Days since last transaction of each type at loan-application day: using the “Compute time elapsed” action
Age of the person: using the “Compute time elapsed” action, followed by division by 365 to convert days to years
As a final step, we drop all columns of "DateTime" type (as we have already extracted features from them) and the column "account_id" as it is not a feature column, and the intermediate column storing the age in days instead of years.
We can now run this recipe to produce the final training dataset:
Output dataset of the recipe joining and aggregating from three sources
We need one more important feature from the transactions dataset: The minimum balance at the time of loan application. To get this feature, we start again from the prepared transactions dataset. We will be joining with both the prepared loans dataset and the dataset output by the latest recipe we worked on.
Here is a summary of the recipe:
Recipe to generate the “minimum balance” feature and the input dataset it is created from
We do a right join with the prepared loans dataset and filter out transactions whose dates are later than the loan application date. We then compute the minimum of the balance column grouped by the account_id column. We then join with the features that we generated before, using the "account_id" column as the key.
Selecting the training dataset
Let us next define what to predict. To proceed to that step, we need to select our final dataset and set it as the training dataset:
Selecting the training dataset
We can then click “Save and continue” to proceed to the next step, to define what to predict:
Step 1 is completed & proceeding to Step 2
Selecting the target column and problem type
In the next step, we first need to select the target column. Here, it is the column named “status”:
The target column needs to be specified
Once the column is chosen, the Engine automatically selects “binary classification” with “debt” as the positive label and disables “Regression”, as we are predicting two classes with labels “debt” and “no problem”. We can also configure the train/test split. Let us leave it as the default of 80% train (and 20% test) this time:
For the specified target column, the prediction type and the train/test split configuration can be chosen
We hit “save and continue” again to proceed to the next step to select features.
Here, we can either let the Engine:
Automatically select the best set of features such that the top most features with a total importance reaches a desired percentage threshold
Manually select the features with the feature-importance estimates
Selecting features: Automatic and manual options
Initially, the feature-importance estimates will not be available as the Engine needs to compute them in the background. If we are confident, we have the option to manually select all of them, but for this use case walkthrough, we will wait instead.
The Engine takes a bit of time to compute feature importances
Once the estimation is done, we can sort features by importance and select the top most ones that reach a certain total percentage threshold:
Features ordered by decreasing importance: Available when the Engine finishes analyzing data
Here, we are also required to select the “feature types”. In this case, all our features are Numeric; we do not have any nominal (categorical) or descriptive (text) features in this use case. We can now proceed to the next step.
Selecting algorithms to train ML models
In the final step of the app builder, we specify how the Engine should train the ML models. We could let the Engine choose the best algorithms and train them based on estimates that satisfy certain desired criteria, or we can manually choose them. We opt for the former, letting the Engine choose 3 algorithms with estimated prediction quality at least 60% and estimated training time less than 2 hours:
Letting the Engine select algorithms. We can specify selection criteria to optimize the outcome
When we click “Save and continue”, the app starts processing datasets and training models.
Finishing the process with the App summary page
We are taken to the App’s summary page, showing the progress status of various steps:
Summary page of an app in “processing” state
When the app finishes processing and models are trained, the top model’s performance will appear on the right side, along with the prediction quality:
Summary page of an app in “Ready” state. We can train more models, or make predictions when one or more models have finished training
Model Insights – Evaluation and Explainability
We can see 72.58% prediction quality, which is great given that the ratio of "debt" to "no_problem" cases is very small in the dataset. Upon clicking the model name from this section, we can see the details of the model. It shows further details about the model, and what you can do with it:
The model details page
When we click on “View details” under performance, we can see the detailed evaluation report, including the metrics and charts:
Evaluation metrics and charts for the trained model
In the “Feature Importance” tab, we can generate feature importances, so that we understand which features impact the model most:
Generating feature importances
Feature-importance generation is under way
Result: The generated feature importance scores for our model
This will also generate the Prediction Explanations for a sample of the predictions generated on the test portion, available in the next tab:
Sample prediction explanations generated for our model
Making predictions with trained models
We are satisfied with the model’s performance and the insights that explain how the model makes predictions. The next step is to make predictions whenever we have new data, and there are two ways to make predictions:
Scheduled periodic predictions
We first select “Make predictions” from the app summary page, and choose to make either one-off or scheduled predictions.
Making predictions: Access from the summary page of the app
Either option will take us to the prediction setup-up page, where we can see the recommended model selected for us already, but we have the option to use a different model if we want to:
Selecting the model to be used to make predictions. The recommended model is chosen based on the highest prediction quality
In the next step, we have the prediction pipeline builder. We have the placeholders to add the input datasets first, with the links to the corresponding datasets used in training under each. Here, we can either add a dataset directly or configure a data source (such as a database connection) to ingest latest data from a table on a periodic basis, if using the “Schedule periodic predictions” option:
The prediction pipeline to be configured: Placeholders for input datasets and recipes are shown, following the same structure created in Step 1 of the app builder
Once an input dataset is added, the corresponding recipe(s) will be populated as its child node(s), if one or more recipes were applied at the time of training to the corresponding training dataset:
Recipes are populated once an input dataset is added
We can also remove these recipes and add modifiable copies of the original recipe, if we want to make changes to the training recipe. This might be necessary for example to remove recipe actions added at the time of building the app to generate / modify the target column:
We can remove a recipe and replace it with another one or create a new one to run during prediction
Applying a different recipe for prediction
Options to apply a different recipe for prediction
After adding all the input datasets and the recipes, and all recipes & model inputs are validated, the configuration will be shown as “completed” and we can proceed to the next step:
Completed prediction-pipeline configuration: Input datasets and recipes
Completed prediction-pipeline configuration: Prediction dataset and model
The prediction will appear as a tabular dataset with these additional columns, along with the features used to make predictions (output of the final recipe before the model). In the next step, we can configure the names of these prediction output columns. Here, since the use case is binary classification, we can also choose the threshold applied to the probability to get the binary label. Here is an example configuration:
Configuring the names of the prediction-output columns
After this, we can optionally define output destinations for the predictions. The same prediction can be sent to different output destinations simultaneously. This can be either to a project within the organization with the current project being the default, or a database table connection:
Optionally, we can export predictions generated into different destinations
This step is optional; predictions will still be available under the app for download/export later even if no output destinations are defined at this point.
If using scheduled predictions, in the next step we can configure when predictions should be run:
Defining a schedule to run predictions
Once all these steps are completed, either:
A prediction run is immediately started, if making a one-off prediction, or
A prediction run will start at the scheduled date/time on a periodic basis
Once a prediction run starts, it takes a few minutes to generate the final predictions, depending on the size of the datasets and the complexity of the data-preparation pipeline. We can see them in the app summary page with the status indicators:
Predictions that have run will appear in the summary page of the app
Once predictions are ready, we can preview them, download as a CSV file, or export them to the same/another project or to a database table:
Prediction run finished: Results can be previewed, downloaded (as file) or exported
Prediction output preview: The last two columns here contain the prediction outputs i.e., the predicted label for the given threshold and the probability score produced by the model
We have built an application and generated predictions on new data for a loan-status prediction use case in the banking industry. The following is a summary of the steps we have taken:
Prepare an ML-ready training dataset
Define the target column for prediction, choose features, and apply the best algorithms to train our models
Re-use the recipes used to generate the training dataset at prediction time, to transform new data coming in the original schema into an ML-ready prediction dataset
Specify additional configurations such as choosing between different types of problems (multi-class vs. binary classification) and the positive class label
The advantages offered by the “Build from scratch” function of the Engine include:
Integration of data preparation from multiple datasets into the training and prediction flow.
No coding required, allowing business analysts, data analysts, and data scientists to focus much better on:
Framing a problem for the business in terms of data,
Domain knowledge and domain-specific problem solving skills
Guidance along the way - for example, in recommending the best problem type and in providing feature-importance estimates to guide selection of features
The loan-status prediction problem in the banking industry in this walkthrough serves as an illustration of the “Build from scratch” function, which means that it can be readily applied to other use cases and industries. This highlights the versatility and applicability of the “Build from scratch” function beyond the specific use case of the banking industry.