This article is an end-to-end guide to the “build from scratch” approach to building and using a supervised machine learning pipeline, using banking data.
Watch the walkthrough video:
The AI & Analytics Engine provides two fundamental ways for users to build and use their ML solutions:
The Build from scratch option provides you with an easy-to-use no-code tool to define, compose, and run custom ML pipelines to suit various business use cases.
In this article, a banking use case is used to walk you through the process of building and using an ML pipeline.
The binary classification banking use case
Consider a scenario where your role is a data analyst/scientist in a banking institution, and your objective is to use data to optimize the loan approval process.
The main challenge is to distinguish between loan applicants who are likely to repay their loans on time, and those who are prone to default or delay their payments, by building a model that learns from historical transactions of loan applicants and their final outcomes.
This is a binary classification problem, where the features are derived from the loan applications, the customer demographics, and their historical transactions, with the target variable being the loan status (no problem or debt).
📂 The Czech bank dataset was used for this guide. Why not download the datasets and generate your own predictions?
The input data for this use case consists of three tables:
Loans dataset: This table contains information about loan applications, such as loan ID, customer ID, loan amount, loan duration, and loan status. The loan status is a categorical column with four possible values:
A (finished, no problem),
B (finished, debt),
C (running, no problem), and
D (running, debt).
The loan status column will be transformed into a binary column with two values, no problem and debt, which will be used as the target column for the classification problem (a small sketch of this transformation follows the table previews below).
Clients dataset: This table contains customer records containing the following attributes: gender, date of birth, and location.
Transactions dataset: This table contains historical records of customer transactions. Each record contains the customer ID, transaction date, transaction amount, transaction type, and balance.
Preview of the loans table
Preview of the clients table
Preview of the transactions table
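Although the Engine performs the target transformation through a no-code recipe action, the mapping itself is simple. Below is a minimal pandas sketch of the same idea, using a toy stand-in for the loans table; the column names are assumptions, not the exact schema of the downloaded files.

```python
import pandas as pd

# Toy stand-in for the loans table; in practice you would load the downloaded
# CSV. Column names ("loan_id", "status") are assumptions for illustration.
loans = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "status": ["A", "B", "C", "D"],
})

# A and C (no problem) -> "no_problem"; B and D (debt) -> "debt"
status_map = {"A": "no_problem", "C": "no_problem", "B": "debt", "D": "debt"}
loans["loan_status_binary"] = loans["status"].map(status_map)
print(loans)
```

Grouping A and C as no problem and B and D as debt gives the two-class target used throughout the rest of this guide.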
Getting started with the App Builder
The first step is to log in to the Engine and ensure that you are in the right project in which you want to create the app. From the two options shown to build your app, choose Build from scratch.
Project overview page, select app builder Build from scratch
Next, you are prompted to choose between two options. Choose Regression or classification, as the task for this banking use case is classification:
App builder, select problem type
Once you confirm the app name, you are taken to the App builder view. It contains the following steps:
Prepare the data,
Define what to predict (the target column and the problem type), and
Select the features and algorithms.
Step 1: Prepare data
In the first step, the goal is to add the three datasets relevant to the use case and generate a single dataset containing the feature columns and the target column. This single dataset is called an ML-ready dataset.
🎓 An ML-ready dataset is required to build an ML model. For more information, read What is a machine learning-ready dataset?
To do this, you can either add existing datasets from the same project that you had previously imported or you can import a new one. Since there are no existing datasets, proceed with the Import option.
App builder pipeline, import new data into app
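Before moving on, it helps to picture what the ML-ready dataset will look like once the three tables are combined. The following pandas sketch is only illustrative: the Engine builds this dataset through recipes, and the column names and aggregations shown here are assumptions based on the table descriptions above.

```python
import pandas as pd

# Toy stand-ins for the three imported tables; all column names are assumptions.
loans = pd.DataFrame({"loan_id": [1, 2], "customer_id": [10, 11],
                      "amount": [5000, 8000],
                      "loan_status_binary": ["no_problem", "debt"]})
clients = pd.DataFrame({"customer_id": [10, 11], "gender": ["F", "M"],
                        "birth_date": ["1970-01-01", "1985-06-15"]})
transactions = pd.DataFrame({"customer_id": [10, 10, 11],
                             "amount": [200, -50, 400],
                             "balance": [1200, 1150, 900]})

# Aggregate each customer's transaction history into per-customer features
txn_features = (transactions.groupby("customer_id")
                .agg(txn_count=("amount", "size"),
                     txn_amount_mean=("amount", "mean"),
                     balance_min=("balance", "min"))
                .reset_index())

# One row per loan application, carrying client attributes, transaction
# features, and the binary target column: the shape of an "ML-ready" dataset.
ml_ready = (loans.merge(clients, on="customer_id", how="left")
                 .merge(txn_features, on="customer_id", how="left"))
print(ml_ready)
```

The key point is that the result has one row per loan application, with the feature columns and the target column side by side.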
When the app finishes processing and models are trained, the top model’s performance will appear on the right side, along with the prediction quality.
Summary page of an app in “Ready” state. We can train more models, or make predictions when one or more models have finished training
Understanding your model
Head to the Model tab. From there, you can select any model by clicking its name, which will take you to the Model summary page.
Model Insights – Evaluation and Explainability
You can see a prediction quality of 72.58%, which is great given that the ratio of debt to no_problem cases in the dataset is very small.
Upon clicking the model name from this section, you can see details about the model, and what you can do with it.
Select “View details” for the performance card.
The model details page
Here you can see the detailed evaluation report, including the metrics and charts.
🎓 To learn more about model evaluation metrics read Which metrics are used to evaluate a binary classification model's performance?
Evaluation metrics and charts for the trained model
In the Feature Importance tab, you can generate feature importances, so that you understand which features impact the model most.
The generated feature importance scores for our model
This will also generate the Prediction Explanation for a sample of the predictions generated on the test portion, available in the next tab.
Sample prediction explanations generated for our model
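The Engine generates these scores and explanations for you, so no code is required. Purely as a conceptual analogue, the sketch below shows one common way feature importance can be estimated outside the Engine, namely permutation importance in scikit-learn on synthetic data; it is not the Engine’s own method, and the model and data are placeholders.

```python
# Conceptual sketch only: permutation importance estimates how much a trained
# model relies on each feature by measuring the score drop when that feature
# is shuffled. The data and model below are synthetic placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Higher mean importance -> the model relies more on that feature
for i, score in enumerate(result.importances_mean):
    print(f"feature_{i}: {score:.3f}")
```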
Making predictions with trained models
If you are satisfied with the model’s performance and the insights that explain how the model makes predictions, the next step is to make predictions whenever you have new data. There are two ways to make predictions:
One-off predictions
Scheduled periodic predictions
First select Make predictions from the App summary page, and choose to make either one-off or scheduled predictions.
Making predictions: Access from the summary page of the app
Either option will take you to the prediction pipeline page, where the recommended model is already selected; you have the option to use a different model if desired.
In the next step, you can see the prediction pipeline in the builder view. Placeholders for adding the input datasets come first, with links to the corresponding datasets used in training shown under each.
Here, you can either add a dataset directly or configure a data source (such as a database connection) to ingest the latest data from a table on a periodic basis, if using the “Schedule periodic predictions” option.
The prediction pipeline to be configured: Placeholders for input datasets and recipes are shown, following the same structure created in Step 1 of the app builder
Once an input dataset is added, the corresponding recipe(s) will be populated as its child node(s), if one or more recipes were applied at the time of training to the corresponding training dataset.
Recipes are populated once an input dataset is added
You can also remove these recipes and add modifiable copies of the original recipe if you wish to make changes to the training recipe. This might be necessary, for example, to remove recipe actions that were added when building the app to generate or modify the target column.
Remove a recipe and replace it with another one or create a new one to run during the prediction
Once all the input datasets and recipes have been added, and all recipes and model inputs have been validated, the configuration is shown as “completed” and you can proceed to the next step.
Completed prediction-pipeline configuration: Prediction dataset and model
The prediction output will appear as a tabular dataset with additional prediction columns (the predicted label and its probability score), along with the features used to make predictions (the output of the final recipe before the model).
In the next step, you can configure the names of these prediction output columns. Since the use case here is binary classification, you can also choose the threshold applied to the probability score to obtain the binary label. Here is an example configuration:
Configuring the names of the prediction-output columns
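To make the role of the threshold concrete, here is a tiny illustrative sketch of how a probability score is converted into the binary label. The 0.5 threshold, column names, and values are assumptions for the example only; in the Engine you simply set the threshold in this configuration step.

```python
import pandas as pd

# Illustrative only: a probability score above the threshold becomes "debt",
# otherwise "no_problem". Threshold and column names are assumptions.
threshold = 0.5
predictions = pd.DataFrame({"probability_debt": [0.12, 0.48, 0.73]})
predictions["predicted_label"] = predictions["probability_debt"].apply(
    lambda p: "debt" if p >= threshold else "no_problem"
)
print(predictions)
```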
After this, you can optionally define output destinations for the predictions.
The same prediction can be sent to multiple output destinations simultaneously. A destination can be either a project within the organization (the current project being the default) or a database table connection.
This step is optional; predictions will still be available under the app for download/export later even if no output destinations are defined at this point. If using scheduled predictions, in the next step you can configure when predictions should be run.
Optionally, predictions generated can be exported into different destinations
Once all these steps are completed, either:
A prediction run is started immediately, if making a one-off prediction; or
A prediction run will start at the scheduled date/time on a periodic basis.
Once a prediction run starts, it takes a few minutes to generate the final predictions, depending on the size of the datasets and the complexity of the data-preparation pipeline.
You can track predictions on the app summary page with the status indicators.
Predictions that have run will appear in the summary page of the app
Once predictions are ready, you can preview them, download them as a CSV file, or export them to the same/another project or to a database table.
Prediction run finished: Results can be previewed, downloaded (as a file) or exported
Select Preview details to view a preview of the dataset. Here, you can see that the last two columns contain the prediction outputs, i.e., the predicted label for the given threshold and the probability score produced by the model.
Prediction output preview
You have built an application and generated predictions on new data for a loan-status prediction use case in the banking industry. The following is a summary of the steps taken:
Prepared an ML-ready training dataset.
Defined the target column for prediction, chose features, and applied the best algorithms to train your models.
Re-used the recipes that generated the training dataset at prediction time, to transform new data arriving in the original schema into an ML-ready prediction dataset.
Specified additional configurations, such as choosing between different types of problems (multi-class vs. binary classification) and the positive class label.