KNIME for Finance: Fraud detection using isolation forest

This is part of a series of articles to show you solutions to common finance tasks related to financial planning, accounting, tax calculations, and auditing problems all implemented with the low-code KNIME Analytics Platform.

Detecting fraudulent transactions often requires parsing very large amounts of data. As fraud patterns continually evolve, the need for a time-efficient algorithm becomes increasingly important. In the last part of our series, we used DBSCAN as an alternative to quantiles to improve fraud detection. However, as outlined in the previous article, DBSCAN still has its limitations, particularly in handling large amounts of sample data.

To address these challenges, we introduce Isolation Forest, an unsupervised algorithm that offers significant advantages over DBSCAN for detecting anomalies in large datasets. Unlike traditional methods that attempt to profile normal data points and then identify outliers, Isolation Forest directly isolates anomalies, reducing the need for extensive data processing.

Isolation Forest is not natively supported in Excel, so using it usually means setting up a separate environment, for example in Python. The KNIME Analytics Platform offers out-of-the-box support for Isolation Forest; you only need to install the 'KNIME H2O Machine Learning Integration' extension. Here, we aim to demonstrate through a visual, low-code platform how we can detect fraudulent transactions using Isolation Forest in KNIME Analytics Platform.

What is an Isolation Forest?

Isolation Forest is an algorithm similar to Random Forest in that both use decision trees. However, instead of focusing on identifying common data patterns, Isolation Forest is designed specifically to detect anomalies. It does this by randomly selecting a feature and then a split value within the range of that feature. This process is repeated recursively to create isolation trees, and anomalies tend to be isolated after fewer splits than normal points. Because of this random partitioning, the Isolation Forest algorithm runs very fast, with linear time complexity. In comparison, algorithms like DBSCAN (which we discussed in our previous article) have log-linear time complexity at best. This makes Isolation Forest more efficient on large datasets and improves runtime performance significantly.
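To make the mechanism concrete, here is a minimal pure-Python sketch of isolation trees. It is not the H2O implementation used in the KNIME workflows; the function names, the synthetic two-dimensional data, the 100 trees, and the subsample size of 64 are all assumptions chosen for illustration. The normalizing term c(n) and the score 2^(-mean path length / c(n)) follow the standard Isolation Forest formulation.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, rng, depth, max_depth):
    """Recursively partition points with random feature/split choices."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:
        # Simplification: stop here instead of trying another feature.
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }


def path_length(tree, point, depth=0):
    """Number of splits needed to reach a leaf, adjusted for unsplit leaves."""
    if "split" not in tree:
        return depth + c(tree["size"])
    branch = "left" if point[tree["feature"]] < tree["split"] else "right"
    return path_length(tree[branch], point, depth + 1)


def mean_path_length(forest, point):
    return sum(path_length(tree, point) for tree in forest) / len(forest)


def anomaly_score(forest, point, sample_size):
    """Scores closer to 1 suggest an anomaly; lower scores suggest a normal point."""
    return 2.0 ** (-mean_path_length(forest, point) / c(sample_size))


# Synthetic data (illustrative only): a dense cluster plus one far-away point.
rng = random.Random(0)
normal_points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(256)]
outlier = (8.0, 8.0)
data = normal_points + [outlier]

sample_size = 64
max_depth = math.ceil(math.log2(sample_size))
forest = [
    build_tree(rng.sample(data, sample_size), rng, 0, max_depth)
    for _ in range(100)
]

outlier_length = mean_path_length(forest, outlier)
inlier_length = mean_path_length(forest, (0.0, 0.0))
print(f"mean path length, outlier: {outlier_length:.2f}")
print(f"mean path length, inlier:  {inlier_length:.2f}")
```

In this sketch the far-away point ends up with a shorter mean path length than a point in the middle of the cluster, which is the signal Isolation Forest relies on.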

Although Isolation Forest runs quickly and can handle large datasets, it is not without limitations. The algorithm often needs parameter tuning depending on the dataset and is sensitive to the sample size it is trained on: larger sample sizes carry a higher risk of “swamping” and “masking” effects, where anomalies get buried within normal data, reducing the model's effectiveness.

Despite the advantages of Isolation Forest, its limitations need to be carefully managed, especially when applied to real-world datasets where anomalies are sparse. Parameter tuning and understanding the data distribution are still necessary for the algorithm to perform optimally.

With this understanding, let's explore how we can apply Isolation Forest to a practical use case: detecting fraudulent transactions with real credit card data.

Identify fraudulent transactions with isolation forest

Credit card transactions can generally fall into two categories: normal and suspicious. The goal is to accurately identify fraudulent transactions while minimizing the number of false positives—cases where legitimate transactions are incorrectly flagged as fraudulent. Ideally, only a small percentage of flagged transactions should turn out to be false positives.

In our use case, we will focus on automating fraud detection by training a model on a labeled dataset and applying it to a new transaction to simulate incoming data from an outside data source.

For this, we will use the popular Credit Card Fraud Detection dataset from Kaggle. This dataset consists of real credit card transactions made by European cardholders in September 2013, with the transaction details anonymized for privacy. It includes a total of 284,807 transactions over a two-day period, of which 492 are fraudulent. The dataset shows a severe class imbalance between ‘good’ (0) and ‘frauds’ (1), where ‘frauds’ account for only 0.172% of the data.
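The imbalance can be checked directly from the two counts quoted above. The short calculation below uses only those counts; the variable names are illustrative.

```python
total_transactions = 284_807
fraud_transactions = 492

normal_transactions = total_transactions - fraud_transactions
fraud_share = fraud_transactions / total_transactions
normal_per_fraud = normal_transactions / fraud_transactions

print(f"Normal transactions: {normal_transactions}")
print(f"Fraud share: {fraud_share:.4%}")
print(f"Normal transactions per fraud: {normal_per_fraud:.0f}")
```

The computed share lands at roughly 0.17%, in line with the figure quoted for the dataset, which means there are several hundred normal transactions for every fraudulent one.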

The dataset contains 31 columns:

V1 - V28: These are numerical input variables from a PCA (Principal Component Analysis) transformation
Time: This column represents the time in seconds elapsed between the first transaction in the dataset and the current transaction
Amount: The monetary value of the transaction
Class: This is the target variable, where ‘1’ indicates a fraudulent transaction and ‘0’ indicates a normal (non-fraudulent) transaction.

The “Class” variable is key for our use case, as it lets us evaluate the performance of the algorithm on the dataset.

The process for creating our classification model is outlined below. Even when handling data from multiple sources, the core steps remain consistent:

Create/import a labeled training dataset
Train the model
Evaluate model performance
Import new, unseen transactions
Deploy the model and feed in new transactions
Notify if any fraudulent transactions are classified

Training a machine learning model to identify fraudulent transactions

All workflows used in this article are available publicly and free to download on the KNIME Community Hub. You can find the workflows on the KNIME for Finance space under Fraud Detection in the Isolation Forest section.

As an unsupervised learning model, Isolation Forest identifies outliers by using random partitioning to build its trees, which keeps training efficient even on large datasets.

The first workflow focuses on training our Isolation Forest model. You can view and download the training workflow Isolation Forest Training from the KNIME Community Hub. With this workflow you can:

Read training data from a specified data source. In our case, we use data from a Kaggle dataset.
Preprocess data by splitting the data into 2 sets:
- the top port has ⅔ of the normal transactions
- the bottom port contains the remaining normal transactions along with all the fraudulent transactions.
Train the Isolation Forest model using the top port with normal transactions, then apply the trained model to the testing set, that is, the bottom port from our preprocessing component.
Classify transactions based on the mean path length across the isolation trees, where a shorter length indicates a higher likelihood of the transaction being fraudulent.
Evaluate model results by opening the view of the Scorer node to check the overall accuracy of the model.
Save the model for deployment in the subsequent workflow if the performance meets expectations.
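The preprocessing and classification steps above can be sketched in plain Python. This is an approximation of the logic rather than a copy of the KNIME components: the function names, the row format (dictionaries with a "Class" key), the random shuffle, and the threshold value are assumptions for illustration, and the sketch assumes the test set holds the normal transactions that were not used for training.

```python
import random


def split_for_training(rows, train_fraction=2 / 3, seed=0):
    """Return (train, test): train holds only normal rows,
    test holds the remaining normal rows plus every fraudulent row."""
    normal = [row for row in rows if row["Class"] == 0]
    fraud = [row for row in rows if row["Class"] == 1]
    rng = random.Random(seed)
    rng.shuffle(normal)
    cut = round(len(normal) * train_fraction)
    return normal[:cut], normal[cut:] + fraud


def classify_by_mean_length(mean_length, threshold):
    """Flag as fraud (1) when the mean path length is at or below the threshold."""
    return 1 if mean_length <= threshold else 0


# Toy rows (illustrative only): 30 normal and 3 fraudulent transactions.
rows = [{"id": i, "Class": 0} for i in range(30)]
rows += [{"id": 100 + i, "Class": 1} for i in range(3)]

train, test = split_for_training(rows)
print(len(train), "training rows,", len(test), "test rows")

# Hypothetical threshold; in practice it would be tuned on the scored test set.
THRESHOLD = 5.0
print(classify_by_mean_length(3.2, THRESHOLD))  # short path, flagged
print(classify_by_mean_length(7.9, THRESHOLD))  # long path, not flagged
```

Training only on normal transactions means the model never sees a fraud example; the labels are used afterwards, when the flagged rows in the test set are compared against the “Class” column.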

Figure 1: Workflow for training and scoring the model using isolation forest

In our second workflow, Isolation Forest Deployment, you can perform the following steps:

Read the model and new data for classification
Apply the Isolation Forest to the incoming transaction/new data
Classify the new transaction using the mean length from the ‘H2O Isolation Forest Predictor’. A shorter path length typically indicates a higher likelihood of fraud.
Send an Email notification automatically to relevant parties if a transaction is flagged as fraudulent.

Figure 2: Workflow for deploying the trained model on new data

Inside the Send Email component, we evaluate whether the transaction has been classified as fraudulent or not. If it is flagged as fraudulent, an email is sent to the specified person for further follow-up.
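The same branching logic can be sketched outside KNIME with Python's standard library. The function name, addresses, and message wording below are hypothetical, and the sketch only builds the message; actually sending it would require an SMTP server, which is left out here.

```python
from email.message import EmailMessage


def build_fraud_alert(transaction_id, predicted_class,
                      recipient="fraud-team@example.com"):
    """Return an alert email for a flagged transaction, or None otherwise."""
    if predicted_class != 1:
        # Legitimate transaction: nothing to send.
        return None
    message = EmailMessage()
    message["Subject"] = f"Possible fraudulent transaction: {transaction_id}"
    message["From"] = "alerts@example.com"  # placeholder sender address
    message["To"] = recipient
    message.set_content(
        f"Transaction {transaction_id} was classified as fraudulent "
        "and needs manual review."
    )
    return message


no_alert = build_fraud_alert("T-1001", 0)
alert = build_fraud_alert("T-1002", 1)
print(no_alert)
print(alert["Subject"])
```

Returning None for legitimate transactions mirrors the closed port in the workflow: no message object exists, so nothing can be sent by accident.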

An Isolation Forest model for classifying transactions

In our training workflow, opening the Scorer (JavaScript) node gives us the confusion matrix below.

The confusion matrix summarizes the performance of the Isolation Forest classification method, indicating an overall accuracy of 97.49%. This is essentially the same as some of the well-performing techniques from our previous articles, such as Random Forest and DBSCAN.
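For readers who want to see how such figures are derived, the sketch below computes a confusion matrix and a few summary metrics from a tiny set of made-up labels. The labels are illustrative only and are not the results of the workflow; the function names are assumptions. Because fraud is rare, it can help to look at precision and recall alongside accuracy.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) with 1 = fraud and 0 = normal."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn


def summarize(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up labels for illustration, not the article's results.
actual = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

counts = confusion_counts(actual, predicted)
accuracy, precision, recall = summarize(*counts)
print("tp, fp, tn, fn =", counts)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```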

Isolation Forest is a great alternative to DBSCAN, especially because of its faster runtime. While DBSCAN is robust in detecting clusters and handling noise, Isolation Forest is particularly effective at finding outliers, which makes it well suited to fraud detection. Its speed and efficiency also make it a strong option for real-time applications.

When a new transaction is processed, it is converted into an H2O frame and fed into the model predictor. This step provides the classification value for the transaction. If the new transaction is classified as ‘not fraudulent’ or ‘good’, our switch statement closes the port to the email, ensuring that no alerts are sent for legitimate transactions.

Above we have two snippets which show the expected outcomes based on the classification of the transaction.

On the left, we see the scenario for a non-fraudulent transaction (or ‘0’). In this case, the system processes the transaction without any alerts, and the port remains closed, indicating that no email notification is sent.
On the right, the port opens if the transaction is classified as fraudulent (or ‘1’). This triggers the system to send an email notification to the specified recipient, alerting them of the potentially fraudulent activity and prompting further investigation.

Why KNIME for Finance

KNIME Analytics Platform provides a simple and intuitive setup to integrate advanced machine learning techniques into our use case of fraud detection. With KNIME, you can implement advanced algorithms such as Isolation Forest with minimal coding and out-of-the-box support through KNIME Extensions. This article is part of a series on fraud detection. In the next article, we will cover our last method, outlier detection using distributions.
