Fraud Detection with Generative Adversarial Nets (GANs) | by Michio Suginoo | Jan, 2024



Finally, it’s time for us to use GANs for data augmentation.

So how much synthetic data do we need to create?

First of all, our interest in data augmentation is limited to model training. Since the test dataset is out-of-sample data, we want to preserve it in its original form. Secondly, because our intention is to balance the dataset perfectly, we do not want to augment the majority class of non-fraud cases.

Simply put, we want to augment only the train dataset of the minority fraud class, nothing else.

Now, let’s split the working dataframe into the train dataset and the test dataset in 80/20 ratio, using a stratified data split method.

# Separate features and target variable
X = df.drop('Class', axis=1)
y = df['Class']

# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Combine the features and the label for the train dataset
train_df = pd.concat([X_train, y_train], axis=1)

As a result, the shape of the train dataset is as follows:

  • train_df.shape = (226010, 7)

Let’s see the composition (the fraud cases and the non-fraud cases) of the train dataset.

# Load the dataset (fraud and non-fraud data)
fraud_data = train_df[train_df['Class'] == 1].drop('Class', axis=1).values
non_fraud_data = train_df[train_df['Class'] == 0].drop('Class', axis=1).values

# Calculate the number of synthetic fraud samples to generate
num_real_fraud = len(fraud_data)
num_synthetic_samples = len(non_fraud_data) - num_real_fraud
print("# of non-fraud: ", len(non_fraud_data))
print("# of Real Fraud:", num_real_fraud)
print("# of Synthetic Fraud required:", num_synthetic_samples)

# of non-fraud: 225632
# of Real Fraud: 378
# of Synthetic Fraud required: 225254

This tells us that the train dataset (226,010 rows) comprises 225,632 non-fraud cases and 378 fraud cases. The difference between them, 225,254, is the number of synthetic fraud samples (num_synthetic_samples) that we need to generate in order to perfectly match the sizes of the two classes within the train dataset. As a reminder, we preserve the original test dataset.

Next, let’s code GANs.

First, let’s create custom functions to determine the two agents: the discriminator and the generator.

For the generator, I create a custom function, build_generator(), which takes two parameters: latent_dim (the dimension of the input noise) defines the shape of its input; and output_dim, which corresponds to the number of features, defines the shape of its output.

# Define the generator network
def build_generator(latent_dim, output_dim):
    model = Sequential()
    model.add(Dense(64, input_shape=(latent_dim,)))
    model.add(Dense(128, activation='sigmoid'))
    model.add(Dense(output_dim, activation='sigmoid'))
    return model

For the discriminator, I create a custom function, build_discriminator(), which takes input_dim, corresponding to the number of features.

# Define the discriminator network
def build_discriminator(input_dim):
    model = Sequential()
    model.add(Input(shape=(input_dim,)))
    model.add(Dense(128, activation='sigmoid'))
    model.add(Dense(1, activation='sigmoid'))
    return model

Then, we can call these functions to create the generator and the discriminator. Here, I arbitrarily set latent_dim to 32 for the generator: you can try other values, if you like.

# Dimensionality of the input noise for the generator
latent_dim = 32

# Build generator and discriminator models
generator = build_generator(latent_dim, fraud_data.shape[1])
discriminator = build_discriminator(fraud_data.shape[1])

At this stage, we need to compile the discriminator, which will later be nested in the main (upper) optimization loop. We compile it with the following settings:

  • the loss function: binary cross-entropy, the generic loss function for a binary classifier
  • the evaluation metrics: precision and recall

# Compile the discriminator model
from keras.metrics import Precision, Recall

discriminator.compile(
    optimizer=Adam(learning_rate=0.0002, beta_1=0.5),
    loss='binary_crossentropy',
    metrics=[Precision(), Recall()])

For the generator, we will compile it when we construct the main (upper) optimization loop.

At this stage, we can define the custom objective function for the generator as follows. Remember, the recommended objective for the generator is to maximize the expected log of the discriminator's output on generated samples:

  E_z[ log D(G(z)) ]

def generator_loss_log_d(y_true, y_pred):
    return -K.mean(K.log(y_pred + K.epsilon()))

Above, the negative sign is required, since the loss function by default is designed to be minimized.
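To see the negative sign at work, here is a small standalone NumPy re-implementation of the loss (named generator_loss_np here to avoid clashing with the Keras version above; this is an illustration, not the training code):

```python
import numpy as np

def generator_loss_np(y_pred, eps=1e-7):
    # NumPy re-implementation of generator_loss_log_d (y_true is unused there)
    return -np.mean(np.log(y_pred + eps))

# The closer D(G(z)) gets to 1 ("real"), the smaller the generator's loss:
# minimizing -log D(G(z)) is the same as maximizing log D(G(z))
losses = [generator_loss_np(np.array([p])) for p in (0.1, 0.5, 0.9)]
print([round(float(l), 4) for l in losses])  # → [2.3026, 0.6931, 0.1054]
```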

Then, we can construct the main (upper) loop, build_gan(generator, discriminator), of the bi-level optimization architecture. In this main loop, we compile the generator implicitly, using its custom objective function, generator_loss_log_d.

As mentioned earlier, we need to freeze the discriminator while we train the generator.

# Build and compile the GANs upper optimization loop combining generator and discriminator
def build_gan(generator, discriminator):
    discriminator.trainable = False
    model = Sequential()
    model.add(generator)
    model.add(discriminator)
    model.compile(optimizer=Adam(learning_rate=0.0002, beta_1=0.5),
                  loss=generator_loss_log_d)
    return model

# Call the upper loop function
gan = build_gan(generator, discriminator)

The last line above calls build_gan() to create gan, which we will use for the batch training below via Keras’ model.train_on_batch() method.

As a reminder, while we train the discriminator, we need to freeze the training of the generator; and while we train the generator, we need to freeze the training of the discriminator.

Here is the batch training code incorporating the alternating training process of these two agents under the bi-level optimization framework.

# Set hyperparameters
epochs = 10000
batch_size = 32

# Training loop for the GANs
for epoch in range(epochs):
    # Train discriminator (freeze generator)
    discriminator.trainable = True
    generator.trainable = False

    # Random sampling from the real fraud data
    real_fraud_samples = fraud_data[np.random.randint(0, num_real_fraud, batch_size)]

    # Generate fake fraud samples using the generator
    noise = np.random.normal(0, 1, size=(batch_size, latent_dim))
    fake_fraud_samples = generator.predict(noise, verbose=0)

    # Create labels for real and fake fraud samples
    real_labels = np.ones((batch_size, 1))
    fake_labels = np.zeros((batch_size, 1))

    # Train the discriminator on real and fake fraud samples
    d_loss_real = discriminator.train_on_batch(real_fraud_samples, real_labels)
    d_loss_fake = discriminator.train_on_batch(fake_fraud_samples, fake_labels)
    d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

    # Train generator (freeze discriminator)
    discriminator.trainable = False
    generator.trainable = True

    # Generate synthetic fraud samples and create labels for training the generator
    noise = np.random.normal(0, 1, size=(batch_size, latent_dim))
    valid_labels = np.ones((batch_size, 1))

    # Train the generator to generate samples that "fool" the discriminator
    g_loss = gan.train_on_batch(noise, valid_labels)

    # Print the progress
    if epoch % 100 == 0:
        print(f"Epoch: {epoch} - D Loss: {d_loss} - G Loss: {g_loss}")

Here, I have a quick question for you.

Below we have an excerpt associated with the generator training from the code above.

Can you explain what this code is doing?

# Generate synthetic fraud samples and create labels for training the generator
noise = np.random.normal(0, 1, size=(batch_size, latent_dim))
valid_labels = np.ones((batch_size, 1))

In the first line, we draw random noise as input; passed through gan, the generator transforms it into synthetic samples. In the second line, valid_labels assigns the label 1 to these synthetic samples.

Why do we label them with 1, which is supposed to be the label for the real data? Isn’t the code counter-intuitive?

Ladies and gentlemen, welcome to the world of counterfeiters.

This is the labeling magic that trains the generator to create samples that can fool the discriminator.
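One way to see the magic: if the stacked model were compiled with the generic binary cross-entropy (as many GAN tutorials do) rather than our custom loss, a target of 1 makes the cross-entropy collapse to -log(y_pred), exactly the quantity that generator_loss_log_d computes (which is also why that custom loss ignores y_true). A standalone numeric sketch, not part of the training code:

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    # Generic binary cross-entropy for a single prediction
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# With the target fixed at 1, BCE reduces to -log(y_pred): the generator is
# rewarded whenever the discriminator scores its fakes close to 1 ("real")
for p in (0.1, 0.5, 0.9):
    assert np.isclose(bce(1.0, p), -np.log(p), atol=1e-5)
```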

Now, let’s use the trained generator to create the synthetic data for the minority fraud class.

# After training, use the generator to create synthetic fraud data
noise = np.random.normal(0, 1, size=(num_synthetic_samples, latent_dim))
synthetic_fraud_data = generator.predict(noise, verbose=0)

# Convert the result to a Pandas DataFrame and label it as fraud (Class = 1)
fake_df = pd.DataFrame(synthetic_fraud_data, columns=X_train.columns)
fake_df['Class'] = 1

Finally, the synthetic data is created.
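Before using the synthetic data downstream, a quick sanity check against the real fraud samples can be helpful. Here is a minimal sketch, using random placeholder arrays in place of fraud_data and synthetic_fraud_data (the comparison logic, not the numbers, is the point):

```python
import numpy as np
import pandas as pd

# Placeholder arrays standing in for the real and synthetic fraud features
rng = np.random.default_rng(42)
real = pd.DataFrame(rng.normal(0.0, 1.0, size=(378, 6)))
fake = pd.DataFrame(rng.normal(0.1, 1.1, size=(1000, 6)))

# Compare per-feature means and standard deviations side by side
comparison = pd.DataFrame({
    'real_mean': real.mean(), 'fake_mean': fake.mean(),
    'real_std': real.std(), 'fake_std': fake.std(),
})
print(comparison.round(3))
```

Large gaps between the real and synthetic columns would suggest the generator has not learned the minority-class distribution well.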

In the next section, we combine this synthetic fraud data with the original train dataset to make the entire train dataset perfectly balanced. The hope is that the perfectly balanced train dataset will improve the performance of the fraud detection classification model.

To reiterate, GANs are used in this project exclusively for data augmentation, not for classification.

First of all, we need a benchmark model as the basis of comparison, so that we can evaluate the improvement that GANs-based data augmentation makes to the performance of the fraud detection model.

As the binary classifier, I selected an ensemble method for building the fraud detection model. For the benchmark scenario, I developed a fraud detection model with the original imbalanced dataset alone, i.e. without data augmentation. Then, for the second scenario with data augmentation by GANs, I trained the same algorithm with the perfectly balanced train dataset, which contains the synthetic fraud data created by GANs.

  • Benchmark Scenario: Ensemble Classifier without data augmentation
  • GANs Scenario: Ensemble Classifier with data augmentation by GANs

Benchmark Scenario: Ensemble without data augmentation

Next, let’s define the benchmark scenario (without data augmentation). I selected an ensemble classifier: a voting classifier as the meta learner with the following 3 base learners.

  • Gradient Boosting
  • Decision Tree
  • Random Forest

Since the original dataset is highly imbalanced, rather than accuracy I select the evaluation metrics from the following 3 options: precision, recall, and F1-score.
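To illustrate why accuracy is misleading here, consider a standalone toy example (not the project’s data): a useless classifier that predicts “non-fraud” for every row still scores 99% accuracy on a 1%-fraud dataset, while catching zero fraud.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy imbalanced labels: 990 non-fraud, 10 fraud, scored by a classifier
# that predicts "non-fraud" (0) for everything
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)

print("Accuracy:", accuracy_score(y_true, y_pred))                 # 0.99
print("Recall:  ", recall_score(y_true, y_pred, zero_division=0))  # 0.0
print("F1:      ", f1_score(y_true, y_pred, zero_division=0))      # 0.0
```

Recall and F1-score expose the failure that accuracy hides, which is why they are the right metrics for this problem.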

The following custom function, ensemble_training(X_train, y_train), defines the training and validation process.

def ensemble_training(X_train, y_train):
    # Initialize base learners
    gradient_boosting = GradientBoostingClassifier(random_state=42)
    decision_tree = DecisionTreeClassifier(random_state=42)
    random_forest = RandomForestClassifier(random_state=42)

    # Define the base models
    base_models = {
        'RandomForest': random_forest,
        'DecisionTree': decision_tree,
        'GradientBoosting': gradient_boosting
    }

    # Initialize the meta learner
    meta_learner = VotingClassifier(
        estimators=[(name, model) for name, model in base_models.items()],
        voting='soft')

    # Lists to store training and validation metrics
    train_f1_scores = []
    val_f1_scores = []

    # Split the train set further into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.25, random_state=42, stratify=y_train)

    # Training and validation
    for model_name, model in base_models.items():
        model.fit(X_train, y_train)

        # Training metrics
        train_predictions = model.predict(X_train)
        train_f1 = f1_score(y_train, train_predictions)
        train_f1_scores.append(train_f1)

        # Validation metrics using the validation set
        val_predictions = model.predict(X_val)
        val_f1 = f1_score(y_val, val_predictions)
        val_f1_scores.append(val_f1)

    # Train the meta learner on the entire training set
    meta_learner.fit(X_train, y_train)

    return meta_learner, train_f1_scores, val_f1_scores, base_models

The next function block, ensemble_evaluations(meta_learner, X_train, y_train, X_test, y_test), calculates the performance evaluation metrics at the meta learner level.

def ensemble_evaluations(meta_learner, X_train, y_train, X_test, y_test):
    # Predictions of the ensemble model on both the train and test datasets
    ensemble_train_predictions = meta_learner.predict(X_train)
    ensemble_test_predictions = meta_learner.predict(X_test)

    # Calculate F1-scores for the ensemble model
    ensemble_train_f1 = f1_score(y_train, ensemble_train_predictions)
    ensemble_test_f1 = f1_score(y_test, ensemble_test_predictions)

    # Calculate precision and recall for both train and test datasets
    precision_train = precision_score(y_train, ensemble_train_predictions)
    recall_train = recall_score(y_train, ensemble_train_predictions)
    precision_test = precision_score(y_test, ensemble_test_predictions)
    recall_test = recall_score(y_test, ensemble_test_predictions)

    # Output precision, recall, and F1-score for both train and test datasets
    print("Ensemble Model Metrics:")
    print(f"Training Precision: {precision_train:.4f}, Recall: {recall_train:.4f}, F1-score: {ensemble_train_f1:.4f}")
    print(f"Test Precision: {precision_test:.4f}, Recall: {recall_test:.4f}, F1-score: {ensemble_test_f1:.4f}")

    return (ensemble_train_predictions, ensemble_test_predictions,
            ensemble_train_f1, ensemble_test_f1,
            precision_train, recall_train, precision_test, recall_test)

Below, let’s look at the performance of the benchmark Ensemble Classifier.

Training Precision: 0.9811, Recall: 0.9603, F1-score: 0.9706
Test Precision: 0.9351, Recall: 0.7579, F1-score: 0.8372

At the meta-learner level, the benchmark model generated F1-Score at a reasonable level of 0.8372.

Next, let’s move on to the scenario with data augmentation using GANs. We want to see if it can outperform the benchmark scenario.

GANs Scenario: Fraud Detection with data augmentation by GANs

Now we can construct a perfectly balanced dataset by combining the original imbalanced train dataset (both non-fraud and fraud cases), train_df, with the synthetic fraud dataset generated by GANs, fake_df. We preserve the test dataset in its original form by not involving it in this process.

wdf = pd.concat([train_df, fake_df], axis=0)

We will train the same ensemble method with the mixed balanced dataset to see if it will outperform the benchmark model.
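One practical note, a judgment call rather than part of the original recipe: pd.concat appends all the synthetic rows at the end of the frame, so shuffling the combined dataset before training can be worthwhile. A standalone sketch with tiny placeholder frames standing in for train_df and fake_df:

```python
import pandas as pd

# Tiny placeholder frames (stand-ins for train_df and fake_df)
demo_train = pd.DataFrame({'V1': [0.1, 0.2, 0.3], 'Class': [0, 0, 1]})
demo_fake = pd.DataFrame({'V1': [0.4, 0.5], 'Class': [1, 1]})

# Concatenate, then shuffle the rows and reset the index
demo_wdf = pd.concat([demo_train, demo_fake], axis=0)
demo_wdf = demo_wdf.sample(frac=1, random_state=42).reset_index(drop=True)
print(demo_wdf['Class'].value_counts().to_dict())  # → {1: 3, 0: 2}
```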

Now, we need to split the mixed balanced dataset into the features and the label.

X_mixed = wdf[wdf.columns.drop("Class")]
y_mixed = wdf["Class"]

Remember, when I ran the benchmark scenario earlier, I already defined the custom functions needed to train and evaluate the ensemble classifier. I can reuse those functions here to train the same ensemble algorithm with the combined balanced data.

We can pass the features and the label (X_mixed, y_mixed) into the custom Ensemble Classifier function ensemble_training().

meta_learner_GANs, train_f1_scores_GANs, val_f1_scores_GANs, base_models_GANs = ensemble_training(X_mixed, y_mixed)

Finally, we can evaluate the model with the test dataset.

ensemble_evaluations(meta_learner_GANs, X_mixed, y_mixed, X_test, y_test)

Here is the result.

Ensemble Model Metrics:
Training Precision: 1.0000, Recall: 0.9999, F1-score: 0.9999
Test Precision: 0.9714, Recall: 0.7158, F1-score: 0.8242