Abstract
Credit card fraud datasets are extremely imbalanced, with positive rates below 0.2%. Standard neural networks trained with weighted binary cross-entropy often achieve high ROC-AUC but struggle to identify suspicious transactions under threshold-sensitive metrics. I propose a Hybrid Neuro-Symbolic (HNS) approach that incorporates domain knowledge directly into the training objective as a differentiable rule loss — encouraging the model to assign high fraud probability to transactions with unusually large amounts and atypical PCA signatures. On the Kaggle Credit Card Fraud dataset, the hybrid achieves ROC-AUC of 0.970 ± 0.005 across 5 random seeds, compared to 0.967 ± 0.003 for the pure neural baseline under symmetric evaluation. A key practical finding: on imbalanced data, threshold selection strategy affects F1 as much as model architecture — both models must be evaluated with the same approach for any comparison to be meaningful. Code and reproducibility materials are available on GitHub [5].
The Problem: When ROC-AUC Lies
I had a fraud dataset at 0.17% positive rate. Trained a weighted BCE network, got ROC-AUC of 0.96, someone said “nice”. Then I pulled up the score distributions and threshold-dependent metrics. The model had quietly figured out that predicting “not fraud” on anything ambiguous was the path of least resistance — and nothing in the loss function disagreed with that decision.
What bothered me wasn’t the math. It was that the model had no idea what fraud looks like. A junior analyst on day one could tell you: large transactions are suspicious, transactions with unusual PCA signatures are suspicious, and when both happen together, you should definitely be paying attention. That knowledge just… never makes it into the training loop.

So I ran an experiment. What if I encoded that analyst intuition as a soft constraint directly in the loss function — something the network has to satisfy while also fitting the labels? The result was a Hybrid Neuro-Symbolic (HNS) setup. This article walks through the full experiment: the model, the rule loss, the lambda sweep, and — critically — what a proper multi-seed variance analysis with symmetric threshold evaluation actually shows.
The Setup
I used the Kaggle Credit Card Fraud dataset — 284,807 transactions, 492 of which are fraud (0.172%). The V1–V28 features are PCA components from an anonymized original feature space. Amount and Time are raw. The severe imbalance is the whole point; this is where standard approaches start to struggle [1].
Split was 70/15/15 train/val/test, stratified. I trained four things and compared them head-to-head:
- Isolation Forest — contamination=0.001, fits on the full training set
- One-Class SVM — nu=0.001, fits only on the non-fraud training samples
- Pure Neural — three-layer MLP with BCE + class weighting, no domain knowledge
- Hybrid Neuro-Symbolic — the same MLP, with a differentiable rule penalty added to the loss
Isolation Forest and One-Class SVM serve as a gut-check. If a supervised network with 199k training samples cannot clear the bar set by an unsupervised method, that is worth knowing before you write up results. A tuned gradient boosting model would likely outperform both neural approaches; this comparison is intended to isolate the effect of the rule loss, not benchmark against all possible methods. Full code for all four is on GitHub.
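Both unsupervised baselines are a few lines with scikit-learn. A minimal sketch, assuming features are already scaled; the helper names are mine, and any hyperparameter beyond contamination and nu is a default, not something tuned in the experiment:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

def fit_baselines(X_train, y_train, random_state=42):
    # Isolation Forest sees the full training set, frauds included
    iso = IsolationForest(contamination=0.001, random_state=random_state)
    iso.fit(X_train)

    # One-Class SVM models "normal" behavior from non-fraud samples only
    ocsvm = OneClassSVM(nu=0.001, kernel="rbf", gamma="scale")
    ocsvm.fit(X_train[y_train == 0])
    return iso, ocsvm

def anomaly_scores(model, X):
    # score_samples is higher for normal points; negate so that
    # higher = more anomalous, rankable like a fraud probability
    return -model.score_samples(X)
```

Because both models expose `score_samples`, their negated scores can be fed into the same PR-AUC/ROC-AUC evaluation as the neural models.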
The Model
Nothing exotic. A three-layer MLP with batch normalization after each hidden layer. The batch norm matters more than you might expect — under heavy class imbalance, activations can drift badly without it [3].
```python
class MLP(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.BatchNorm1d(128),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.BatchNorm1d(64),
            nn.Linear(64, 1)
        )

    def forward(self, x):
        return self.net(x)
```
For the loss, BCEWithLogitsLoss with pos_weight — computed as the ratio of non-fraud to fraud counts in the training set. On this dataset that is 577 [4]. A single fraud sample in a batch generates 577 times the gradient of a non-fraud one.
pos_weight = count(y=0) / count(y=1) ≈ 577
That weight provides a directional signal when labeled fraud does appear. But the model still has no concept of what “suspicious” looks like in feature space — it only knows that fraud examples, when they do show up, should be heavily weighted. That is different from knowing where to look on batches that happen to contain no labeled fraud at all.
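Wiring the weight into the criterion takes two lines. A sketch, assuming y_train is a float tensor of 0/1 labels (the helper name is mine):

```python
import torch
import torch.nn as nn

def make_criterion(y_train: torch.Tensor) -> nn.BCEWithLogitsLoss:
    # pos_weight = (# non-fraud) / (# fraud); ~577 on this dataset
    n_pos = y_train.sum()
    n_neg = y_train.numel() - n_pos
    pos_weight = n_neg / n_pos.clamp(min=1)
    return nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```

BCEWithLogitsLoss applies pos_weight only to the positive term of the loss, which is exactly the "577× gradient per fraud sample" behavior described above.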
The Rule Loss
Here is the core idea. Fraud analysts know two things empirically: unusually high transaction amounts are suspicious, and transactions that sit far from normal behavior in PCA space are suspicious. I want the model to assign high fraud probabilities to transactions that match both signals — even when a batch contains no labeled fraud examples.
The trick is making the rule differentiable. An if/else threshold — flag any transaction where amount > 1000 — is a hard step function. Its gradient is zero everywhere except at the threshold itself, where it is undefined. That means backpropagation has nothing to work with; the rule produces no useful gradient signal and the optimizer ignores it. Instead, I use a steep sigmoid centered at the batch mean. It approximates the same threshold behavior but stays smooth and differentiable everywhere — the gradient is small far from the boundary and peaks near it, which is exactly where you want the optimizer paying attention. The result is a smooth suspicion score between 0 and 1:
```python
def rule_loss(x, probs):
    # x[:, -1]   = Amount (last column in creditcard.csv after dropping Class)
    # x[:, 1:29] = V1–V28 (PCA components, columns 1–28)
    amount = x[:, -1]
    pca_norm = torch.norm(x[:, 1:29], dim=1)
    suspicious = (
        torch.sigmoid(5 * (amount - amount.mean())) +
        torch.sigmoid(5 * (pca_norm - pca_norm.mean()))
    ) / 2.0
    penalty = suspicious * torch.relu(0.6 - probs.squeeze())
    return penalty.mean()
```
A note on why PCA norm specifically: the V1–V28 features are the result of a PCA transform applied to the original anonymized transaction data. A transaction that sits far from the origin in this compressed space has unusual variance across multiple original features simultaneously — it is an outlier in the latent representation. The Euclidean norm of the PCA vector captures that distance in a single scalar. This is not a Kaggle-specific trick. On any dataset where PCA components represent normal behavioral variance, the norm of those components is a reasonable proxy for atypicality. If your features are not PCA-transformed, you would replace this with a domain-appropriate signal — Mahalanobis distance, isolation score, or a feature-specific z-score.
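For non-PCA features, a Mahalanobis-style distance plays the same role. A hypothetical drop-in replacement for the norm term, assuming mean and covariance are estimated on non-fraud training rows only — the function names and the regularization constant are mine, not from the repo:

```python
import torch

def fit_mahalanobis(X_normal: torch.Tensor, eps: float = 1e-6):
    # Estimate mean and (regularized) inverse covariance on normal traffic
    mu = X_normal.mean(dim=0)
    centered = X_normal - mu
    cov = centered.T @ centered / (X_normal.shape[0] - 1)
    cov_inv = torch.linalg.inv(cov + eps * torch.eye(cov.shape[0]))
    return mu, cov_inv

def mahalanobis(x: torch.Tensor, mu: torch.Tensor, cov_inv: torch.Tensor):
    # Row-wise sqrt of d^T Σ⁻¹ d; differentiable, so it can replace pca_norm
    d = x - mu
    return torch.sqrt(torch.clamp((d @ cov_inv * d).sum(dim=1), min=0.0))
```

The output would slot into the same steep sigmoid as pca_norm does in rule_loss; the stats should be fit once and reused, not recomputed per batch.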
The relu(0.6 – probs) term is the constraint: it fires only when the model’s predicted fraud probability is below 0.6 for a suspicious transaction. If the model is already confident (prob > 0.6), the penalty is zero. This is intentional — I am not penalizing the model for being too aggressive on suspicious transactions, only for being too conservative. The asymmetry means the rule can never fight against a correct high-confidence prediction.
Formally, the combined objective is:
L_total = L_BCE + λ · L_rule
L_rule = E[ σ_susp(x) · ReLU(0.6 − p) ]
σ_susp(x) = ½ · [ σ(5·(amount − ā)) + σ(5·(‖V₁₋₂₈‖ − mean‖V‖)) ]
The λ hyperparameter controls how hard the rule pushes. At λ=0 you get the pure neural baseline. The full training loop:
```python
for xb, yb in train_loader:
    xb, yb = xb.to(DEVICE), yb.to(DEVICE)
    logits = model(xb)
    bce = criterion(logits.squeeze(), yb)
    probs = torch.sigmoid(logits)
    rl = rule_loss(xb, probs)
    loss = bce + lambda_rule * rl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
Tuning Lambda
Five values tested: 0.0, 0.1, 0.5, 1.0, 2.0. Each model trained to best validation PR-AUC with early stopping at patience=7, seed=42:
Lambda 0.0 → Val PR-AUC: 0.7580
Lambda 0.1 → Val PR-AUC: 0.7595
Lambda 0.5 → Val PR-AUC: 0.7620 ← best
Lambda 1.0 → Val PR-AUC: 0.7452
Lambda 2.0 → Val PR-AUC: 0.7504
Best Lambda: 0.5
λ=0.5 wins narrowly on validation PR-AUC. The gap between λ=0.0, 0.1, and 0.5 is small — within the range of seed variance as the multi-seed analysis below shows. The meaningful drop at λ=1.0 and 2.0 suggests that aggressive rule weighting can override the BCE signal rather than complement it. On new data, treat λ=0 as the default and verify any improvement holds across seeds before trusting it.
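The sweep itself is a small loop around the training run. A condensed sketch, assuming a `train_model(lambda_rule, seed)` helper (my name, not from the repo) that trains with early stopping and returns the best validation PR-AUC:

```python
def sweep_lambda(train_model, lambdas=(0.0, 0.1, 0.5, 1.0, 2.0), seed=42):
    # train_model(lambda_rule, seed) -> best validation PR-AUC
    results = {lam: train_model(lam, seed) for lam in lambdas}
    best_lam = max(results, key=results.get)
    return best_lam, results
```

Selecting λ on validation PR-AUC (never test) keeps the later test-set comparison clean.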
One thing to be careful about with threshold selection: I computed the optimal F1 threshold on the validation set and applied it to the test set — for both models symmetrically. On a 0.17% positive-rate dataset, the optimal decision boundary is nowhere near 0.5. Applying different thresholding strategies to different models means measuring the threshold gap, not the model gap. Both must use the same approach:
```python
def find_best_threshold(y_true, probs):
    precision, recall, thresholds = precision_recall_curve(y_true, probs)
    # thresholds has one fewer entry than precision/recall — align before argmax
    f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-8)
    best = np.argmax(f1_scores)
    return thresholds[best], f1_scores[best]

# Applied symmetrically to BOTH models — val set only
hybrid_thresh, _ = find_best_threshold(y_val, hybrid_val_probs)
pure_thresh, _ = find_best_threshold(y_val, pure_val_probs)
```
Results
| Model | F1 | PR-AUC | ROC-AUC | Recall@1%FPR |
| --- | --- | --- | --- | --- |
| Isolation Forest | 0.121 | 0.172 | 0.941 | 0.581 |
| One-Class SVM | 0.029 | 0.391 | 0.930 | 0.797 |
| Pure Neural (λ=0) | 0.776 | 0.806 | 0.969 | 0.878 |
| Hybrid (λ=0.5) | 0.767 | 0.745 | 0.970 | 0.878 |
On this seed, the hybrid and pure baseline are competitive on F1 (0.767 vs 0.776) and identical on Recall@1%FPR. The hybrid’s PR-AUC is lower on this particular seed (0.745 vs 0.806). The cleanest signal is ROC-AUC — 0.970 for the hybrid vs 0.969 for the pure baseline. ROC-AUC is threshold-independent, measuring ranking quality across all possible cutoffs. That edge is where the rule loss shows up most consistently.
Precision-Recall Curve
Strong early precision is what you want in a fraud system. The curve holds reasonably before dropping — meaning the model’s top-ranked transactions are genuinely fraud-heavy, not just a lucky threshold. In production you would tune the threshold to your actual cost ratio: the cost of a missed fraud versus the cost of a false alarm. The val-optimized F1 threshold used here is a reasonable middle ground for reporting, not the only valid choice.
Confusion Matrix

Score Distributions

This histogram is what I look at first after training any classifier on imbalanced data. The non-fraud distribution should spike near zero; the fraud distribution should spread toward 1. The overlap region in the middle is where the model is genuinely uncertain — that is where your threshold lives.
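Quantifying that overlap takes a few lines of numpy. A sketch that reports how much of each class's probability mass lands in the ambiguous middle band — the band edges (0.2–0.8) are my choice, not from the repo:

```python
import numpy as np

def score_overlap(probs, y, low=0.2, high=0.8):
    # Fraction of each class's predictions in the ambiguous middle band;
    # a healthy model keeps mid_neg near 0 and mid_pos small
    mid_neg = np.mean((probs[y == 0] >= low) & (probs[y == 0] <= high))
    mid_pos = np.mean((probs[y == 1] >= low) & (probs[y == 1] <= high))
    return mid_neg, mid_pos
```

Tracking these two numbers across epochs catches a collapsing classifier earlier than ROC-AUC does.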
Variance Analysis — 5 Random Seeds
A single-seed result on a dataset this imbalanced is not enough to trust. I ran both models across seeds [42, 0, 7, 123, 2024], applying val-optimized thresholds symmetrically to both in every run:
Seed 42 | Hybrid F1: 0.767 PR-AUC: 0.745 | Pure F1: 0.776 PR-AUC: 0.806
Seed 0 | Hybrid F1: 0.733 PR-AUC: 0.636 | Pure F1: 0.788 PR-AUC: 0.743
Seed 7 | Hybrid F1: 0.809 PR-AUC: 0.817 | Pure F1: 0.767 PR-AUC: 0.755
Seed 123 | Hybrid F1: 0.797 PR-AUC: 0.756 | Pure F1: 0.757 PR-AUC: 0.731
Seed 2024 | Hybrid F1: 0.764 PR-AUC: 0.745 | Pure F1: 0.826 PR-AUC: 0.763
| Model | F1 (mean ± std) | PR-AUC (mean ± std) | ROC-AUC (mean ± std) |
| --- | --- | --- | --- |
| Pure Neural | 0.783 ± 0.024 | 0.760 ± 0.026 | 0.967 ± 0.003 |
| Hybrid (λ=0.5) | 0.774 ± 0.027 | 0.740 ± 0.058 | 0.970 ± 0.005 |

Three observations from the variance data. The hybrid wins on F1 in 2 of 5 seeds; the pure baseline wins in 3 of 5. Neither dominates on threshold-dependent metrics. The hybrid’s PR-AUC variance is notably higher (±0.058 vs ±0.026), meaning the rule loss makes some initializations better and some worse — it is a sensitivity, not a guaranteed improvement. The one result that holds without exception: ROC-AUC is higher for the hybrid across all 5 seeds. That is the cleanest signal from this experiment.
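The mean ± std rows come straight from the per-seed numbers above. A sketch reproducing the F1 aggregation with numpy (population std, ddof=0, matches the reported values):

```python
import numpy as np

# Per-seed F1 for seeds [42, 0, 7, 123, 2024], copied from the runs above
hybrid_f1 = np.array([0.767, 0.733, 0.809, 0.797, 0.764])
pure_f1 = np.array([0.776, 0.788, 0.767, 0.757, 0.826])

def agg(x):
    # mean ± population std, rounded to 3 decimals as reported
    return round(float(x.mean()), 3), round(float(x.std()), 3)
```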
Why Does the Rule Loss Help ROC-AUC?
ROC-AUC is threshold-independent — it measures how well the model ranks fraud above non-fraud across all possible cutoffs. A consistent improvement across 5 seeds is a real signal. Here is what I think is happening.
With 0.172% fraud prevalence, most 2048-sample batches contain only 3–4 labeled fraud examples. The BCE loss receives almost no fraud-relevant gradient on the majority of batches. The rule loss fires on every suspicious transaction regardless of label — it generates gradient signals on batches that would otherwise tell the optimizer almost nothing about fraud. This gives the model consistent direction throughout training, not just on the rare batches where labeled fraud happens to appear.
The penalty is also feature-selective. By pointing the model specifically toward amount and PCA norm, the rule reduces the chance that the model latches onto irrelevant correlations in the other 28 dimensions. It functions as soft regularization over the feature space, not just the output space.
The one-sided relu matters too. I am not penalizing the model for being too aggressive on suspicious transactions — only for being too conservative. The rule cannot fight against a correct high-confidence prediction, only push up underconfident ones. That asymmetry is deliberate.
The lesson is not that rules replace learning. It is that rules can guide it — especially when labeled examples are scarce and you already know something about what you are looking for.
On Threshold Evaluation in Imbalanced Classification
One finding from this experiment is worth its own section because it applies to any imbalanced classification problem, not just fraud.
On a dataset with 0.17% positive rate, the optimal F1 threshold is nowhere near 0.5. A model can rank fraud almost perfectly and still score poorly on F1 at a default threshold, simply because the decision boundary needs to be calibrated to the class imbalance. This means that if two models are evaluated with different thresholding strategies — one at a fixed cutoff, the other with a val-optimized cutoff — you are not comparing models. You are measuring the threshold gap.
The practical checklist for clean comparison on imbalanced data:
- Both models evaluated with the same thresholding strategy
- Threshold selected on validation data, never on test data
- PR-AUC and ROC-AUC reported alongside F1 — both are threshold-independent
- Variance across multiple seeds to separate real differences from lucky initialization
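That checklist collapses into a small harness. A sketch, assuming each model exposes validation and test probabilities — the function name is mine:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

def evaluate_symmetric(y_val, val_probs, y_test, test_probs):
    # Pick the F1-optimal threshold on validation only, then score test once
    p, r, t = precision_recall_curve(y_val, val_probs)
    f1 = 2 * p[:-1] * r[:-1] / (p[:-1] + r[:-1] + 1e-8)
    thresh = t[np.argmax(f1)]
    test_f1 = f1_score(y_test, (test_probs >= thresh).astype(int))
    return test_f1, thresh
```

Running every model through the same function is what makes the comparison a model comparison rather than a threshold comparison.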
Things to Watch Out For
Batch-relative statistics. The rule computes “high amount” and “high PCA norm” relative to the batch mean, not a fixed population statistic. During training with large batches (2048) and stratified sampling, batch means are stable enough. In online inference scoring individual transactions, freeze those statistics to training-set values. Otherwise the “suspicious” boundary shifts with every call.
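One way to freeze those statistics: compute the means once on the training set and pass them into the rule at scoring time. A sketch — this frozen-stats variant is not in the original rule_loss, and the names are mine:

```python
import torch

def fit_rule_stats(X_train: torch.Tensor):
    # Freeze the "suspicious" boundary to training-set population means
    amount_mean = X_train[:, -1].mean()
    pca_norm_mean = torch.norm(X_train[:, 1:29], dim=1).mean()
    return amount_mean, pca_norm_mean

def suspicion_score(x, amount_mean, pca_norm_mean):
    # Same score as training, but stable for single-transaction scoring
    amount = x[:, -1]
    pca_norm = torch.norm(x[:, 1:29], dim=1)
    return (torch.sigmoid(5 * (amount - amount_mean)) +
            torch.sigmoid(5 * (pca_norm - pca_norm_mean))) / 2.0
```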
PR-AUC variance increases with the rule loss. Hybrid PR-AUC ranges from 0.636 to 0.817 across seeds versus 0.731 to 0.806 for the pure baseline. A rule that helps on some initializations and hurts on others requires multi-seed validation before drawing conclusions. Single-seed results are not enough.
High λ degrades performance. λ=1.0 and 2.0 show a meaningful drop in validation PR-AUC. Aggressive rule weighting can override the BCE signal rather than complement it. Start at λ=0.5 and verify on your own data before going higher.
A natural extension would make the rule weights learnable rather than fixed at 0.5/0.5:
```python
# Learnable combination weights
self.rule_w = nn.Parameter(torch.tensor([0.5, 0.5]))

w = torch.softmax(self.rule_w, dim=0)
suspicious = (
    w[0] * torch.sigmoid(5 * (amount - amount.mean())) +
    w[1] * torch.sigmoid(5 * (pca_norm - pca_norm.mean()))
)
```
This lets the model decide whether amount or PCA norm is more predictive for the specific data, rather than hard-coding equal weights. This variant has not been run yet — it is the next thing on the list.
Closing Thoughts
The rule loss does something real — the ROC-AUC improvement is consistent and threshold-independent across all 5 seeds. The improvement on threshold-dependent metrics like F1 and PR-AUC is within noise range and depends on initialization. The honest summary: domain rules injected into the loss function can improve a model’s underlying score distributions on rare-event data, but the magnitude depends heavily on how you measure it and how stable the improvement is across seeds.
If you work in fraud detection, anomaly detection, or any domain where labeled positives are rare and domain knowledge is rich, this pattern is worth experimenting with. The implementation is simple — a handful of lines on top of a standard training loop. The more important discipline is measurement: use symmetric threshold evaluation, report threshold-independent metrics, and always run multiple seeds before trusting a result.
The repo has the full training loop, lambda sweep, variance analysis, and eval code. Download the CSV from Kaggle, drop it in the same directory, run app.py. The numbers above should reproduce — if they do not on your machine, open an issue and I will take a look.
References
[1] A. Dal Pozzolo, O. Caelen, R. A. Johnson and G. Bontempi, Calibrating Probability with Undersampling for Unbalanced Classification (2015), IEEE SSCI. https://dalpozz.github.io/static/pdf/SSCI_calib_final_noCC.pdf
[2] ULB Machine Learning Group, Credit Card Fraud Detection Dataset (Kaggle). https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud (Open Database license)
[3] S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015), arXiv:1502.03167. https://arxiv.org/abs/1502.03167
[4] PyTorch Documentation — BCEWithLogitsLoss. https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html
[5] Experiment code and reproducibility materials. https://github.com/Emmimal/neuro-symbolic-fraud-pytorch/
Disclosure
This article is based on independent experiments using publicly available data (Kaggle Credit Card Fraud dataset) and open-source tools (PyTorch). No proprietary datasets, company resources, or confidential information were used. The results and code are fully reproducible as described, and the GitHub repository contains the complete implementation. The views and conclusions expressed here are my own and do not represent any employer or organization.