looked solid. The KL divergence was well within acceptable ranges. On the Train on Synthetic, Test on Real (TSTR) test, a model trained on the synthetic data and tested on the real data achieved 91% accuracy, only slightly below the 93% obtained with the real data, a difference well within the tolerances the team had established. Membership inference risk was also low. The synthetic data set was certified as safe for machine learning model training; the real data was stored securely; and the model was trained.
However, three months later, the fraud-detection model was missing whole classes of transactions it had previously caught reliably: not just degrading in performance, but failing outright. An entire set of edge-case behaviors had effectively been removed from the model's reality.
Upon investigating the issue, the team could find no technical errors with the synthetic data. All of the metrics that the team ran continued to pass.
But the problem was that none of those metrics were actually measuring what truly mattered.
The Three-Metric Framework and Why It Misleads Practitioners
The fidelity-utility-privacy triangle has become the standard lexicon for evaluating synthetic data quality, and for good reason. It captures the three properties you actually want: does the synthetic data resemble the real data (fidelity)? Does it train models that behave like models trained on real data (utility)? Does it protect the identity of the individuals the data came from (privacy)?
The framework itself is sound. However, the execution of this framework is where issues arise.
Most practitioners evaluate the three dimensions sequentially, treating a pass on each one as sufficient for deployment. This approach is flawed for three interrelated reasons:
Problem #1: Fidelity Metrics Evaluate Marginal Distributions, Not Interactions Between Features
The most frequently used fidelity metrics (KL divergence, the Kolmogorov-Smirnov test, total variation distance, Wasserstein distance) all measure how closely each individual feature's distribution in the synthetic data set matches the original.
None of these measures assess how features correlate with each other.
This is a subtle yet critical distinction. Consider a healthcare data set where the synthetic version accurately reproduces the marginal distributions of patient age and illness severity: the marginals look virtually indistinguishable. But the correlation between the two features drifts in the synthetic data. A model trained on it then learns the right individual signals but the wrong interaction between them.
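The distinction is easy to demonstrate. The sketch below uses made-up data (it is not from any study): two bivariate samples with identical standard-normal marginals but different correlations. Per-feature KS tests pass comfortably while the pairwise correlation drifts by half.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
n = 10_000

# "Real" data: two features (think age and severity) strongly correlated.
real = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=n)

# "Synthetic" data: identical N(0, 1) marginals, but weaker correlation.
synth = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=n)

# Per-feature KS tests see nothing wrong: the marginals match...
for j, name in enumerate(["age", "severity"]):
    result = ks_2samp(real[:, j], synth[:, j])
    print(f"{name}: KS statistic = {result.statistic:.4f}")

# ...while the pairwise correlation has drifted badly.
print(f"real corr:  {np.corrcoef(real.T)[0, 1]:.2f}")   # ~0.80
print(f"synth corr: {np.corrcoef(synth.T)[0, 1]:.2f}")  # ~0.30
```

Marginal tests answer a per-column question; correlation drift is a per-pair question, and no amount of the former detects the latter.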
In 2025, a peer-reviewed study on synthetic patient data evaluated five generative models on three clinical data sets. Although the marginal distributions were almost always very similar, the correlation scores differed by 20 points or more. The downstream effects were dramatic: on one data set, models trained on synthetic data yielded area under curve (AUC) values of approximately 0.80, versus approximately 0.88 when trained on the real data. What separated the two was correlation preservation, not marginal fidelity.
Fix: Run KS and KL tests as a baseline to confirm the marginals match. Then always compare the correlation matrices. Compute the Frobenius norm of their difference to get a single number representing how much correlation structure has been lost. Set your threshold for that number before synthesizing the data, not after.
```python
import numpy as np
import pandas as pd

def correlation_drift_score(real_df: pd.DataFrame, synthetic_df: pd.DataFrame) -> float:
    """
    Computes the Frobenius norm of the difference between
    real and synthetic correlation matrices.

    Lower is better. A score above 0.5 warrants investigation.
    """
    real_corr = real_df.corr().fillna(0).values
    synth_corr = synthetic_df.corr().fillna(0).values
    return np.linalg.norm(real_corr - synth_corr, 'fro')

score = correlation_drift_score(real_df, synthetic_df)
print(f"Correlation Drift Score: {score:.4f}")
```
One number. Run it every time. If it’s above your threshold, go back to the generator before you do anything else.
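To make the "go back to the generator" rule mechanical, the drift score can gate a pipeline step. A minimal sketch; the function name and the 0.5 default are illustrative, not from any library:

```python
CORRELATION_DRIFT_THRESHOLD = 0.5  # pre-registered before synthesis (illustrative)

def gate_correlation_drift(score: float,
                           threshold: float = CORRELATION_DRIFT_THRESHOLD) -> None:
    """Fail fast in a CI or pipeline step when correlation structure is lost."""
    if score > threshold:
        raise ValueError(
            f"Correlation drift {score:.4f} exceeds threshold {threshold}; "
            "regenerate the synthetic data before running utility or privacy checks."
        )

gate_correlation_drift(0.31)  # within threshold: no exception
```

Raising an exception rather than logging a warning ensures the utility and privacy checks never run on data that has already failed fidelity.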
Problem #2: TSTR Utility Scores Conceal Tail Behavior Because They Report Average Performance
Train on Synthetic, Test on Real is one of the “gold standard” utility metrics, and it deserves its reputation. A model trained on synthetic data that performs well on real data is meaningful evidence of utility.
However, TSTR scores are averages, and averages hide exactly what will break in production. In the fraud-detection example that opened this article, the overall TSTR score was 91%, but when performance was broken down by transaction-volume decile, the worst-performing decile (the rarest, highest-value transactions) fell to 67%. The generator reproduced common transactions very accurately but did not represent the rarest, most unusual scenarios. As a result, the synthetic-trained model learned the most common behaviors with high accuracy and the least common behaviors with almost none.
This is the tail loss problem. It is formally described in the model collapse literature (Alemohammad et al., 2024, ICLR) and applies to any synthetic data generation process: generative models optimized to reproduce the high-probability regions of a distribution progressively underrepresent rare events. The generator is not trying to drop rare events; that is simply the mathematics of how these models learn.
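The mechanism is easy to reproduce in a few lines. In this sketch (illustrative numbers, not from the paper), a Gaussian fitted to heavy-tailed data stands in for a generator that captures the bulk of the distribution; events beyond the real 99.9th percentile all but vanish from the "synthetic" sample:

```python
import numpy as np

rng = np.random.default_rng(1)

# Heavy-tailed "real" data (Student-t, df=3). The Gaussian fit below plays
# the role of a generator that models the high-probability bulk well.
real = rng.standard_t(df=3, size=100_000)
synthetic = rng.normal(real.mean(), real.std(), size=100_000)

# Define "rare events" as the top 0.1% of the real distribution.
threshold = np.quantile(real, 0.999)
real_tail = (real > threshold).mean()        # ~0.001 by construction
synth_tail = (synthetic > threshold).mean()  # far smaller

print(f"rare-event rate, real:      {real_tail:.5f}")
print(f"rare-event rate, synthetic: {synth_tail:.5f}")
```

Both samples have nearly identical means and standard deviations, which is exactly why aggregate statistics, and aggregate TSTR scores, fail to notice the missing tail.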
Fix: Do not report TSTR as a single aggregate. Report it separately for each decile of your target variable. The deciles where synthetic-trained performance drops most sharply below real-trained performance tell you exactly which regions your synthetic data fails to represent.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import pandas as pd
import numpy as np

def tstr_by_decile(
    real_train: pd.DataFrame,
    synthetic_train: pd.DataFrame,
    real_test: pd.DataFrame,
    target_col: str,
    n_deciles: int = 10
) -> pd.DataFrame:
    """
    Runs TSTR evaluation stratified by deciles of the target variable.
    Returns a comparison dataframe for real vs synthetic training performance.
    """
    results = []
    real_test = real_test.copy()
    real_test['decile'] = pd.qcut(
        real_test[target_col], q=n_deciles, labels=False, duplicates='drop'
    )
    feature_cols = [c for c in real_train.columns if c != target_col]

    for label, train_df in [("Real", real_train), ("Synthetic", synthetic_train)]:
        clf = RandomForestClassifier(n_estimators=100, random_state=42)
        clf.fit(train_df[feature_cols], train_df[target_col])
        for decile_id, group in real_test.groupby('decile'):
            if len(group[target_col].unique()) < 2:
                continue  # AUC is undefined on a single-class slice
            score = roc_auc_score(
                group[target_col],
                clf.predict_proba(group[feature_cols])[:, 1]
            )
            results.append({
                'Train Source': label,
                'Decile': decile_id,
                'AUC-ROC': round(score, 4)
            })

    return pd.DataFrame(results).pivot(
        index='Decile', columns='Train Source', values='AUC-ROC'
    )

decile_results = tstr_by_decile(real_train, synthetic_train, real_test, 'fraud_flag')
print(decile_results)
```
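Once the per-decile table exists, the actionable number is the gap between real-trained and synthetic-trained performance in each decile. A small sketch with made-up numbers shaped like the output above:

```python
import pandas as pd

# Hypothetical per-decile TSTR results (illustrative, not real measurements).
decile_results = pd.DataFrame(
    {"Real": [0.91, 0.90, 0.88], "Synthetic": [0.90, 0.86, 0.67]},
    index=pd.Index([0, 5, 9], name="Decile"),
)

# The gap column points directly at the deciles the generator failed on.
decile_results["Gap"] = decile_results["Real"] - decile_results["Synthetic"]
worst = decile_results["Gap"].idxmax()
print(decile_results)
print(f"Worst decile: {worst} (gap = {decile_results.loc[worst, 'Gap']:.2f})")
```

A small, roughly uniform gap across deciles is healthy; a gap concentrated in one or two deciles is the tail loss problem made visible.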

Problem #3: Privacy Metrics Treat All Features Equally When They Shouldn't
Membership inference risk is the most common privacy metric. It asks one question: can an attacker determine whether a particular record was in the training data set? Low scores are good news.
However, it is measured at the record level, so it captures the risk of identifying a record as a whole. The riskier attack is attribute inference: given publicly available information about an individual's features, can an attacker deduce a sensitive attribute from the synthetic data? This is the attack model regulators care about under GDPR's re-identification standard, and it operates at the level of feature combinations, not whole records.
A consensus privacy metrics framework (Pilgram et al., 2025) defined three distinct types of risk: singling out (identifying a single individual), linkability (linking records across data sets), and inference (deducing sensitive attributes from combinations of quasi-identifiers). Practitioners almost exclusively measure the first. The third is where sensitive data actually leaks, and it is completely invisible to standard membership inference scoring.
Fix: Rank your features by sensitivity before synthesizing. Categorize them as public (safe to include without restriction), quasi-identifiers (combinations of public features that could enable linkage), and sensitive (the attributes you are trying to protect). Measure membership inference risk on the sensitive features specifically, not the data set as a whole. Then run an attribute inference test: train an external model on the synthetic data to predict each sensitive feature from the quasi-identifiers, and compare its accuracy on held-out real data against a simple baseline. If the synthetic-trained model beats the baseline by a wide margin, your synthetic data is leaking.
```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def attribute_inference_risk(
    synthetic_df: pd.DataFrame,
    real_test_df: pd.DataFrame,
    quasi_identifiers: list,
    sensitive_feature: str
) -> dict:
    """
    Estimates attribute inference risk by checking how well
    a model trained on synthetic data predicts a sensitive feature
    using only quasi-identifiers.

    High accuracy on real test data = synthetic data is leaking
    information about the sensitive attribute.
    """
    clf = GradientBoostingClassifier(random_state=42)
    clf.fit(synthetic_df[quasi_identifiers], synthetic_df[sensitive_feature])

    real_accuracy = clf.score(
        real_test_df[quasi_identifiers],
        real_test_df[sensitive_feature]
    )
    # Baseline: always predicting the majority class.
    majority_class_accuracy = (
        real_test_df[sensitive_feature].value_counts(normalize=True).max()
    )
    lift = real_accuracy - majority_class_accuracy

    return {
        "inference_accuracy_on_real": round(real_accuracy, 4),
        "baseline_accuracy": round(majority_class_accuracy, 4),
        "inference_lift": round(lift, 4),
        "risk_level": "HIGH" if lift > 0.10 else "MODERATE" if lift > 0.05 else "LOW"
    }

risk = attribute_inference_risk(
    synthetic_df, real_test_df,
    quasi_identifiers=['age_band', 'region', 'employment_status'],
    sensitive_feature='income_bracket'
)
print(risk)
```
An inference lift above 0.10 (or whatever threshold you set) means your synthetic data set teaches an attacker more about your users' sensitive attributes than guessing would. At that point it does not matter that your membership inference score is comfortably below its threshold; that number is answering the wrong question.
The Unified Evaluation Framework
These three problems are really one problem: each comes from taking metrics designed to describe the characteristics of a data set and using them to certify that data set for production deployment. Those are two very different tasks.
Below is the complete set of checks, pairing each standard metric with the augmented check that closes its gap:

| Dimension | Standard Metric | What It Misses | Augmented Check |
|---|---|---|---|
| Fidelity | KL divergence, KS test | Correlation structure between features | Correlation Drift Score (Frobenius norm) |
| Utility | TSTR average AUC | Tail performance on rare events | TSTR stratified by target decile |
| Privacy | Membership inference risk | Attribute inference via quasi-identifiers | Attribute Inference Lift test |

The Right Threshold Depends on Your Use Case
The most overlooked takeaway from the FCA-ICO-Alan Turing Institute roundtable on synthetic data validation was this: “zero risk = zero utility.” Synthetic data cannot be both completely private and maximally useful. The question is not “Does the data pass?” but “Do the trade-offs fit the use case?”
Synthetic data used for internal QA testing of an application needs high fidelity and structural accuracy; since access to it is controlled, privacy matters less. Data you release externally, across organizations, to regulators, or for research, needs stronger privacy guarantees, and in exchange you can accept lower statistical fidelity.
Therefore, when developing your evaluation framework, define the use case before you evaluate your synthetic data. Answer the following questions before you generate synthetic data:
1) Who will have access to this synthetic dataset and under what conditions? This establishes your threshold for privacy.
2) What downstream task will this data train or test? This defines the utility metrics that are load-bearing vs. noise.
3) Which features and feature relationships does the downstream task depend on? This defines where you must preserve fidelity and where you can tolerate variance.
Establish these thresholds based on your answers before running the synthesis, and evaluate against your thresholds, not the tool's defaults.
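One way to make the thresholds explicit is to encode them as a use-case profile that is written down before synthesis. The class and the preset values below are illustrative assumptions, not published standards:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvaluationThresholds:
    """Use-case-specific gates, fixed before synthesis (values illustrative)."""
    max_correlation_drift: float   # Frobenius norm of corr-matrix difference
    max_tail_auc_gap: float        # worst-decile real-vs-synthetic AUC gap
    max_inference_lift: float      # attribute-inference lift over baseline

# Hypothetical presets: internal QA demands fidelity and tolerates more
# privacy risk; external release is the reverse trade-off.
INTERNAL_QA = EvaluationThresholds(
    max_correlation_drift=0.3, max_tail_auc_gap=0.05, max_inference_lift=0.15
)
EXTERNAL_RELEASE = EvaluationThresholds(
    max_correlation_drift=0.7, max_tail_auc_gap=0.15, max_inference_lift=0.05
)

def passes(drift: float, tail_gap: float, lift: float,
           t: EvaluationThresholds) -> bool:
    """True only when all three augmented checks clear the profile's gates."""
    return (drift <= t.max_correlation_drift
            and tail_gap <= t.max_tail_auc_gap
            and lift <= t.max_inference_lift)
```

The same measured data set can legitimately pass one profile and fail another, which is exactly the point: certification is relative to the use case, not absolute.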
Closing: The Quality Gap Is A Measurement Gap
The fraud detection model did not fail because the synthetic data was poor. It failed because the team evaluated the wrong characteristics and drew incorrect conclusions from correct measurements. Fidelity, utility, and privacy are the right dimensions.
The standard metrics within each dimension are good starting points, but they were built to measure and describe data, not to certify it for production use. Closing the measurement gap requires three additional checks that cover what the standard metrics miss: correlation drift, tail utility by decile, and attribute inference risk.
These three assessments do not require specialized tools; the implementations in this article run on standard scikit-learn, pandas, and NumPy. The hard part is not writing the code, it is asking the right questions before your model reaches production.