Why Your ML Model Works in Training But Fails in Production

I have worked on real-time fraud detection systems and recommendation models for product companies, and they looked excellent during development. Offline metrics were strong. AUC curves were stable across validation windows. Feature importance plots told a clean, intuitive story. We shipped with confidence.

A few weeks later, our metrics started to drift.

Click-through rates on recommendations began to slide. Fraud models behaved inconsistently during peak hours. Some decisions felt overly confident, others oddly blind. The models themselves had not degraded. There were no sudden data outages or broken pipelines. What failed was our understanding of how the system behaved once it met time, latency, and delayed truth in the real world.

This article is about those failures. The quiet, unglamorous problems that show up only when machine learning systems collide with reality. Not optimizer choices or the latest architecture. The problems that do not appear in notebooks, but surface on 3 a.m. dashboards.

My message is simple: most production ML failures are data and time problems, not modeling problems. If you do not design explicitly for how information arrives, matures, and changes, the system will quietly make those assumptions for you.

Time Travel: An Assumption Leak

Time travel is the most common production ML failure I have seen, and also the least discussed in concrete terms. Everyone nods when you mention leakage. Very few teams can point to the exact row where it happened.

Let me make it explicit.

Imagine a fraud dataset with two tables:

  1. transactions: when the payment happened. The transactions table shows a user making multiple payments on December 24th, all before mid-afternoon. (Image by author, generated using synthetic data for illustration)
  2. chargebacks: when the fraud outcome was reported. The chargebacks table shows a fraud report arriving at 6:40 PM the same day. (Image by author, generated using synthetic data for illustration)

The feature we want is user_chargeback_count_last_30_days.

The batch job runs at the end of the day, just before midnight, and computes chargeback counts for the last 30 days. For user U123, the count is 1. As of midnight, that is factually correct.

Image by author, generated using synthetic data for illustration

Now look at the final joined training dataset.

Morning transactions at 9:10 AM and 11:45 AM already carry a chargeback count of 1. At the time those payments were made, the chargeback had not yet been reported. But the training data does not know that. Time has been flattened.
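To make the flattening concrete, here is a minimal sketch of the kind of join that produces it. This is a reconstruction with illustrative table and column names (txn_time, feature_date), not the actual pipeline.

import pandas as pd

# Two morning payments from the example above.
transactions = pd.DataFrame({
    "user_id": ["U123", "U123"],
    "txn_time": pd.to_datetime(["2024-12-24 09:10", "2024-12-24 11:45"]),
})

# The daily batch runs just before midnight and stores one value per user per day.
# By then the 6:40 PM chargeback has already been reported, so the count is 1.
daily_feature = pd.DataFrame({
    "user_id": ["U123"],
    "feature_date": pd.to_datetime(["2024-12-24"]),
    "user_chargeback_count_last_30_days": [1],
})

# Joining on (user_id, calendar day) stamps that end-of-day value onto every
# transaction from the same day, including the ones made before the report.
transactions["feature_date"] = transactions["txn_time"].dt.normalize()
leaky_training_set = transactions.merge(daily_feature, on=["user_id", "feature_date"])
print(leaky_training_set)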

This is where the model cheats.

Image by author, generated using synthetic data for illustration

From the model’s perspective, risky looking transactions already come with confirmed fraud signals. Offline recall improves dramatically. Nothing looks wrong at this point.

But in production, the model never sees the future.

When deployed, those early transactions do not have a chargeback count yet. The signal disappears and performance collapses. 

This is not a modeling mistake. It is an assumption leak.

The hidden assumption is that a daily batch feature is valid for all events on that day. It is not. A feature is only valid if it could have existed at the exact moment the prediction was made. 

Every feature must answer one question:

“Could this value have existed at the exact moment the prediction was made?”

If the answer is not a confident yes, the feature is invalid. 
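The fix is to compute the feature as of each prediction time rather than as of the batch boundary. Here is a minimal sketch of a point-in-time version of the same feature, again with illustrative names; a production feature store would do this more efficiently, but the rule is the same.

import pandas as pd

transactions = pd.DataFrame({
    "user_id": ["U123", "U123"],
    "txn_time": pd.to_datetime(["2024-12-24 09:10", "2024-12-24 11:45"]),
})
chargebacks = pd.DataFrame({
    "user_id": ["U123"],
    "report_time": pd.to_datetime(["2024-12-24 18:40"]),
})

def chargeback_count_last_30_days(txns, cbs):
    # For each transaction, count only chargebacks that had already been
    # reported before the transaction happened, within a 30-day window.
    counts = []
    for _, txn in txns.iterrows():
        window_start = txn["txn_time"] - pd.Timedelta(days=30)
        known_then = (
            (cbs["user_id"] == txn["user_id"])
            & (cbs["report_time"] < txn["txn_time"])
            & (cbs["report_time"] >= window_start)
        )
        counts.append(int(known_then.sum()))
    return txns.assign(user_chargeback_count_last_30_days=counts)

print(chargeback_count_last_30_days(transactions, chargebacks))
# Both morning payments now get a count of 0, which is exactly what the
# model will see in production.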

Feature Defaults That Become Signals

After time travel, this is the failure I have seen most often in production systems. Unlike leakage, this one does not rely on the future. It relies on silence.

Most engineers treat missing values as a hygiene problem: fill them with the mean, the median, or some other imputation technique and move on.

These defaults feel harmless. Something safe enough so the model can keep running.

That assumption turns out to be expensive.

In real systems, missing rarely means random. Missing often means new, unknown, not yet observed, or not yet trusted. When we collapse all of that into a single default value, the model does not see a gap. It sees a pattern.

Let me make this concrete.

I first ran into this in a real-time fraud system where we used a feature called avg_transaction_amount_last_7_days. For active users, this value was well behaved. For new or inactive users, the feature pipeline returned a default value of zero.

Image by author, generated using synthetic data for illustration

To illustrate how the default value became a strong proxy for user status, I computed the observed fraud rate grouped by the feature’s value:

# Observed fraud rate for each value of the rolling-average feature
data.groupby("avg_txn_amount_last_7_days")["is_fraud"].mean()

In this data, every user with an average transaction amount of zero is non-fraud. Not because zero spending is inherently safe, but because zero implicitly encodes "new or inactive user". The model does not learn "low spending is safe". It learns "missing history means safe".

The default has become a signal.

During training, this looks good: precision improves. Then production traffic changes.

A downstream service starts timing out during peak hours. Suddenly, active users temporarily lose their history features. Their avg_transaction_amount_last_7_days flips to zero. The model confidently marks them as low risk.

Experienced teams handle this differently. They separate absence from value and track feature availability explicitly. Most importantly, they never allow silence to masquerade as information.
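One concrete way to separate absence from value is to record missingness as its own feature before imputing anything. A minimal sketch, assuming a pandas feature frame; the column names are illustrative.

import numpy as np
import pandas as pd

# NaN here means "no history available right now", which is information in itself.
features = pd.DataFrame({
    "user_id": ["U1", "U2", "U3"],
    "avg_txn_amount_last_7_days": [1850.0, np.nan, 42.5],
})

# 1. Capture absence explicitly before any imputation touches the column.
features["avg_txn_amount_7d_is_missing"] = (
    features["avg_txn_amount_last_7_days"].isna().astype(int)
)

# 2. Impute with a neutral value. The model can still tell "imputed" apart
#    from "genuinely low" because the flag travels with the feature.
median_value = features["avg_txn_amount_last_7_days"].median()
features["avg_txn_amount_last_7_days"] = (
    features["avg_txn_amount_last_7_days"].fillna(median_value)
)
print(features)

The same flag doubles as a monitoring signal: a sudden spike in the missingness rate looks exactly like the downstream timeout scenario above, and it is far easier to alert on than a silent flood of zeros.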

Population Shift Without Distribution Shift

This failure mode took me much longer to recognize, mostly because all the usual alarms stayed silent.

When people talk about data drift, they usually mean distribution shift. Feature histograms move. Percentiles change. KS tests light up dashboards. Everyone understands what to do next. Investigate upstream data, retrain, recalibrate.

Population shift without distribution shift is different. Here, the feature distributions remain stable. Summary statistics barely move. Monitoring dashboards look reassuring. And yet, model behavior degrades steadily.

I first encountered this in a large-scale payments risk system that operated across multiple user segments. The model consumed transaction-level features like amount, time of day, device signals, velocity counters, and merchant category codes. All of these features were heavily monitored. Their distributions barely changed month over month.

Still, fraud rates started creeping up in a very specific slice of traffic. What changed was not the data. It was who the data represented.

Over time, the product expanded into new user cohorts. New geographies with different payment habits. New merchant categories with unfamiliar transaction patterns. Promotional campaigns that brought in users who behaved differently but still fell within the same numeric ranges. From a distribution perspective, nothing looked unusual. But the underlying population had shifted.

The model had been trained mostly on mature users with long behavioral histories. As the user base grew, a larger fraction of traffic came from newer users whose behavior looked statistically similar but semantically different. A transaction amount of 2,000 meant something very different for a long tenured user than for someone on their first day. The model did not know that, because we had not taught it to care.

(Figure: population shift without distribution shift)

The figure above shows why this failure mode is difficult to detect in practice. The first two plots show transaction amount and short-term velocity distributions for mature and new users. From a monitoring perspective, these features appear stable, with the two populations overlapping almost completely. If this were the only signal available, most teams would conclude that the data pipeline and model inputs remain healthy.

The third plot reveals the real problem. Even though the feature distributions are nearly identical, the fraud rate differs substantially across populations. The model applies the same decision boundaries to both groups because the inputs look familiar, but the underlying risk is not the same. What has changed is not the data itself, but who the data represents.

As traffic composition changes through growth or expansion, those assumptions stop holding, even though the data continues to look statistically normal. Without explicitly modeling population context or evaluating performance across cohorts, these failures remain invisible until business metrics begin to degrade.
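Because aggregate distributions hide this, the practical countermeasure is to slice evaluation and monitoring by cohort. A minimal sketch, assuming a scored dataset with a tenure column; the names and thresholds are illustrative.

import pandas as pd

scored = pd.DataFrame({
    "user_tenure_days": [3, 5, 2, 400, 720, 365],
    "is_fraud":         [1, 0, 1, 0,   0,   0],
    "model_score":      [0.20, 0.10, 0.15, 0.30, 0.05, 0.40],
})

# Bucket traffic into cohorts the model never saw as an explicit feature.
scored["cohort"] = pd.cut(
    scored["user_tenure_days"],
    bins=[0, 30, 10_000],
    labels=["new (<30d)", "mature"],
)

# Compare traffic share, base rate, and score behaviour per cohort,
# not just globally.
report = scored.groupby("cohort", observed=True).agg(
    traffic_share=("is_fraud", "size"),
    fraud_rate=("is_fraud", "mean"),
    mean_score=("model_score", "mean"),
)
print(report)

Tracking the traffic_share column over time also catches the composition shift itself, even when every per-feature drift test stays quiet.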

Before You Go

None of the failures in this article were caused by bad models.

The architectures were reasonable. The features were thoughtfully designed. What failed was the system around the model, specifically the assumptions we made about time, absence, and who the data represented.

Time is not a static index. Labels arrive late. Features mature unevenly. Batch boundaries rarely align with decision moments. When we ignore that, models learn from information they will never see again.

If there is one takeaway, it is this: strong offline metrics are not proof of correctness. They are proof that the model fits the assumptions you gave it. The real work of machine learning begins when those assumptions meet reality.

Design for that moment.

References & Further Reading

[1] ROC Curves and AUC (Google Machine Learning Crash Course)
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

[2] Kolmogorov–Smirnov Test (Wikipedia)
https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

[3] Data Distribution Shifts and Monitoring (Chip Huyen)
https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html
