Machine Learning at Scale: Managing More Than One Model in Production



Have you ever asked yourself how real machine learning products actually run inside major tech companies or departments? If yes, this article is for you 🙂

Before discussing scalability, please don’t hesitate to read my first article on the basics of machine learning in production.

In that article, I mentioned that I’ve spent 10 years working as an AI engineer in the industry. Early in my career, I learned that a model in a notebook is just a mathematical hypothesis. It only becomes useful when its output reaches a user, powers a product, or generates money.

I’ve already shown you what “Machine Learning in Production” looks like for a single project. But today, the conversation is about Scale: managing tens, or even hundreds, of ML projects simultaneously. In recent years, we have moved from the Sandbox Era into the Infrastructure Era. “Deploying a model” is now a non-negotiable skill; the real challenge is ensuring a massive portfolio of models works reliably and safely.


1. Leaving the Sandbox: The Strategy of Availability

To understand ML at scale, you first need to leave the “Sandbox” mindset behind you. In a sandbox, you have static data and one model. If it drifts, you see it, you stop it, you fix it.

But once you transition to Scale Mode, you’re no longer managing a model; you’re managing a portfolio. This is where the CAP Theorem (Consistency, Availability, and Partition Tolerance) becomes your reality: a distributed system cannot guarantee all three properties at once, so when something breaks you must choose which one to sacrifice. In a single-model setup, you can try to balance the trade-offs, but at scale you must choose your battles, and more often than not, Availability becomes the top priority.

Why? Because when you have 100 models running, something is always breaking. If you stopped the service every time a model drifted, your product would spend more time offline than online.

Since we cannot stop the service, we design models to fail “cleanly.” Take a recommendation system: if its model receives corrupted data, it shouldn’t crash or show a “404 error.” It should fall back to a safe default (like showing the “Top 10 Most Popular” items). The user stays happy and the system stays available, even if the result is suboptimal. But to do this, you need to know when to trigger that fallback, and that leads us to our biggest challenge at scale: monitoring.
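To make this concrete, here is a minimal sketch of a “fail cleanly” wrapper. The `recommender` object, the feature dictionary, and the cached `POPULAR_FALLBACK` list are all illustrative assumptions, not a real API:

```python
# Hypothetical "Top 10 Most Popular" list, precomputed and cached offline.
POPULAR_FALLBACK = ["item_1", "item_2", "item_3"]

def recommend(user_features, recommender):
    """Return personalized items, or a safe default if anything looks wrong."""
    try:
        # Guard against corrupted input before it ever reaches the model.
        if user_features is None or any(v is None for v in user_features.values()):
            return POPULAR_FALLBACK
        predictions = recommender.predict(user_features)
        if not predictions:  # an empty output is also a failure mode
            return POPULAR_FALLBACK
        return predictions
    except Exception:
        # Never surface a crash to the user: stay available, degrade gracefully.
        return POPULAR_FALLBACK
```

The key design choice is that every failure path, including exceptions we didn’t anticipate, converges on the same cheap, precomputed default.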


2. The Monitoring Challenge: Why Traditional Metrics Die at Scale

When I say that at scale our systems must fail “cleanly,” you might think it’s easy: just monitor accuracy. But at scale, accuracy alone is not enough, and I will tell you exactly why:

  • The Lack of Human Consensus: In Computer Vision, for example, monitoring is easy because humans agree on the truth (it’s a dog or it’s not). But in a Recommendation System or an Ad-ranking model, there is no “Gold Standard.” If a user doesn’t click, is the model bad? Or is the user just not in the mood?
  • The Feature Engineering Trap: Because we can’t easily measure “truth” through a simple metric, we over-compensate. We add hundreds of features to the model, hoping that “more data” will solve the uncertainty.
  • The Theoretical Ceiling: We fight for 0.1% accuracy gains without knowing if the data is just too noisy to give more. We are chasing a “ceiling” we can’t see.

Let’s connect the dots to understand where we are going and why this matters: because monitoring “truth” is nearly impossible at scale (these are our monitoring dead zones), we can’t rely on simple alerts to tell us when to stop. This is exactly why we prioritize Availability and safe fallbacks: we assume the model might be failing without the metrics telling us, so we build a system that can survive that “fuzzy” failure.
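One common way to monitor without ground-truth labels is to watch the *distribution* of the model’s own scores: if today’s predictions look very different from a trusted reference window, something upstream probably broke. A sketch using the Population Stability Index (PSI) follows; the 10-bin histogram and the 0.25 threshold are conventional rules of thumb, not universal values:

```python
import math
from collections import Counter

def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores in [0, 1)."""
    def hist(xs):
        counts = Counter(min(int(x * bins), bins - 1) for x in xs)
        n = len(xs)
        # eps-smoothing avoids log(0) on empty bins.
        return [(counts.get(i, 0) + eps) / (n + bins * eps) for i in range(bins)]
    ref, cur = hist(reference), hist(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

def should_fall_back(reference_scores, live_scores, threshold=0.25):
    # PSI > 0.25 is a widely used heuristic for "significant shift".
    return psi(reference_scores, live_scores) > threshold
```

Note that this check needs no labels at all: it can trigger the safe fallback long before any accuracy metric could be computed.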


3. The Engineering Wall

Now that we have discussed the strategy and monitoring challenges, we are not yet ready to scale, as we have not yet addressed the infrastructure aspect. Scaling requires engineering skills just as much as data science skills.

We cannot talk about scaling if we don’t have a solid, secure infrastructure. Because the models are complex, and because Availability is our number one priority, we need to think seriously about the architecture we set up.

At this stage, my honest advice is to surround yourself with a team or people who are used to building big infrastructures. You don’t necessarily need a massive cluster or a supercomputer, but you do need to think about these three execution basics:

  • Cloud vs. Device: A server gives you power and is easy to monitor, but it’s expensive; an on-device model cuts latency and server cost, but it’s harder to update and observe. Your choice depends entirely on Cost vs. Control.
  • The Hardware: You simply can’t put every model on a GPU; you’d go bankrupt. You need a Tiered Strategy: run your simple “fallback” models on cheap CPUs, and reserve the expensive GPUs for the heavy “money-maker” models.
  • Optimization: At scale, a 1-second lag in your fallback mechanism is a failure. You aren’t just writing Python anymore; you must learn to compile and optimize your code for specific chips so the “Fail Cleanly” switch happens in milliseconds.
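The tiered strategy and the millisecond fallback switch can be combined in one serving pattern: try the expensive model under a strict latency budget, and degrade to the cheap tier the instant it blows that budget. The model objects and the 50 ms budget below are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

# A small worker pool for the expensive tier (size is an arbitrary example).
_executor = ThreadPoolExecutor(max_workers=4)

def predict_tiered(features, gpu_model, cpu_model, budget_s=0.050):
    """Return the heavy model's answer if it lands within budget,
    otherwise the cheap CPU fallback."""
    future = _executor.submit(gpu_model.predict, features)
    try:
        # Wait at most `budget_s` seconds for the "money-maker" model.
        return future.result(timeout=budget_s)
    except Exception:
        # The "Fail Cleanly" switch: timeout or crash, the cheap tier answers.
        return cpu_model.predict(features)
```

Here the latency budget is enforced by the caller, so a slow GPU model can never drag the whole request past its deadline.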

4. Be Careful of Label Leakage

So, you’ve anticipated the failures, worked on availability, sorted the monitoring, and built the infrastructure. You probably think you’re finally ready to master scalability. Actually, not yet. There is an issue you simply can’t anticipate if you have never worked in a real environment.

Even if your engineering is perfect, Label Leakage can ruin your strategy and every system running on top of your models.

In a single project, you might spot leakage in a notebook. But at scale, where data comes from 50 different pipelines, leakage becomes almost invisible.

The Churn Example: Imagine you’re predicting which users will cancel their subscription. Your training data has a feature called Last_Login_Date. The model looks perfect with 99% F1 score.

But here’s what actually happened: The database team set up a trigger that “clears” the login date field the moment a user hits the “Cancel” button. Your model sees a “Null” login date and realizes, “Aha! They canceled!”

In the real world, at the exact millisecond the model needs to make a prediction before the user cancels, that field isn’t Null yet. The model is looking at the answer from the future.

This is a basic example just so you can understand the concept. But believe me, if you have a complex system with real-time predictions (which happens often with IoT), this is incredibly hard to detect. You can only avoid it if you are aware of the problem from the start.

My tips:

  • Feature Latency Monitoring: Don’t just monitor the value of the data, monitor when it was written vs. when the event actually happened.
  • The Millisecond Test: Always ask: “At the exact moment of prediction, does this specific database row actually contain this value yet?”

Of course, these are simple questions, but the best time to evaluate this is during the design phase, before you ever write a line of production code.
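The “Millisecond Test” can even be automated if your feature store records write timestamps. The sketch below flags any feature whose value was written *after* the moment of prediction, which is exactly what happens in the churn example; the log format and field names are hypothetical:

```python
from datetime import datetime

def leaky_features(feature_write_log, prediction_time):
    """feature_write_log: {feature_name: timestamp when the value was written}.
    Returns features that could not have existed at prediction time."""
    return [name
            for name, written_at in feature_write_log.items()
            if written_at > prediction_time]  # arrived after the prediction moment
```

Run this check over a sample of historical rows during the design phase: a feature that is systematically written after prediction time is a leakage suspect before a single line of production code exists.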

5. Finally, The Human Loop

The final piece of the puzzle is Accountability. At scale, our metrics are fuzzy, our infrastructure is complex, and our data is leaky, so we need a “Safety Net.”

  • Shadow Deployment: This is mandatory for scale. You deploy “Model B” but don’t show its results to users. You let it run “in the shadows” for a week, comparing its predictions to the “Truth” that eventually arrives. If it’s stable, only then do you promote it to “Live.”
  • Human-in-the-Loop: For high-stakes models, you need a small team to audit the “Safe Defaults.” If your system has fallen back to “Most Popular Items” for three days, a human needs to ask why the main model hasn’t recovered.
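A shadow deployment can be sketched in a few lines: Model A answers the user, Model B only gets logged, and promotion is a separate offline decision. For simplicity, agreement with the live model stands in here for the “Truth” that eventually arrives; the names and the 95% threshold are illustrative assumptions:

```python
shadow_log = []

def serve(features, live_model, shadow_model):
    live_pred = live_model.predict(features)
    try:
        shadow_pred = shadow_model.predict(features)
    except Exception:
        shadow_pred = None  # a shadow failure must never hurt the user
    shadow_log.append((features, live_pred, shadow_pred))
    return live_pred  # only the live model's answer reaches the user

def ready_to_promote(log, min_agreement=0.95):
    """After a week of logs, promote Model B only if it tracked Model A closely."""
    if not log:
        return False
    matches = sum(1 for _, live, shadow in log if shadow == live)
    return matches / len(log) >= min_agreement
```

The essential property is the asymmetry: the shadow model can crash, time out, or drift freely, and the user never notices.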

And a quick recap before you start working with ML at scale:

  • Since we can’t be perfect, we choose to stay online (Availability) and fail safely.
  • Availability is our number-one metric, since monitoring at scale is “fuzzy” and traditional metrics are unreliable.
  • We build the infrastructure (Cloud/Hardware) to make these safe failures fast.
  • We watch out for “cheating” data (Leakage) that makes our fuzzy metrics look too good to be true.
  • We use Shadow Deploys to prove the model is safe before it ever touches a customer.

And remember: your scale is only as good as your safety net. Don’t let your work end up among the 87% of ML projects that reportedly never make it to production.


👉 LinkedIn: Sabrine Bendimerad

👉 Medium: https://medium.com/@sabrine.bendimerad1

👉 Instagram: https://tinyurl.com/datailearn
