The industry’s outliers have distorted our definition of Recommender Systems. TikTok, Spotify, and Netflix employ hybrid deep learning models combining collaborative- and content-based filtering to deliver personalized recommendations you didn’t even know you’d like. If you’re considering a RecSys role, you might expect to dive into these right away. But not all RecSys problems operate — or need to operate — at this level. Most practitioners work with relatively simple, tabular models, often gradient-boosted trees. Until attending RecSys ’25 in Prague, I thought my experience was an outlier. Now I believe this is the norm, hidden behind the huge outliers that drive the industry’s state of the art. So what sets these giants apart from most other companies? In this article, I use the framework mapped in the image above to reason about these differences and help place your own recommendation work on the spectrum.
Most recommendation systems begin with a candidate generation phase, reducing millions of possible items to a manageable set that can be re-ranked by higher-latency solutions. But candidate generation isn’t always the uphill battle it’s made out to be, nor does it necessarily require machine learning. Contexts with well-defined scopes and hard filters often don’t require complex querying logic or vector search. Consider Booking.com: when a user searches for “4-star hotels in Barcelona, September 12-15,” the geography and availability constraints have already narrowed millions of properties down to a few hundred—even if the backend systems handling that filtering are themselves complex. The real challenge for machine learning practitioners is then ranking these hotels with precision. This is vastly different from Amazon’s product search or the YouTube homepage, where hard filters are absent. In these environments, the system must rely on semantic intent or past behavior to surface relevant candidates from millions or billions of items before re-ranking even takes place.
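As a toy illustration of how far hard filters alone can go, consider the sketch below. The data and column names are invented for illustration, not Booking.com's actual schema:

```python
import pandas as pd

# Hypothetical hotel inventory; columns are illustrative assumptions.
hotels = pd.DataFrame({
    "hotel_id": [101, 102, 103, 104],
    "city": ["Barcelona", "Barcelona", "Madrid", "Barcelona"],
    "stars": [4, 3, 4, 4],
    "available_sep_12_15": [True, True, True, False],
})

# Hard filters do the heavy lifting: no ML, no vector search.
candidates = hotels[
    (hotels["city"] == "Barcelona")
    & (hotels["stars"] == 4)
    & (hotels["available_sep_12_15"])
]
print(candidates["hotel_id"].tolist())  # -> [101]
```

Everything downstream of this filter is a ranking problem over a small, well-scoped candidate set.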
Beyond candidate generation, the complexity of re-ranking is best understood through the two dimensions mapped in the image below. First, the observability of outcomes and the stability of the catalog, which together determine how strong a baseline you can build. Second, the subjectivity of preferences and how learnable they are, which determines how complex your personalization solution has to be.
Observable Outcomes and Catalog Stability
At the left end of the x-axis are businesses that directly observe their most important outcomes. Large merchants like IKEA are a good example of this: when a customer buys an ESKILSTUNA sofa instead of a KIVIK, the signal is unambiguous. Aggregate enough of these, and the company knows exactly which product has the higher purchase rate. When you can directly observe users voting with their wallets, you have a strong baseline that’s hard to beat.
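When outcomes are observed directly, the baseline can be as simple as a purchase-rate leaderboard. A toy sketch with invented numbers:

```python
import pandas as pd

# Toy purchase log: one row per product impression, with a purchase flag.
log = pd.DataFrame({
    "product":   ["KIVIK", "KIVIK", "KIVIK", "ESKILSTUNA", "ESKILSTUNA"],
    "purchased": [1, 0, 1, 0, 1],
})

# The baseline leaderboard: purchase rate per product, best first.
leaderboard = (
    log.groupby("product")["purchased"]
       .mean()
       .sort_values(ascending=False)
)
print(leaderboard)  # KIVIK 0.67, ESKILSTUNA 0.50
```

With enough volume, this simple aggregate is the baseline every fancier model must beat.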
At the other extreme are platforms that can't observe whether their recommendations actually succeeded. Tinder and Bumble might see users match, but they often won't know whether the pair hit it off (especially once the conversation moves to other platforms). Yelp and Google Maps can recommend restaurants, but for the vast majority of recommendations, they can't observe whether you actually visited, just which listings you clicked. Relying on such upper-funnel signals means position bias dominates: items in top positions accumulate interactions regardless of true quality, making it nearly impossible to tell whether engagement reflects genuine preference or mere visibility. Contrast this with the IKEA example: a user might click a restaurant on Yelp simply because it appeared first, but they are far less likely to buy a sofa for that reason. In the absence of a hard conversion, you lose the anchor of a reliable leaderboard, and you have to work much harder to extract signal from the noise. Reviews can offer some grounding, but they are rarely dense enough to serve as a primary signal. Instead, you are left to run endless experiments on your ranking heuristics, constantly tuning logic to squeeze a proxy for quality out of a stream of weak signals.
High-Churn Catalog
Even with observable outcomes, however, a strong baseline is not guaranteed. If your catalog is constantly changing, you may never accumulate enough data to build a proper leaderboard. Real estate platforms like Zillow and secondhand sites like Vinted face the most extreme version: each item has an inventory of one, disappearing the moment it's purchased. This churn forces you to rely on simplistic, rigid sorts like "newest first" or "lowest price per square meter," which are far weaker than conversion leaderboards built on real, dense user signal. To do better, you must lean on machine learning to predict conversion probability from the moment a listing goes live, combining intrinsic attributes with debiased short-term performance to surface the best inventory before it disappears.
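As a minimal sketch of that combination, the snippet below blends a model's intrinsic conversion estimate with a Bayesian-smoothed early click-through rate. The function name, the prior, and the blending weight are all illustrative assumptions, not a reference implementation:

```python
def early_score(model_prob, clicks, impressions,
                prior_ctr=0.02, prior_strength=50, w=0.7):
    """Blend a model's intrinsic conversion estimate with a
    Bayesian-smoothed early click-through rate."""
    # The prior keeps a listing with 2 clicks in 3 impressions from
    # outranking listings with thousands of impressions of history.
    smoothed_ctr = (clicks + prior_ctr * prior_strength) / (impressions + prior_strength)
    return w * model_prob + (1 - w) * smoothed_ctr

# A brand-new listing leans almost entirely on the model's estimate...
print(early_score(model_prob=0.05, clicks=0, impressions=0))
# ...and as traffic accumulates, the smoothed CTR converges to the observed rate.
print(early_score(model_prob=0.05, clicks=40, impressions=1000))
```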
The Ubiquity of Feature-Based Models
Regardless of your catalog's stability or signal strength, the core challenge remains the same: you are trying to improve upon whatever baseline is available. This is typically achieved by training a machine learning (ML) model to predict the probability of engagement or conversion given a specific context. Gradient-boosted decision trees (GBDTs) are the pragmatic choice, being much faster to train and tune than deep learning.
GBDTs predict these outcomes from engineered item features: categorical and numerical attributes that quantify and describe a product. Even before individual preferences are known, GBDTs can also adapt recommendations by leveraging basic user features like country and device type. With these item and user features alone, an ML model can already improve upon the baseline, whether that means debiasing a popularity leaderboard or ranking a high-churn feed. For instance, in fashion e-commerce, models commonly use location and time of year to surface items tied to the season, while simultaneously using country and device to calibrate the price point.
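As a minimal sketch of this setup, the snippet below trains a GBDT on toy item and user features to predict conversion. The feature names and data are invented, and LightGBM stands in for whichever library (XGBoost, CatBoost) you prefer:

```python
import pandas as pd
from lightgbm import LGBMClassifier  # XGBoost or CatBoost would work just as well

# Toy training table; feature names and values are invented for illustration.
df = pd.DataFrame({
    "item_price":    [29.9, 89.0, 15.5, 120.0, 45.0, 60.0],
    "item_category": pd.Categorical(["shoes", "coats", "tees", "coats", "shoes", "tees"]),
    "user_country":  pd.Categorical(["SE", "SE", "ES", "NO", "ES", "SE"]),
    "user_device":   pd.Categorical(["mobile", "web", "mobile", "web", "mobile", "web"]),
    "month":         [1, 12, 7, 11, 6, 3],
    "converted":     [0, 1, 1, 1, 0, 0],  # the observed outcome
})

X, y = df.drop(columns="converted"), df["converted"]

# LightGBM consumes pandas categoricals directly; no one-hot encoding needed.
model = LGBMClassifier(n_estimators=50, min_child_samples=1)  # tiny toy data
model.fit(X, y)

# At serving time, rank candidates by predicted conversion probability.
scores = model.predict_proba(X)[:, 1]
```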
These features give the model what it needs to separate true quality from mere visibility. By learning which intrinsic attributes actually drive conversion, it can correct for the position bias baked into your popularity baseline, identifying items that perform on merit rather than simply because they were ranked at the top. This is harder than it looks: you risk demoting proven winners more than you should, potentially degrading the experience.
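One widely used pattern for this (my assumption; the exact debiasing mechanism isn't specified above) is to include the logged display position as a training feature and then neutralize it at serving time:

```python
import pandas as pd
from lightgbm import LGBMClassifier

# Toy click log where position 1 gets clicked regardless of quality;
# all values are invented for illustration.
log = pd.DataFrame({
    "quality_score": [0.9, 0.2, 0.8, 0.3, 0.7, 0.1] * 10,
    "position":      [1, 1, 2, 2, 3, 3] * 10,
    "clicked":       [1, 1, 1, 0, 1, 0] * 10,
})

features = ["quality_score", "position"]
model = LGBMClassifier(n_estimators=50, min_child_samples=1)
model.fit(log[features], log["clicked"])

# Neutralize position at serving time: score every candidate as if it were
# shown first, so remaining score differences reflect intrinsic attributes.
serving = log[features].copy()
serving["position"] = 1
scores = model.predict_proba(serving)[:, 1]
```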
Contrary to popular belief, feature-based models can also drive personalization, depending on how much semantic information items naturally contain. Platforms like Booking.com and Yelp accumulate rich descriptions, multiple photos, and user reviews that provide semantic depth per listing. These can be encoded into semantic embeddings for personalization: by using the user’s recent interactions, we can calculate similarity scores against candidate items and feed these to the gradient-boosted model as features.
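As a sketch, assuming the item embeddings have already been computed offline by some encoder (the vectors and listing names below are made up), the similarity feature might look like this:

```python
import numpy as np

# Assume items already have semantic embeddings, e.g. from a sentence
# encoder run over descriptions and reviews; these values are invented.
item_embeddings = {
    "listing_a": np.array([0.9, 0.1, 0.0]),
    "listing_b": np.array([0.1, 0.8, 0.2]),
    "listing_c": np.array([0.85, 0.2, 0.1]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Represent the user by the mean embedding of their recent interactions.
recent = ["listing_a"]
user_vec = np.mean([item_embeddings[i] for i in recent], axis=0)

# Similarity to each candidate becomes one more column in the GBDT's feature table.
similarity_features = {
    item: cosine(user_vec, emb) for item, emb in item_embeddings.items()
}
```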
This approach has its limits, however. Feature-based models can recommend based on similarity to recent interactions, but unlike collaborative filtering, they don't directly learn which items tend to be liked by the same users; any such collaborative signal must be precomputed and fed in as input features. Whether this limitation matters depends on something more fundamental: how much users actually disagree.
Subjectivity
Not all domains are equally personal or controversial. In some, users largely agree on what makes a good product once basic constraints are satisfied. We call these convergent preferences, and they occupy the bottom half of the chart. Take Booking.com: travelers may have different budgets and location preferences, but once those are revealed through filters and map interactions, ranking criteria converge. Lower prices are better, more amenities are better, higher review scores are better. Or consider Staples: once a user needs printer paper or AA batteries, brand and price dominate, making preferences remarkably consistent across users.
At the other extreme — the top half — are subjective domains defined by highly fragmented taste. Spotify exemplifies this: one user’s favorite track is another’s immediate skip. Yet, taste rarely exists in a vacuum. Somewhere in the data is a user on your exact wavelength, and machine learning bridges the gap, turning their discoveries from yesterday into your recommendations for today. Here, the value of personalization is enormous, and so is the technical investment required.
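This is the territory of collaborative filtering. As a minimal item-item sketch on a toy implicit-feedback matrix (the data is invented), recommendations here come purely from overlapping behavior, with no item features at all:

```python
import numpy as np

# Toy implicit-feedback matrix: rows = users, columns = tracks (1 = listened).
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])

# Item-item cosine similarity, computed purely from co-listening patterns;
# no audio or metadata features involved.
norms = np.linalg.norm(R, axis=0, keepdims=True)
sim = (R.T @ R) / (norms.T @ norms + 1e-9)

# Recommend for user 0: score unseen tracks by similarity to their history.
user = R[0]
scores = sim @ user
scores[user > 0] = -np.inf  # mask already-heard tracks
print(int(np.argmax(scores)))  # -> track 2, discovered via the overlap with user 1
```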
The Right Data
Subjective taste is only actionable if you have enough data to observe it. Many domains involve distinct preferences but lack the feedback loop to capture them. A niche content platform, new marketplace, or B2B product may face wildly divergent tastes yet lack the clear signal to learn them. Yelp restaurant recommendations illustrate this challenge: dining preferences are subjective, but the platform can't observe actual restaurant visits, only clicks. It can't optimize personalization for the true target (conversions), only for proxy metrics like clicks, and more clicks might actually signal failure, indicating users are browsing multiple listings without finding what they want.
But in subjective domains with dense behavioral data, failing to personalize leaves money on the table. YouTube exemplifies this: with billions of daily interactions, the platform learns nuanced viewer preferences and surfaces videos you didn’t know you wanted. Here, deep learning becomes unavoidable. This is the point where you’ll see large teams coordinating over Jira and cloud bills that require VP approval. Whether that complexity is justified comes down entirely to the data you have.
Know Where You Stand
Understanding where your problem sits on this spectrum is far more valuable than blindly chasing the latest architecture. The industry’s “state-of-the-art” is largely defined by the outliers — the tech giants dealing with massive, subjective inventories and dense user data. Their solutions are famous because their problems are extreme, not because they are universally correct.
However, you’ll likely face different constraints in your own work. If your domain is defined by a stable catalog and observable outcomes, you land in the bottom-left quadrant alongside companies like IKEA and Booking.com. Here, popularity baselines are so strong that the challenge is simply building upon them with machine learning models that can drive measurable A/B test wins. If, instead, you face high churn (like Vinted) or weak signals (like Yelp), machine learning becomes a necessity just to keep up.
But that doesn’t mean you’ll need deep learning. That added complexity only truly pays off in territories where preferences are deeply subjective and there’s enough data to model them. We often treat systems like Netflix or Spotify as the gold standard, but they are specialized solutions to rare conditions. For the rest of us, excellence isn’t about deploying the most complex architecture available; it’s about recognizing the constraints of the terrain and having the confidence to choose the solution that solves your problems.
Images by the author.