I’d like to share a practical variation of Uber’s Two-Tower Embedding (TTE) approach for cases where both user-related data and computing resources are limited. The problem came from a high-traffic discovery widget on the home screen of a food delivery app. The widget shows curated selections such as Italian, Burgers, Sushi, or Healthy. Selections are built from tags: each restaurant can carry multiple tags, and each tile is essentially a tag-defined slice of the catalog, plus some manual curation. In other words, the candidate set is already known, so the real problem is not retrieval but ranking.
At the time, this widget significantly underperformed the other widgets on the discovery (main) screen. Selections were ranked by general popularity, with no personalized signals. We found that users are reluctant to scroll: if they don’t see something interesting within the first 10 to 12 positions, they usually do not convert. Selections, however, can be massive, in some cases up to 1,500 restaurants. On top of that, a single restaurant can appear in several selections: McDonald’s, for example, may land in both Burgers and Ice Cream, yet its popularity is really only earned in the first, while a general popularity sort would put it on top of both.
The product setup makes the problem even less friendly to static solutions such as general popularity sorting. These collections are dynamic and change frequently due to seasonal campaigns, operational needs, or new business initiatives. Because of that, training a dedicated model for each individual selection is not realistic. A useful recommender has to generalize to new tag-based collections from day one.
Before moving to a two-tower-style solution, we tried simpler approaches such as localized popularity ranking at the city-district level and multi-armed bandits. In our case, neither delivered a measurable uplift over a general popularity sort. As part of our research initiative, we then tried adapting Uber’s TTE to our setting.
Two-Tower Embeddings Recap
A two-tower model learns two encoders in parallel: one for the user side and one for the restaurant side. Each tower produces a vector in a shared latent space, and relevance is estimated from a similarity score, usually a dot product. The operational advantage is decoupling: restaurant embeddings can be precomputed offline, while the user embedding is generated online at request time. This makes the approach attractive for systems that need fast scoring and reusable representations.
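A minimal sketch of that decoupling (shapes and names are illustrative, not anyone’s production code): restaurant vectors are precomputed offline, and at request time serving reduces to one user vector and a single batch of dot products over the candidate set.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 32

# Precomputed offline: one embedding per candidate restaurant in the selection.
restaurant_vectors = rng.normal(size=(1500, DIM))

# Produced online at request time by the user tower (stubbed here).
user_vector = rng.normal(size=(DIM,))

# Relevance = dot product in the shared latent space; one matmul scores
# the entire candidate set, so ranking 1,500 items stays cheap.
scores = restaurant_vectors @ user_vector
ranking = np.argsort(-scores)  # best candidates first

top_12 = ranking[:12]  # the positions that actually drive conversion
```

Because the heavy lifting (encoding restaurants) is done offline, the online cost is dominated by one matrix–vector product, which is what makes the approach viable for a high-traffic surface.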
Uber’s write-up focused mainly on retrieval, but it also noted that the same architecture can serve as a final ranking layer when candidate generation is already handled elsewhere and latency must remain low. That second formulation was much closer to our use case.
Our Approach
We kept the two-tower structure but simplified the most resource-heavy parts. On the restaurant side, we did not fine-tune a language model inside the recommender. Instead, we reused a TinyBERT model that had already been fine-tuned for search in the app and treated it as a frozen semantic encoder. Its text embedding was combined with explicit restaurant features such as price, ratings, and recent performance signals, plus a small trainable restaurant ID embedding, and then projected into the final restaurant vector. This gave us semantic coverage without paying the full cost of end-to-end language-model training. For a POC or MVP, a small frozen sentence-transformer would be a reasonable starting point as well.
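A sketch of that restaurant tower, with hypothetical names, dimensions, and numpy standing in for the actual training framework: the frozen text embedding, the explicit features, and a small trainable ID embedding are concatenated and projected into the final restaurant vector.

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative sizes; e.g. a 4-layer TinyBERT emits 312-d sentence vectors.
TEXT_DIM, FEAT_DIM, ID_DIM, OUT_DIM = 312, 8, 16, 32
N_RESTAURANTS = 1000

# Trainable parameters (frozen here for the sketch).
id_embeddings = rng.normal(scale=0.01, size=(N_RESTAURANTS, ID_DIM))
projection = rng.normal(scale=0.1, size=(TEXT_DIM + FEAT_DIM + ID_DIM, OUT_DIM))

def restaurant_tower(text_emb, features, restaurant_id):
    """Frozen text embedding + explicit features + ID embedding -> final vector."""
    x = np.concatenate([text_emb, features, id_embeddings[restaurant_id]])
    return np.tanh(x @ projection)

vec = restaurant_tower(
    text_emb=rng.normal(size=TEXT_DIM),  # from the frozen search encoder
    features=rng.normal(size=FEAT_DIM),  # price, ratings, recent performance
    restaurant_id=42,
)
```

Only `id_embeddings` and `projection` need gradients; the language model stays frozen, which is where most of the training cost is saved.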
We avoided learning a dedicated user-ID embedding and instead represented each user on the fly through their previous interactions. The user vector was built from averaged embeddings of restaurants the customer had ordered from (Uber’s post mentioned this source as well, but the authors do not specify how it was used), together with user and session features. We also used views without orders as a weak negative signal. That mattered when order history was sparse or irrelevant to the current selection. If the model could not clearly infer what the user liked, it still helped to know which restaurants had already been explored and rejected.
The most important modeling choice was filtering that history by the tag of the current selection. Averaging the whole order history created too much noise. If a customer mostly ordered burgers and then opened an Ice Cream selection, a global average could pull the model toward burger places that happened to sell desserts rather than toward the strongest ice cream candidates. By filtering past interactions to matching tags before averaging, we made the user representation contextual instead of global. In practice, this was the difference between modeling long-term taste and modeling current intent.
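The tag-filtered averaging can be sketched as follows (the data layout is hypothetical): only past orders whose restaurant shares a tag with the current selection contribute to the user vector, with the full history as a fallback when nothing matches.

```python
import numpy as np

def user_history_vector(history, selection_tag):
    """history: list of (tags, embedding) pairs for previously ordered restaurants."""
    matching = [emb for tags, emb in history if selection_tag in tags]
    # Fall back to the global average only when nothing matches the context.
    pool = matching if matching else [emb for _, emb in history]
    return np.mean(pool, axis=0)

burger = ({"burgers"}, np.array([1.0, 0.0]))
dessert_burger = ({"burgers", "ice-cream"}, np.array([0.0, 1.0]))
history = [burger, burger, dessert_burger]

# In an "Ice Cream" selection only the matching order is averaged in,
# so the burger-heavy history no longer dominates the user vector.
intent_vec = user_history_vector(history, "ice-cream")
taste_vec = user_history_vector(history, "burgers")
```

With filtering, the toy "Ice Cream" vector collapses onto the single dessert order instead of being dragged toward the two burger orders, which is exactly the current-intent-versus-long-term-taste distinction described above.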
Finally, we trained the model at the session level and used multi-task learning. The same restaurant could be positive in one session and negative in another, depending on the user’s current intent. The ranking head predicted click, add-to-basket, and order jointly, with a simple funnel constraint: P(order) ≤ P(add-to-basket) ≤ P(click). This made the model less static and improved ranking quality compared with optimizing a single target in isolation.
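One common way to make the funnel constraint hold by construction (a sketch, not necessarily the exact parameterization we shipped) is to chain the heads: each deeper funnel stage multiplies the previous stage’s probability by a fresh sigmoid, so P(order) ≤ P(add-to-basket) ≤ P(click) for any logits.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def funnel_probs(click_logit, atb_logit, order_logit):
    """Chained heads: each stage is conditional on the previous one."""
    p_click = sigmoid(click_logit)
    p_atb = p_click * sigmoid(atb_logit)    # P(atb) = P(click) * P(atb | click)
    p_order = p_atb * sigmoid(order_logit)  # P(order) = P(atb) * P(order | atb)
    return p_click, p_atb, p_order

p_click, p_atb, p_order = funnel_probs(1.2, -0.3, 0.5)
```

Because each multiplier lies in (0, 1), the monotonicity never has to be learned or enforced with a penalty term; it is guaranteed by the parameterization itself.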
Offline validation was also stricter than a random split: evaluation used out-of-time data and users unseen during training, which made the setup closer to production behavior.
Outcomes
According to A/B tests, the final system showed a statistically significant uplift in conversion rate. Just as importantly, it was not tied to one widget. Because the model scores a user–restaurant pair rather than a fixed list, it generalized naturally to new selections without architectural changes, since tags are part of a restaurant’s metadata and can be retrieved without any particular selection in mind.
That transferability made the model useful beyond the original ranking surface. We later reused it in Ads, where its CTR-oriented output was applied to individual promoted restaurants with positive results. The same representation learning setup therefore worked both for selection ranking and for other recommendation-like placement problems inside the app.
Further Research
The most obvious next step is multimodality. Restaurant images, icons, and potentially menu visuals can be added as extra branches to the restaurant tower. That matters because click behavior is strongly influenced by presentation. A pizza place inside a pizza selection may underperform if its main image does not show pizza, while a budget restaurant can look premium purely because of its hero image. Text and tabular features do not capture that gap well.
Key Takeaways
- Two-Tower models can work even with limited data. You don’t need Uber-scale infrastructure if candidate retrieval is already solved and the model focuses only on the ranking stage.
- Reuse pretrained embeddings instead of training from scratch. A frozen lightweight language model (e.g., TinyBERT or a small sentence-transformer) can provide strong semantic signals without expensive fine-tuning.
- Averaging embeddings of previously ordered restaurants works surprisingly well when user history is sparse.
- Contextual filtering reduces noise and helps the model capture the user’s current intent, not just long-term taste.
- Negative signals help in sparse environments. Restaurants that users viewed but did not order from provide useful information when positive signals are limited.
- Multi-task learning stabilizes ranking. Predicting click, add-to-basket, and order jointly with funnel constraints produces more consistent scores.
- Design for reuse. A model that scores user–restaurant pairs rather than specific lists can be reused across product surfaces such as selections, search ranking, or ads.