Build a (recipe) recommender chatbot using RAG and hybrid search (Part I)



For this project, we will use recipes from Public Domain Recipes. All recipes are stored as markdown files in this GitHub repository. For this tutorial, I already did some data cleaning and created features from the raw text input. If you are keen on doing the data cleaning part yourself, the code is available on my GitHub repository.

The dataset consists of the following columns:

  • title: the title of the recipe
  • date: the date the recipe was added
  • tags: a list of tags that describe the meal
  • introduction: an introduction to the recipe, the content varies strongly between records
  • ingredients: all required ingredients. Note that I removed the quantities, as they are not needed for creating embeddings and, on the contrary, may lead to undesirable recommendations.
  • direction: all required steps you need to perform to cook the meal
  • recipe_type: indicator if the recipe is vegan, vegetarian, or regular
  • output: contains the title, ingredients, and direction of the recipe; it will later be provided to the chat model as input.

Let’s have a look at the distribution of the recipe_type feature. We see that the majority (60%) of the recipes include fish or meat and aren’t vegetarian-friendly. Approximately 35% are vegetarian-friendly and only 5% are vegan-friendly. This feature will be used as a hard filter for retrieving matching recipes from the vector database.

import re
import json
import spacy
import torch
import openai
import vertexai
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from transformers import AutoModelForMaskedLM, AutoTokenizer
from pinecone import Pinecone, ServerlessSpec
from vertexai.language_models import TextEmbeddingModel
from utils_google import authenticate as authenticate_google
credentials, PROJECT_ID, service_account, pinecone_API_KEY = authenticate_google()
from utils_openai import authenticate as authenticate_openai
OPENAI_API_KEY = authenticate_openai()

openai_client = openai.OpenAI(api_key=OPENAI_API_KEY)

REGION = "us-central1"
vertexai.init(project=PROJECT_ID,
              location=REGION,
              credentials=credentials)

pc = Pinecone(api_key=pinecone_API_KEY)

# download spacy model
#!python -m spacy download en_core_web_sm

recipes = pd.read_json("recipes_v2.json")
recipes.head()
# value_counts() sorts by frequency, so use its index for the bar labels
# instead of unique(), whose order need not match
recipe_type_counts = recipes.recipe_type.value_counts(normalize=True)
plt.bar(recipe_type_counts.index, recipe_type_counts.values)
plt.show()
Distribution of recipe types

Hybrid search uses a combination of sparse and dense vectors and a weighting factor alpha, which allows adjusting the importance of the dense vector in the retrieval process. In the following, we will create dense vectors based on the title, tags, and introduction, and sparse vectors based on the ingredients. By adjusting alpha, we can therefore later determine how much “attention” should be paid to the ingredients the user mentions in their query.
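To make the role of alpha concrete, here is a minimal sketch of the convex weighting that is commonly used for hybrid search. The helper hybrid_scale and the sparse-vector format are my own illustration, not part of the tutorial code.

def hybrid_scale(dense, sparse, alpha):
    # illustration only: scale the dense query vector by alpha and the
    # sparse token weights by (1 - alpha) before querying the index
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [value * alpha for value in dense]
    scaled_sparse = {"indices": sparse["indices"],
                     "values": [value * (1 - alpha) for value in sparse["values"]]}
    return scaled_dense, scaled_sparse

# alpha = 1.0 -> pure dense search, alpha = 0.0 -> pure sparse search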

Before creating the embeddings, a new feature needs to be created that contains the combined information of the title, the tags, and the introduction.

recipes["dense_feature"] = recipes.title + "; " + recipes.tags.apply(lambda x: str(x).strip("[]").replace("'", "")) + "; " + recipes.introduction
recipes["dense_feature"].head()

Finally, before diving deeper into the generation of the embeddings, we’ll have a look at the output column. The second part of the tutorial will be all about creating a chatbot using OpenAI that is able to answer user questions using knowledge from our recipe database. Therefore, after finding the recipes that best match the user query, the chat model needs some information to build its answer on. That’s where the output column comes in, as it contains all the information needed for an adequate answer.

# example output
{'title': 'Creamy Mashed Potatoes',
'ingredients': 'The quantities here are for about four adult portions. If you are planning on eating this as a side dish, it might be more like 6-8 portions. * 1kg potatoes * 200ml milk* * 200ml mayonnaise* * ~100g cheese * Garlic powder * 12-16 strips of bacon * Butter * 3-4 green onions * Black pepper * Salt *You can play with the proportions depending on how creamy or dry you want the mashed potatoes to be.',
'direction': '1. Peel and cut the potatoes into medium sized pieces. 2. Put the potatoes in a pot with some water so that it covers the potatoes and boil them for about 20-30 minutes, or until the potatoes are soft. 3. About ten minutes before removing the potatoes from the boiling water, cut the bacon into little pieces and fry it. 4. Warm up the milk and mayonnaise. 5. Shred the cheese. 6. When the potatoes are done, remove all water from the pot, add the warm milk and mayonnaise mix, add some butter, and mash with a potato masher or a blender. 7. Add some salt, black pepper and garlic powder to taste and continue mashing the mix. 8. Once the mix is somewhat homogeneous and the potatoes are properly mashed, add the shredded cheese and fried bacon and mix a little. 9. Serve and top with chopped green onions.'}

Further, a unique identifier needs to be added to each recipe, which allows retrieving the records of the recommended candidate recipes and their output.

recipes["ID"] = range(len(recipes))

Generate sparse embeddings

The next step involves creating sparse embeddings for all 360 observations. To calculate these embeddings, a more sophisticated method than the frequently used TF-IDF or BM25 approach is used. Instead, the SPLADE Sparse Lexical and Expansion model is applied. A detailed explanation of SPLADE can be found here. Dense embeddings have the same shape for each text input, regardless of the number of tokens in the input. In contrast, sparse embeddings contain a weight for each unique token in the input. The dictionary below represents a sparse vector, where the token ID is the key and the assigned weight is the value.

model_id = "naver/splade-cocondenser-ensembledistil"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

def to_sparse_vector(text, tokenizer, model):
    # tokenize the input and run it through the masked-LM head
    tokens = tokenizer(text, return_tensors='pt')
    output = model(**tokens)
    # SPLADE pooling: log-saturated ReLU over the logits, max-pooled
    # across the sequence dimension with padding masked out
    vec = torch.max(
        torch.log(1 + torch.relu(output.logits)) * tokens.attention_mask.unsqueeze(-1), dim=1
    )[0].squeeze()

    # keep only the non-zero entries: token ID -> weight
    cols = vec.nonzero().squeeze().cpu().tolist()
    weights = vec[cols].cpu().tolist()
    sparse_dict = dict(zip(cols, weights))
    return sparse_dict

sparse_vectors = []

for i in tqdm(range(len(recipes))):
    sparse_vectors.append(to_sparse_vector(recipes.iloc[i]["ingredients"], tokenizer, model))

recipes["sparse_vectors"] = sparse_vectors

Sparse embeddings of the first recipe
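As an optional sanity check (my addition, assuming the tokenizer from above is still in scope), the token IDs can be mapped back to readable tokens to see which terms SPLADE weighted highest for the first recipe.

# decode the token IDs of the first recipe's sparse vector and print
# the ten highest-weighted tokens
first_sparse = recipes.iloc[0]["sparse_vectors"]
id2token = {idx: token for token, idx in tokenizer.get_vocab().items()}
for token_id, weight in sorted(first_sparse.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{id2token[token_id]}: {weight:.3f}")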

Generate dense embeddings

At this point of the tutorial, some costs will arise if you use a text embedding model from VertexAI (Google) or OpenAI. However, if you use the same dataset, the costs will be at most $5. The cost may vary if you use a dataset with more records or longer texts, as you are charged by tokens. If you do not wish to incur any costs but still want to follow the tutorial, particularly the second part, you can download the pandas DataFrame recipes_with_vectors.pkl with pre-generated embedding data from my GitHub repository.

You can choose to use either VertexAI or OpenAI to create the embeddings. OpenAI has the advantage of being easy to set up with an API key, while VertexAI requires logging into Google Console, creating a project, and adding the VertexAI API to your project. Additionally, the OpenAI model allows you to specify the number of dimensions for the dense vector. Nevertheless, both of them create state-of-the-art dense embeddings.

Using VertexAI API

# running this code will create costs !!!
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")

def to_dense_vector(text, model):
    # VertexAI returns a list of TextEmbedding objects; extract the values
    dense_vectors = model.get_embeddings([text])
    return [dense_vector.values for dense_vector in dense_vectors][0]

dense_vectors = []

for i in tqdm(range(len(recipes))):
    dense_vectors.append(to_dense_vector(recipes.iloc[i]["dense_feature"], model))

recipes["dense_vectors"] = dense_vectors

Using OpenAI API

# running this code will create costs !!!

# Create dense embeddings using OpenAIs text embedding model with 768 dimensions
model = "text-embedding-3-small"

def to_dense_vector_openAI(text, client, model, dimensions):
    # the OpenAI client returns the embeddings under response.data[i].embedding
    response = client.embeddings.create(model=model, dimensions=dimensions, input=[text])
    return response.data[0].embedding

dense_vectors = []

for i in tqdm(range(len(recipes))):
    dense_vectors.append(to_dense_vector_openAI(recipes.iloc[i]["dense_feature"], openai_client, model, 768))

recipes["dense_vectors"] = dense_vectors

Upload data to vector database

After generating the sparse and dense embeddings, we have all the necessary data to upload them to a vector database. In this tutorial, Pinecone will be used, as it supports hybrid search with sparse and dense vectors and offers a serverless pricing plan with $100 of free credits. To perform a hybrid search later on, the similarity metric needs to be set to dot product. If we performed only a dense search instead of a hybrid one, we could select any of these similarity metrics: dot product, cosine, and Euclidean distance. More information about similarity metrics and how they calculate the similarity between two vectors can be found here.
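To see why dot product is the required metric, note that a hybrid score is simply the sum of a dense and a sparse dot product; here is a toy example with made-up numbers.

import numpy as np

# made-up numbers: the hybrid score decomposes into a dense and a sparse
# dot-product contribution, which is what the dotproduct metric computes
query_dense, doc_dense = np.array([0.1, 0.3, 0.5]), np.array([0.2, 0.1, 0.4])
sparse_contribution = 0.8 * 1.2  # one token shared by query and document
hybrid_score = np.dot(query_dense, doc_dense) + sparse_contribution
print(hybrid_score)  # 1.21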

# load pandas DataFrame with pre-generated embeddings if you
# didn't generate them in the last step
recipes = pd.read_pickle("recipes_with_vectors.pkl")

# if you need to delete an existing index
pc.delete_index("index-name")

# create a new index
pc.create_index(
    name="recipe-project",
    dimension=768,  # adjust if needed
    metric="dotproduct",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-west-2"
    )
)

pc.describe_index("recipe-project")

Congratulations on creating your first Pinecone index! If the embedding model you used creates vectors with a different number of dimensions, make sure to adjust the dimension argument when creating the index. Now it’s time to upload the embedded data to the newly created Pinecone index.

# upsert to pinecone in batches
def sparse_to_dict(data):
    # convert {token_id: weight} to Pinecone's sparse vector format
    dict_ = {"indices": list(data.keys()),
             "values": list(data.values())}
    return dict_

batch_size = 100
index = pc.Index("recipe-project")

for i in tqdm(range(0, len(recipes), batch_size)):
    i_end = min(i + batch_size, len(recipes))
    meta_batch = recipes.iloc[i: i_end][["ID", "recipe_type"]]
    meta_dict = meta_batch.to_dict(orient="records")

    sparse_batch = recipes.iloc[i: i_end]["sparse_vectors"].apply(lambda x: sparse_to_dict(x))
    dense_batch = recipes.iloc[i: i_end]["dense_vectors"]

    upserts = []

    ids = [str(x) for x in range(i, i_end)]
    for id_, meta, sparse_, dense_ in zip(ids, meta_dict, sparse_batch, dense_batch):
        upserts.append({
            "id": id_,
            "sparse_values": sparse_,
            "values": dense_,
            "metadata": meta
        })

    index.upsert(upserts)

index.describe_index_stats()

If you are curious about what the uploaded data looks like, log in to Pinecone, select the newly created index, and have a look at its items. For now, we don’t need to pay attention to the score, as it is generated by default and indicates the match with a vector randomly generated by Pinecone. Later, however, we will calculate the similarity of the embedded user query with all items in the vector database and retrieve the k most similar items. Further, each item contains the item ID we assigned during the upsert and the metadata, which consists of the recipe ID and its recipe_type. The dense embeddings are stored in Values and the sparse embeddings in Sparse Values.

The first three items of the index (Image by author)

We can fetch the same information using the Pinecone Python SDK. Let’s have a look at the stored information of the item with ID 50.

index.fetch(ids=["50"])

As in the Pinecone dashboard, we get the item ID of the element, its metadata, the sparse values, and the dense values, which are stored in the list at the bottom of the truncated output.
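As a preview of the second part, here is a sketch of what a hybrid query against this index could look like, combining the scaled query vectors with a metadata filter on recipe_type, the hard filter mentioned at the beginning. It assumes query_dense and query_sparse hold the embedded user query, scaled for example with the hybrid_scale sketch from above.

# sketch only, Part II covers this in detail: hybrid query with a hard
# metadata filter on recipe_type
result = index.query(
    vector=query_dense,
    sparse_vector=query_sparse,
    filter={"recipe_type": {"$eq": "vegan"}},  # hard filter
    top_k=3,
    include_metadata=True,
)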
