Retrieval Augmented Generation (RAG) Inference Engines with LangChain on CPUs | by Eduardo Alvarez | Dec, 2023



To follow along with the following hands-on example, create a free account on the Intel Developer Cloud and navigate to the “Training and Workshops” page. Under the Gen AI Essentials section, select the “Retrieval Augmented Generation (RAG) with LangChain” option. Follow the instructions on the webpage to launch a JupyterLab window and automatically load the notebook with all of the sample code.

The notebook includes detailed docstrings and descriptions of the code. This article will discuss the high-level mechanics while providing context for specific functions.

Setting up Dependencies

We start by installing all of the required packages into the base environment. You’re welcome to create your own conda environment, but installing into the base environment is a quick and easy way to get started.

import sys
import os
!{sys.executable} -m pip install langchain==0.0.335 --no-warn-script-location > /dev/null
!{sys.executable} -m pip install pygpt4all==1.1.0 --no-warn-script-location > /dev/null
!{sys.executable} -m pip install gpt4all==1.0.12 --no-warn-script-location > /dev/null
!{sys.executable} -m pip install transformers==4.35.1 --no-warn-script-location > /dev/null
!{sys.executable} -m pip install datasets==2.14.6 --no-warn-script-location > /dev/null
!{sys.executable} -m pip install tiktoken==0.4.0 --no-warn-script-location > /dev/null
!{sys.executable} -m pip install chromadb==0.4.15 --no-warn-script-location > /dev/null
!{sys.executable} -m pip install sentence_transformers==2.2.2 --no-warn-script-location > /dev/null

These commands will install all the necessary packages into your base environment.

The Data and Model

We will be using a quantized version of Falcon 7B (gpt4all-falcon-q4_0) from the GPT4All project. You can learn more about this model in the “Model Explorer” section of the GPT4All page. The model has been stored on disk to simplify the model access process.

The following logic downloads the selected dataset from a Hugging Face project called FunDialogues. The selected data will be passed through an embedding model and placed in our vector database in a subsequent step.

def download_dataset(self, dataset):
    """
    Downloads the specified dataset and saves it to the data path.

    Parameters
    ----------
    dataset : str
        The name of the dataset to be downloaded.
    """
    self.data_path = dataset + '_dialogues.txt'

    if not os.path.isfile(self.data_path):

        datasets = {"robot maintenance": "FunDialogues/customer-service-robot-support",
                    "basketball coach": "FunDialogues/sports-basketball-coach",
                    "physics professor": "FunDialogues/academia-physics-office-hours",
                    "grocery cashier": "FunDialogues/customer-service-grocery-cashier"}

        # Download the dialogue from Hugging Face
        dataset = load_dataset(f"{datasets[dataset]}")
        # Convert the dataset to a pandas dataframe
        dialogues = dataset['train']
        df = pd.DataFrame(dialogues, columns=['id', 'description', 'dialogue'])
        # Print the first 5 rows of the dataframe
        df.head()
        # Only keep the dialogue column
        dialog_df = df['dialogue']

        # Save the data to a txt file
        dialog_df.to_csv(self.data_path, sep=' ', index=False)
    else:
        print('data already exists in path.')

In the code snippet above, you can select from 4 different synthetic datasets:

  • Robot Maintenance: conversations between a technician and a customer support agent while troubleshooting a robot arm.
  • Basketball Coach: conversations between basketball coaches and players during a game.
  • Physics Professor: conversations between students and a physics professor during office hours.
  • Grocery Cashier: conversations between a grocery store cashier and customers.
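
For reference, downloading one of these options might look like the call below. The `rag_demo` instance name is an assumption for this sketch; in the notebook, the helper class is already instantiated and wired up for you.

# 'rag_demo' is a hypothetical instance of the notebook's helper class
rag_demo.download_dataset("robot maintenance")  # saves 'robot maintenance_dialogues.txt' to disk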

Configuring the Model

The GPT4All extension in the LangChain API takes care of loading the model into memory and establishing a variety of parameters, such as:

  • model_path: Specifies the file path to the pre-trained model.
  • n_threads: Sets the number of threads to use, which can influence parallel processing and inference speed. This is especially relevant for multi-core systems.
  • max_tokens: Limits the number of tokens (words or subwords) in the input or output sequences, ensuring that the data fed into or produced by the model does not exceed this length.
  • repeat_penalty: Penalizes repetitive content in the model’s output. A value greater than 1.0 discourages the model from generating repeated sequences.
  • n_batch: Specifies the batch size for processing data. This can help optimize processing speed and memory usage.
  • top_k: Defines the “top-k” sampling strategy during the model’s generation. When generating text, the model will consider only the k most probable next tokens.
  • temp: Sets the sampling temperature; lower values make the generated output more deterministic.

def load_model(self, n_threads, max_tokens, repeat_penalty, n_batch, top_k, temp):
    """
    Loads the model with specified parameters for parallel processing.

    Parameters
    ----------
    n_threads : int
        The number of threads for parallel processing.
    max_tokens : int
        The maximum number of tokens for model prediction.
    repeat_penalty : float
        The penalty for repeated tokens in generation.
    n_batch : int
        The number of batches for processing.
    top_k : int
        The number of top k tokens to be considered in sampling.
    temp : float
        The sampling temperature for generation.
    """
    # Callbacks support token-wise streaming
    callbacks = [StreamingStdOutCallbackHandler()]
    # Verbose is required to pass to the callback manager
    self.llm = GPT4All(model=self.model_path, callbacks=callbacks, verbose=False,
                       n_threads=n_threads, n_predict=max_tokens, repeat_penalty=repeat_penalty,
                       n_batch=n_batch, top_k=top_k, temp=temp)
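
A call to this method might look like the following. The parameter values are illustrative starting points, not tuned recommendations, and `rag_demo` is the same assumed instance as above.

# illustrative values -- tune n_threads to your core count and the rest to your quality/latency needs
rag_demo.load_model(n_threads=32, max_tokens=50, repeat_penalty=1.20,
                    n_batch=128, top_k=1, temp=0.0)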

Building the Vector Database with ChromaDB

The Chroma vector database is an integral part of our RAG setup, where we store and manage our data efficiently. Here’s how we build it:

def build_vectordb(self, chunk_size, overlap):
    """
    Builds a vector database from the dataset for retrieval purposes.

    Parameters
    ----------
    chunk_size : int
        The size of text chunks for vectorization.
    overlap : int
        The overlap size between chunks.
    """
    loader = TextLoader(self.data_path)
    # Text splitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    # Embed the document and store it in Chroma DB
    self.index = VectorstoreIndexCreator(embedding=HuggingFaceEmbeddings(),
                                         text_splitter=text_splitter).from_loaders([loader])
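
A typical call might look like this. The chunking values are illustrative; smaller chunks tend to produce more precise matches, while larger chunks carry more surrounding context into each retrieved snippet.

# illustrative chunking values -- experiment with both to see the effect on retrieval quality
rag_demo.build_vectordb(chunk_size=500, overlap=50)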

Executing the Retrieval Mechanism

Upon receiving a user’s query, we perform a similarity search against our vector DB to find the most similar data. Once the top k matching results are found, they are retrieved and used to add context to the user’s query. We use the PromptTemplate function to build a template and embed the user’s query alongside the retrieved context. Once the template has been populated, we move on to the inference component.

def retrieval_mechanism(self, user_input, top_k=1, context_verbosity=False, rag_off=False):
    """
    Retrieves relevant document snippets based on the user's query.

    Parameters
    ----------
    user_input : str
        The user's input or query.
    top_k : int, optional
        The number of top results to return, by default 1.
    context_verbosity : bool, optional
        If True, additional context information is printed, by default False.
    rag_off : bool, optional
        If True, disables the retrieval-augmented generation, by default False.
    """

    self.user_input = user_input
    self.context_verbosity = context_verbosity

    # Perform a similarity search and retrieve the context from our documents
    results = self.index.vectorstore.similarity_search(self.user_input, k=top_k)
    # Join all context information into one string
    context = "\n".join([document.page_content for document in results])
    if self.context_verbosity:
        print("Retrieving information related to your question...")
        print(f"Found this content which is most similar to your question: {context}")

    if rag_off:
        template = """Question: {question}
        Answer: This is the response: """
        self.prompt = PromptTemplate(template=template, input_variables=["question"])
    else:
        template = """Don't just repeat the following context, use it in combination with your knowledge to improve your answer to the question: {context}

        Question: {question}
        """
        self.prompt = PromptTemplate(template=template, input_variables=["context", "question"]).partial(context=context)
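
To make the template mechanics more concrete, here is a small standalone sketch of how PromptTemplate’s partial method pre-fills the retrieved context so that only the question remains to be supplied by the chain at inference time. The context string below is a made-up placeholder.

from langchain.prompts import PromptTemplate

template = """Don't just repeat the following context, use it in combination with your knowledge to improve your answer to the question: {context}

Question: {question}
"""
# .partial() locks in the retrieved context; {question} is filled in later by the chain
prompt = PromptTemplate(template=template, input_variables=["context", "question"]).partial(
    context="Technician: First, check that the power cable to the robot arm is firmly seated.")
print(prompt.format(question="My robot is not turning on, can you help me?"))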

The LangChain LLMChain utility is used to execute inference based on the query passed by the user and the configured template. The result is returned to the user.

def inference(self):
    """
    Performs inference to generate a response based on the user's query.

    Returns
    -------
    str
        The generated response.
    """

    if self.context_verbosity:
        print(f"Your Query: {self.prompt}")

    llm_chain = LLMChain(prompt=self.prompt, llm=self.llm)
    print("Processing the information with gpt4all...\n")
    response = llm_chain.run(self.user_input)

    return response
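
With the prompt built by retrieval_mechanism, a query can then be answered end to end. The following is a minimal sketch under the same assumptions as the earlier snippets (the `rag_demo` instance and parameter values are illustrative):

# build the prompt from the top 2 retrieved snippets, then generate the answer
rag_demo.retrieval_mechanism(user_input="My robot is not turning on, can you help me?", top_k=2)
print(rag_demo.inference())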

Interactive Experimentation

To help you get started quickly, the notebook includes integrated ipywidget components. You must run all the cells in the notebook to enable these components. We encourage you to adjust the parameters and evaluate the impact on the latency and fidelity of the system’s response. Remember, this is just a starting point and a basic demonstration of RAG’s capabilities.
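
If you want to recreate a similar interactive loop outside the provided notebook, a minimal ipywidgets sketch along these lines would work; this is an assumption about how such a widget could be wired up, not the notebook’s actual widget code.

import ipywidgets as widgets
from ipywidgets import interact_manual

# expose a few knobs and re-run retrieval + inference on demand
@interact_manual(question="My robot is not turning on, can you help me?",
                 top_k=widgets.IntSlider(min=1, max=4, value=2),
                 rag_off=False)
def ask(question, top_k, rag_off):
    rag_demo.retrieval_mechanism(user_input=question, top_k=top_k, rag_off=rag_off)
    print(rag_demo.inference())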

Figure 10. In this example, we get a quick taste of the power of RAG. The additional context clearly yields a useful answer to the user’s question, “My robot is not turning on, can you help me?” The RAG-enabled output provides valid recommendations, while the raw model without RAG simply responds with a polite inquiry that is not very helpful to the user.

No one wants to interact with slow, unstable chatbots that respond with bogus information. There is a plethora of technical stack combinations that help developers avoid building systems that yield terrible user experiences. In this article, we have examined how inference engine quality shapes the user experience, from the perspective of a stack that delivers scale, fidelity, and latency benefits. The combination of RAG, CPUs, and model optimization techniques checks all corners of the IQ Triangle (Figure 3), aligning well with the needs of operational LLM-based AI chat applications.

A few exciting things to try would be:

  • Edit the prompt template found in the retrieval_mechanism method to engineer better prompts in tandem with the retrieved context (see the sketch after this list).
  • Adjust the various model and RAG-specific parameters and evaluate the impact on inference latency and response quality.
  • Add new datasets that are meaningful to your domain and test the viability of using RAG to build your AI chat-based applications.
  • This example’s model (gpt4all-falcon-q4_0) is not optimized for Xeon processors. Explore using models that are optimized for CPU platforms and evaluate the inference latency benefits.
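
As an example of the first suggestion, a hypothetical rewrite of the RAG prompt template might look like the following; this is just one direction to explore, not a recommended default.

# a hypothetical variant of the RAG template used in retrieval_mechanism
template = """You are a helpful support assistant. Use the context below only when it is relevant,
and say so when it does not answer the question.

Context: {context}

Question: {question}
Answer: """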

Thank you for reading! Don’t forget to follow my profile for more articles like this!
