Given the complexity of our table, the agent might not fully understand the information in the database by only examining the schema and a sample row. For example, the agent should recognise that a query regarding cars equates to searching the ‘category’ column for the value ‘Automobiles et moyens de déplacement’ (i.e., ‘Automobiles and means of transportation’). Therefore, additional tools are necessary to provide the agent with more context about the database.
Here’s a breakdown of the extra tools we plan to use:
get_categories_and_sub_categories: This tool is designed to help the agent fetch a list of distinct items from thecategoryandsub_categorycolumns. This approach is effective due to the relatively low number of unique values within these columns. If the columns contained hundreds or thousands of unique values, it might be better to use a retrieval tool. In such cases, when a user asks about a category, the agent could look for the most similar categories in a vector database, which stores embeddings of various values. The agent would then use these categories for its SQL queries. However, given that ourcategoryandsub_categorycolumns don’t have a wide range of unique values, we’ll simply return the list directly.
from langchain.tools import tool, Toolimport ast
import json
from sql_agent.sql_db import db
def run_query_save_results(db, query):
res = db.run(query)
res = [el for sub in ast.literal_eval(res) for el in sub]
return res
def get_categories(query: str) -> str:
"""
Useful to get categories and sub_categories. A json is returned where the key can be category or sub_category,
and the value is a list of unique itmes for either both.
"""
sub_cat = run_query_save_results(
db, "SELECT DISTINCT sous_categorie_de_produit FROM rappel_conso_table"
)
cat = run_query_save_results(
db, "SELECT DISTINCT categorie_de_produit FROM rappel_conso_table"
)
category_str = (
"List of unique values of the categorie_de_produit column : \n"
+ json.dumps(cat, ensure_ascii=False)
)
sub_category_str = (
"\n List of unique values of the sous_categorie_de_produit column : \n"
+ json.dumps(sub_cat, ensure_ascii=False)
)
return category_str + sub_category_str
get_columns_descriptions: Since we can’t feed the columns descriptions in the schema directly, we created an extra tool that returns short description for every ambiguous column. Some examples include:
"reference_fiche": "primary key of the database and unique identifier in the database. ",
"nom_de_la_marque_du_produit": "A string representing the Name of the product brand. Example: Apple, Carrefour, etc ... When you filter by this column,you must use LOWER() function to make the comparison case insensitive and you must use LIKE operator to make the comparison fuzzy.",
"noms_des_modeles_ou_references": "Names of the models or references. Can be used to get specific infos about the product. Example: iPhone 12, etc, candy X, product Y, bread, butter ...",
"identification_des_produits": "Identification of the products, for example the sales lot.",
def get_columns_descriptions(query: str) -> str:
"""
Useful to get the description of the columns in the rappel_conso_table table.
"""
return json.dumps(COLUMNS_DESCRIPTIONS)
get_today_date: tool that retrieves today’s date using python datetime library. The agent will use this tool when asked about temporality. For example: “What are the recalled products since last week ?”
from datetime import datetimedef get_today_date(query: str) -> str:
"""
Useful to get the date of today.
"""
# Getting today's date in string format
today_date_string = datetime.now().strftime("%Y-%m-%d")
return today_date_string
Finally we create a list of all these tools and we feed it to the create_sql_agent function. For every tool we must define a unique name within the set of tools provided to the agent. The description is optional but is very recommended as it can be used to provide more information.
def sql_agent_tools():
tools = [
Tool.from_function(
func=get_categories,
name="get_categories_and_sub_categories",
description="""
Useful to get categories and sub_categories. A json is returned where the key can be category or sub_category,
and the value is a list of unique items for either both.
""",
),
Tool.from_function(
func=get_columns_descriptions,
name="get_columns_descriptions",
description="""
Useful to get the description of the columns in the rappel_conso_table table.
""",
),
Tool.from_function(
func=get_today_date,
name="get_today_date",
description="""
Useful to get the date of today.
""",
),
]
return tools
extra_tools = sql_agent_tools()agent = create_sql_agent(
llm=llm_agent,
toolkit=toolkit,
agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
extra_tools=extra_tools,
verbose=True,
)
Sometimes, the tool descriptions aren’t enough for the agent to understand when to use them. To address this, we can change the ending part of the agent LLM prompt, known as the suffix. In our setup, the prompt has three sections:
- Prefix: This is a string placed before the tool list. We’re sticking with the default prefix, which instructs the agent on how to create and execute SQL queries in response to user questions, set a limit on result numbers to 10 , check the queries carefully, and avoid making changes to the database.
- The list of tools: This part lists out all the tools that the agent has at its disposal.
- Suffix: This is the part where we give the agent directions on how to process and think about the user’s question.
Here’s the default suffix for the SQL ReAct agent in Langchain:
SQL_SUFFIX = """Begin!Question: input
Thought: I should look at the tables in the database to see what I can query. Then I should query the schema of the most relevant tables.
agent_scratchpad"""
input and agent_scratchpad are two placeholders. input represents the user’s query and agent_scratchpad will represent the history of tool invocations and the corresponding tool outputs.
We can make the “Thought” part longer to give more instructions on which tools to use and when:
CUSTOM_SUFFIX = """Begin!Question: input
Thought Process: It is imperative that I do not fabricate information not present in the database or engage in hallucination;
maintaining trustworthiness is crucial. If the user specifies a category, I should attempt to align it with the categories in the `categories_produits`
or `sous_categorie_de_produit` columns of the `rappel_conso_table` table, utilizing the `get_categories` tool with an empty string as the argument.
Next, I will acquire the schema of the `rappel_conso_table` table using the `sql_db_schema` tool.
Utilizing the `get_columns_descriptions` tool is highly advisable for a deeper understanding of the `rappel_conso_table` columns, except for straightforward tasks.
When provided with a product brand, I will search in the `nom_de_la_marque_du_produit` column; for a product type, in the `noms_des_modeles_ou_references` column.
The `get_today_date` tool, requiring an empty string as an argument, will provide today's date.
In SQL queries involving string or TEXT comparisons, I must use the `LOWER()` function for case-insensitive comparisons and the `LIKE` operator for fuzzy matching.
Queries for currently recalled products should return rows where `date_de_fin_de_la_procedure_de_rappel` (the recall's ending date) is null or later than today's date.
When presenting products, I will include image links from the `liens_vers_les_images` column, formatted strictly as: [lien vers l'image] url1, [lien vers l'image] url2 ... Preceded by the mention in the query's language "here is(are) the image(s) :"
Additionally, the specific recalled product lot will be included from the `identification_des_produits` column.
My final response must be delivered in the language of the user's query.
agent_scratchpad
"""
This way, the agent doesn’t just know what tools it has but also gets better guidance on when to use them.
Now let’s modify the arguments for the create_sql_agent to account for new suffix:
agent = create_sql_agent(
llm=llm_agent,
toolkit=toolkit,
agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
suffix=CUSTOM_SUFFIX,
extra_tools=agent_tools,
verbose=True,
)
Another option we considered was to include the instructions in the prefix. However, our empirical observations indicated that this had little to no impact on the final response. Therefore, we chose to retain the instructions in the suffix. Conducting a more extensive evaluation of the model outputs could be beneficial for a detailed comparison of the two approaches.