A Multi-Agent SQL Assistant You Can Trust with Human-in-Loop Checkpoint & LLM Cost Control



Have you ever thought about building your own AI agents? Are you constantly overwhelmed by all the buzzwords around agents? You’re not alone; I’ve been there too. There are numerous tools available, and even figuring out which one to choose can feel like a project in itself. On top of that, there is uncertainty around cost and infrastructure: Will I consume too many tokens? How and where can I deploy my solution?

For a while, I also hesitated to build something on my own. I needed to understand the basics first, see a few examples of how things work, and then get some hands-on experience to bring those concepts to life. After lots of research, I finally landed on CrewAI, and it turned out to be the perfect starting point. There are two great courses offered by DeepLearning.AI: Multi AI Agent Systems with crewAI and Practical Multi AI Agents and Advanced Use Cases with crewAI. In these courses, the instructor clearly explains everything you need to know about AI agents to get started, and the more than 10 case studies with code provided serve as a good starting point.

It’s not enough to just learn about things anymore; if you don’t apply what you have learned, you are likely to forget the basics over time. And simply re-running the use cases from the course isn’t really “applying”. I had to build and implement something for myself, so I chose a use case closely related to my day-to-day work. As a data analyst and engineer, I mostly work with Python and SQL, and I thought how cool it would be to build an assistant that generates SQL queries from natural language. I agree there are already plenty of out-of-the-box solutions on the market; I’m not trying to reinvent the wheel here. With this POC, I want to learn how such systems are built and what their potential limitations are. What I’m trying to uncover is what it takes to build such an assistant.

Screenshot of the Demo App (by Author)

In this post, I’ll walk you through how I used CrewAI and Streamlit to build a Multi-Agent SQL Assistant that lets users query a SQLite database using natural language. To have more control over the entire process, I’ve also incorporated a human-in-loop checkpoint, and I display the LLM usage cost for every query. Once the assistant generates a query, the user has three options: accept and continue if the query looks good, ask the assistant to try again if the query seems off, or abort the whole process if it’s not working well. Having this checkpoint makes a huge difference: it gives more power to the user, avoids executing bad queries, and helps save LLM costs in the long run.

You can find the entire code repository here. Below is the complete project structure:

SQL Assistant Crew Project Structure
===================================

.
├── app.py (Streamlit UI)
├── main.py (terminal)
├── crew_setup.py
├── config
│   ├── agents.yaml
│   └── tasks.yaml
├── data
│   └── sample_db.sqlite
└── utils
    ├── db_simulator.py
    └── helper.py
Created by Author using https://app.diagrams.net/

The Agent Architecture (my CrewAI team)

For my SQL Assistant system, I needed at least 3 basic agents to handle the entire process efficiently:

  1. Query Generator Agent converts the user’s natural language question into a SQL query, using the database schema as context.
  2. Query Reviewer Agent takes the SQL query produced by the generator and optimizes it for accuracy and efficiency.
  3. Compliance Checker Agent checks the query for potential PII exposure and delivers a verdict on whether the query is compliant.

Every agent must have three core attributes: a role (what the agent is supposed to be), a goal (the agent’s mission), and a backstory (the personality that guides how the agent should behave). I have enabled verbose=True to view each agent’s internal thought process, and I’m using openai/gpt-4o-mini as the underlying language model for all my agents. After a lot of trial and error, I set temperature=0.2 to reduce agent hallucinations; lower temperatures make the model more deterministic and produce predictable outputs (like SQL queries in my case). There are many other parameters available to tune, such as max_tokens (to limit the length of the response), top_p (for nucleus sampling), and allow_delegation (to let an agent delegate tasks to other agents). If you are using some other LLM, you can simply specify its model name here, and you can set the same LLM for all agents or different ones as your requirements dictate.
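If you prefer to configure an agent directly in Python instead of YAML, a minimal sketch looks like the following (this assumes a recent CrewAI version that ships the LLM wrapper; older releases may configure the model differently):

from crewai import Agent, LLM

# Shared LLM configuration: a low temperature keeps the output deterministic,
# which suits structured artifacts like SQL queries
llm = LLM(
    model="openai/gpt-4o-mini",
    temperature=0.2,
    # max_tokens=1024,  # optionally cap the response length
)

query_generator_agent = Agent(
    role="Senior Data Analyst",
    goal="Translate natural language requests into accurate and efficient SQL queries",
    backstory="You are an experienced analyst who knows SQL best practices.",
    llm=llm,
    verbose=True,            # print the agent's internal thought process
    allow_delegation=False,  # keep the agent focused on its own task
)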

Below is the YAML file with the agent definitions:

query_generator_agent:
  role: Senior Data Analyst
  goal: Translate natural language requests into accurate and efficient SQL queries
  backstory: >
        You are an experienced analyst who knows SQL best practices. You work with stakeholders to gather requirements
        and turn their questions into clear, performant queries. You prefer readable SQL with appropriate filters and joins.
  allow_delegation: False
  verbose: True
  model: openai/gpt-4o-mini
  temperature: 0.2

query_reviewer_agent:
  role: SQL Code Reviewer
  goal: Critically evaluate SQL for correctness, performance, and clarity
  backstory: >
        You are a meticulous reviewer of SQL code. You identify inefficiencies, bad practices, and logical errors, and
        provide suggestions to improve the query's performance and readability.
  allow_delegation: False
  verbose: True
  model: openai/gpt-4o-mini
  temperature: 0.2

compliance_checker_agent:
  role: Data Privacy and Governance Officer
  goal: Ensure SQL queries follow data compliance rules and avoid PII exposure
  backstory: >
        You are responsible for ensuring queries do not leak or expose personally identifiable information (PII) or
        violate company policies. You flag any unsafe or non-compliant practices.
  allow_delegation: False
  verbose: True
  model: openai/gpt-4o-mini
  temperature: 0.2

Once you are done creating your agents, the next step is to define the tasks they should perform. Every task must have a clear description of what the agent is supposed to do. It’s highly recommended that you also set the expected_output parameter to shape the final response of the LLM; it’s a way of telling the LLM exactly the kind of answer you expect, whether that’s text, a number, a query, or even an article. The description has to be as detailed and concrete as possible: vague descriptions will only result in vague or even completely wrong outputs. I had to modify the descriptions multiple times during testing to tune the quality of the responses the agents generated. One of the features I love is the ability to inject dynamic inputs into the task descriptions using curly-brace placeholders ({}). These placeholders can hold user prompts, concepts, definitions, or even outputs of previous agents, all of which help the LLM generate more accurate results.

query_task:
  description: |
    You are an expert SQL assistant. Your job is to translate user requests into SQL queries using ONLY the tables and columns listed below.
    SCHEMA:
    {db_schema}
    USER REQUEST:
    {user_input}
    IMPORTANT:
    - First, list which tables and columns from the schema you will use to answer the request.
    - Then, write the SQL query.
    - Only use the tables and columns from the schema above.
    - If the request cannot be satisfied with the schema, return a SQL comment (starting with --) explaining why.
    - Do NOT invent tables or columns.
    - Make sure the query matches the user's intent as closely as possible.
  expected_output: First, a list of tables and columns to use. Then, a syntactically correct SQL query using appropriate filters, joins, and groupings.

review_task:
  description: |
    Review the following SQL query for correctness, performance, and readability: {sql_query} and verify that the query fits the schema: {db_schema}
    Ensure that only tables and columns from the provided schema are used.
    IMPORTANT:
    - First, only review the SQL query provided for correctness, performance, or readability
    - Do NOT invent new tables or columns.
    - If the Query is already correct, return it unchanged.
    - If the Query is not correct and cannot be fixed, return a SQL comment (starting with --) explaining why.
  expected_output: An optimized or verified SQL query

compliance_task:
  description: >
    Review the following SQL query for compliance violations, including PII access, unsafe usage, or policy violations.
    List any issues found, or state "No issues found" if the query is compliant.
    SQL Query: {reviewed_sqlquery}
  expected_output: >
    A markdown-formatted compliance report listing any flagged issues, or stating that the query is compliant. Include a clear verdict at the top (e.g., "Compliant" or "Issues found")
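These placeholders are filled in at kickoff time: the keys of the inputs dictionary passed to a crew’s kickoff() call must match the placeholder names in the task description. Here is a small usage sketch based on the crews assembled below:

# {user_input} and {db_schema} in query_task's description are replaced
# with the values supplied here
gen_output = sql_generator_crew.kickoff(inputs={
    "user_input": "Show me the top 5 products by total revenue for April 2024",
    "db_schema": db_schema,  # schema string fetched from the SQLite database
})
print(gen_output.pydantic.sqlquery)  # structured output via the SQLQuery Pydantic model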

It’s good practice to keep the agent and task definitions in separate YAML files: if you ever want to update them, you only need to modify the YAML files, not the codebase. Everything comes together in the crew_setup.py file. I load the agent and task configurations from their respective YAML files and define the expected outputs as Pydantic models to give them structure and validate what the LLM returns. I then assign the agents their respective tasks and assemble my crew. There are multiple ways to structure a crew depending on the use case: a single crew of agents can perform tasks in sequence or in parallel, or you can create multiple crews, each responsible for a specific part of your workflow. For my use case, I chose to build multiple crews so I would have more control over the execution flow, letting me insert a human-in-loop checkpoint and control costs.

from crewai import Agent, Task, Crew
from pydantic import BaseModel, Field
from typing import List
import yaml

# Define file paths for YAML configurations
files = {
    'agents': 'config/agents.yaml',
    'tasks': 'config/tasks.yaml',
}

# Load configurations from YAML files
configs = {}
for config_type, file_path in files.items():
    with open(file_path, 'r') as file:
        configs[config_type] = yaml.safe_load(file)

# Assign loaded configurations to specific variables
agents_config = configs['agents']
tasks_config = configs['tasks']

class SQLQuery(BaseModel):
    sqlquery: str = Field(..., description="The raw sql query for the user input")

class ReviewedSQLQuery(BaseModel):
    reviewed_sqlquery: str = Field(..., description="The reviewed sql query for the raw sql query")

class ComplianceReport(BaseModel):
    report: str = Field(..., description="A markdown-formatted compliance report with a verdict and any flagged issues.")

# Creating Agents
query_generator_agent = Agent(
  config=agents_config['query_generator_agent']
)

query_reviewer_agent = Agent(
  config=agents_config['query_reviewer_agent']
)

compliance_checker_agent = Agent(
  config=agents_config['compliance_checker_agent']
)

# Creating Tasks
query_task = Task(
  config=tasks_config['query_task'],
  agent=query_generator_agent,
  output_pydantic=SQLQuery
)

review_task = Task(
  config=tasks_config['review_task'],
  agent=query_reviewer_agent,
  output_pydantic=ReviewedSQLQuery
)

compliance_task = Task(
  config=tasks_config['compliance_task'],
  agent=compliance_checker_agent,
  context=[review_task],
  output_pydantic=ComplianceReport
)

# Creating Crew objects for import
sql_generator_crew = Crew(
    agents=[query_generator_agent],
    tasks=[query_task],
    verbose=True
)

sql_reviewer_crew = Crew(
    agents=[query_reviewer_agent],
    tasks=[review_task],
    verbose=True
)

sql_compliance_crew = Crew(
    agents=[compliance_checker_agent],
    tasks=[compliance_task],
    verbose=True
)

I set up a local SQLite database with some sample data to simulate real-life database interactions for my POC. I fetch the database schema, which comprises all the table and column names present in the system, and feed it as context to the LLM along with the original user prompt. This helps the LLM generate a SQL query using the actual tables and columns from the schema rather than inventing its own. Once the Generator agent creates a SQL query, it goes to the Reviewer agent for review, followed by a compliance check from the Compliance agent. Only after these reviews do I allow the reviewed query to be executed on the database and show the final results to the user via the Streamlit interface. By adding validation and safety checks, I ensure only high-quality queries are executed on the database, minimizing unnecessary token usage and compute costs in the long run.

import sqlite3
import pandas as pd

DB_PATH = "data/sample_db.sqlite"

def setup_sample_db():
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()

    # Drop tables if they exist (for repeatability in dev)
    cursor.execute("DROP TABLE IF EXISTS order_items;")
    cursor.execute("DROP TABLE IF EXISTS orders;")
    cursor.execute("DROP TABLE IF EXISTS products;")
    cursor.execute("DROP TABLE IF EXISTS customers;")
    cursor.execute("DROP TABLE IF EXISTS employees;")
    cursor.execute("DROP TABLE IF EXISTS departments;")

    # Create richer example tables
    cursor.execute("""
        CREATE TABLE products (
            product_id INTEGER PRIMARY KEY,
            product_name TEXT,
            category TEXT,
            price REAL
        );
    """)
    cursor.execute("""
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            name TEXT,
            email TEXT,
            country TEXT,
            signup_date TEXT
        );
    """)
    cursor.execute("""
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            order_date TEXT,
            total_amount REAL,
            FOREIGN KEY(customer_id) REFERENCES customers(customer_id)
        );
    """)
    cursor.execute("""
        CREATE TABLE order_items (
            order_item_id INTEGER PRIMARY KEY,
            order_id INTEGER,
            product_id INTEGER,
            quantity INTEGER,
            price REAL,
            FOREIGN KEY(order_id) REFERENCES orders(order_id),
            FOREIGN KEY(product_id) REFERENCES products(product_id)
        );
    """)
    cursor.execute("""
        CREATE TABLE employees (
            employee_id INTEGER PRIMARY KEY,
            name TEXT,
            department_id INTEGER,
            hire_date TEXT
        );
    """)
    cursor.execute("""
        CREATE TABLE departments (
            department_id INTEGER PRIMARY KEY,
            department_name TEXT
        );
    """)

    # Populate with mock data
    cursor.executemany("INSERT INTO products VALUES (?, ?, ?, ?);", [
        (1, 'Widget A', 'Widgets', 25.0),
        (2, 'Widget B', 'Widgets', 30.0),
        (3, 'Gadget X', 'Gadgets', 45.0),
        (4, 'Gadget Y', 'Gadgets', 50.0),
        (5, 'Thingamajig', 'Tools', 15.0)
    ])
    cursor.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?);", [
        (1, 'Alice', '[email protected]', 'USA', '2023-10-01'),
        (2, 'Bob', '[email protected]', 'Canada', '2023-11-15'),
        (3, 'Charlie', '[email protected]', 'USA', '2024-01-10'),
        (4, 'Diana', '[email protected]', 'UK', '2024-02-20')
    ])
    cursor.executemany("INSERT INTO orders VALUES (?, ?, ?, ?);", [
        (1, 1, '2024-04-03', 100.0),
        (2, 2, '2024-04-12', 150.0),
        (3, 1, '2024-04-15', 120.0),
        (4, 3, '2024-04-20', 180.0),
        (5, 4, '2024-04-28', 170.0)
    ])
    cursor.executemany("INSERT INTO order_items VALUES (?, ?, ?, ?, ?);", [
        (1, 1, 1, 2, 25.0),
        (2, 1, 2, 1, 30.0),
        (3, 2, 3, 2, 45.0),
        (4, 3, 4, 1, 50.0),
        (5, 4, 5, 3, 15.0),
        (6, 5, 1, 1, 25.0)
    ])
    cursor.executemany("INSERT INTO employees VALUES (?, ?, ?, ?);", [
        (1, 'Eve', 1, '2022-01-15'),
        (2, 'Frank', 2, '2021-07-23'),
        (3, 'Grace', 1, '2023-03-10')
    ])
    cursor.executemany("INSERT INTO departments VALUES (?, ?);", [
        (1, 'Sales'),
        (2, 'Engineering'),
        (3, 'HR')
    ])

    conn.commit()
    conn.close()

def run_query(query):
    try:
        conn = sqlite3.connect(DB_PATH)
        df = pd.read_sql_query(query, conn)
        conn.close()
        # Display only the first five rows of the result
        return df.head().to_string(index=False)
    except Exception as e:
        return f"Query failed: {e}"

def get_db_schema(db_path):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    schema = ""
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    tables = cursor.fetchall()
    for table_name, in tables:
        cursor.execute(f"SELECT sql FROM sqlite_master WHERE type='table' AND name='{table_name}';")
        create_stmt = cursor.fetchone()[0]
        schema += create_stmt + ";\n\n"
    conn.close()
    return schema

def get_structured_schema(db_path):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    tables = cursor.fetchall()
    lines = ["Available tables and columns:"]
    for table_name, in tables:
        cursor.execute(f"PRAGMA table_info({table_name})")
        columns = [row[1] for row in cursor.fetchall()]
        lines.append(f"- {table_name}: {', '.join(columns)}")
    conn.close()
    return '\n'.join(lines)

if __name__ == "__main__":
    setup_sample_db()
    print("Sample database created.")

LLMs charge by tokens, which are small fragments of text. Every LLM has a pricing model based on the number of input and output tokens, typically billed per million tokens. For a complete pricing list of all OpenAI models, refer to their official pricing page here. For gpt-4o-mini, input tokens cost $0.15/M while output tokens cost $0.60/M. To compute the total cost of an LLM request, I created the helper functions below in helper.py, which calculate the total cost from the token usage of a request.

import re

def extract_token_counts(token_usage_str):
    prompt = completion = 0
    prompt_match = re.search(r'prompt_tokens=(\d+)', token_usage_str)
    completion_match = re.search(r'completion_tokens=(\d+)', token_usage_str)
    if prompt_match:
        prompt = int(prompt_match.group(1))
    if completion_match:
        completion = int(completion_match.group(1))
    return prompt, completion

def calculate_gpt4o_mini_cost(prompt_tokens, completion_tokens):
    input_cost = (prompt_tokens / 1000) * 0.00015
    output_cost = (completion_tokens / 1000) * 0.0006
    return input_cost + output_cost
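As a quick sanity check of the math: a request that consumes 2,000 prompt tokens and 500 completion tokens costs (2,000 / 1,000) × $0.00015 + (500 / 1,000) × $0.0006 = $0.0003 + $0.0003 = $0.0006.

# Worked example: 2,000 prompt tokens + 500 completion tokens
cost = calculate_gpt4o_mini_cost(2000, 500)
print(f"${cost:.6f}")  # $0.000600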

The app.py file creates a Streamlit application that lets the user query the SQLite database using natural language. Behind the scenes, my crew of CrewAI agents is set in motion. After the first agent generates a SQL query, it is displayed in the app, and the user has three options:

  • Confirm & Review — if the user finds the query acceptable and wants to continue
  • Try Again — if the user is not satisfied with the query and wants the agent to generate a new query again
  • Abort — if the user wants to stop the process here

Along with the above options, the LLM cost incurred for the request so far is shown on the screen. Once the user clicks the “Confirm and Review” button, the SQL query goes through two more levels of review: the Reviewer agent optimizes it for correctness and efficiency, and the Compliance agent then checks it for compliance. If the query is compliant, it is executed on the SQLite database, and the final results, together with the cumulative LLM cost of the entire process, are displayed in the app. The user is not only in control throughout the process but also stays cost-conscious.

import streamlit as st
from crew_setup import sql_generator_crew, sql_reviewer_crew, sql_compliance_crew
from utils.db_simulator import get_structured_schema, run_query
import sqlparse
from utils.helper import extract_token_counts, calculate_gpt4o_mini_cost

DB_PATH = "data/sample_db.sqlite"

# Cache the schema, but allow clearing it
@st.cache_data(show_spinner=False)
def load_schema():
    return get_structured_schema(DB_PATH)

st.title("SQL Assistant Crew")

st.markdown("""
Welcome to the SQL Assistant Crew!  
This app lets you interact with your database using natural language. Simply type your data question or request (for example, "Show me the top 5 products by total revenue for April 2024"), and our multi-agent system will:
1. **Generate** a relevant SQL query for your request,
2. **Review** and optimize the query for correctness and performance,
3. **Check** the query for compliance and data safety,
4. **Execute** the query (if compliant) and display the results.

You can also refresh the database schema if your data changes.  
This tool is perfect for business users, analysts, and anyone who wants to query data without writing SQL by hand!
""")

st.write("The schema of the database is saved. If you believe the schema is incorrect, you can refresh it by clicking the button below.")
# Add a refresh button
if st.button("Refresh Schema"):
    load_schema.clear()  # Clear the cache so next call reloads from DB
    st.success("Schema refreshed from database.")

# Always get the (possibly cached) schema
db_schema = load_schema()

with st.expander("Show database schema"):
    st.code(db_schema)

st.write("Enter your request in natural language and let the crew generate, review, and check compliance for the SQL query.")

if "generated_sql" not in st.session_state:
    st.session_state["generated_sql"] = None
if "awaiting_confirmation" not in st.session_state:
    st.session_state["awaiting_confirmation"] = False
if "reviewed_sql" not in st.session_state:
    st.session_state["reviewed_sql"] = None
if "compliance_report" not in st.session_state:
    st.session_state["compliance_report"] = None
if "query_result" not in st.session_state:
    st.session_state["query_result"] = None
if "regenerate_sql" not in st.session_state:
    st.session_state["regenerate_sql"] = False
if "llm_cost" not in st.session_state:
    st.session_state["llm_cost"] = 0.0

user_prompt = st.text_input("Enter your request (e.g., 'Show me the top 5 products by total revenue for April 2024'):")

# Automatically regenerate SQL if 'Try Again' was clicked
if st.session_state.get("regenerate_sql"):
    if user_prompt.strip():
        try:
            gen_output = sql_generator_crew.kickoff(inputs={"user_input": user_prompt, "db_schema": db_schema})
            raw_sql = gen_output.pydantic.sqlquery
            st.session_state["generated_sql"] = raw_sql
            st.session_state["awaiting_confirmation"] = True
            st.session_state["reviewed_sql"] = None
            st.session_state["compliance_report"] = None
            st.session_state["query_result"] = None
            # LLM cost tracking
            token_usage_str = str(gen_output.token_usage)
            prompt_tokens, completion_tokens = extract_token_counts(token_usage_str)
            cost = calculate_gpt4o_mini_cost(prompt_tokens, completion_tokens)
            st.session_state["llm_cost"] += cost
            st.info(f"Your LLM cost so far: ${st.session_state['llm_cost']:.6f}")
        except Exception as e:
            st.error(f"An error occurred: {e}")
    else:
        st.warning("Please enter a prompt.")
    st.session_state["regenerate_sql"] = False

# Step 1: Generate SQL
if st.button("Generate SQL"):
    if user_prompt.strip():
        try:
            gen_output = sql_generator_crew.kickoff(inputs={"user_input": user_prompt, "db_schema": db_schema})
            # st.write(gen_output)  # Optionally keep for debugging
            raw_sql = gen_output.pydantic.sqlquery
            st.session_state["generated_sql"] = raw_sql
            st.session_state["awaiting_confirmation"] = True
            st.session_state["reviewed_sql"] = None
            st.session_state["compliance_report"] = None
            st.session_state["query_result"] = None
            # LLM cost tracking
            token_usage_str = str(gen_output.token_usage)
            prompt_tokens, completion_tokens = extract_token_counts(token_usage_str)
            cost = calculate_gpt4o_mini_cost(prompt_tokens, completion_tokens)
            st.session_state["llm_cost"] += cost
        except Exception as e:
            st.error(f"An error occurred: {e}")
    else:
        st.warning("Please enter a prompt.")

# Only show prompt and generated SQL when awaiting confirmation
if st.session_state.get("awaiting_confirmation") and st.session_state.get("generated_sql"):
    st.subheader("Generated SQL")
    formatted_generated_sql = sqlparse.format(st.session_state["generated_sql"], reindent=True, keyword_case='upper')
    st.code(formatted_generated_sql, language="sql")
    st.info(f"Your LLM cost so far: ${st.session_state['llm_cost']:.6f}")
    col1, col2, col3 = st.columns(3)
    with col1:
        if st.button("Confirm and Review"):
            try:
                # Step 2: Review SQL
                review_output = sql_reviewer_crew.kickoff(inputs={"sql_query": st.session_state["generated_sql"],"db_schema": db_schema})
                reviewed_sql = review_output.pydantic.reviewed_sqlquery
                st.session_state["reviewed_sql"] = reviewed_sql
                # LLM cost tracking for reviewer
                token_usage_str = str(review_output.token_usage)
                prompt_tokens, completion_tokens = extract_token_counts(token_usage_str)
                cost = calculate_gpt4o_mini_cost(prompt_tokens, completion_tokens)
                st.session_state["llm_cost"] += cost
                # Step 3: Compliance Check
                compliance_output = sql_compliance_crew.kickoff(inputs={"reviewed_sqlquery": reviewed_sql})
                compliance_report = compliance_output.pydantic.report
                # LLM cost tracking for compliance
                token_usage_str = str(compliance_output.token_usage)
                prompt_tokens, completion_tokens = extract_token_counts(token_usage_str)
                cost = calculate_gpt4o_mini_cost(prompt_tokens, completion_tokens)
                st.session_state["llm_cost"] += cost
                # Remove duplicate header if present
                lines = compliance_report.splitlines()
                if lines and lines[0].strip().lower().startswith("# compliance report"):
                    compliance_report = "\n".join(lines[1:]).lstrip()
                st.session_state["compliance_report"] = compliance_report
                # Only execute if compliant; a bare substring check on "compliant"
                # would also match "non-compliant", so rule those verdicts out first
                verdict = compliance_report.lower()
                if "compliant" in verdict and "non-compliant" not in verdict and "not compliant" not in verdict:
                    result = run_query(reviewed_sql)
                    st.session_state["query_result"] = result
                else:
                    st.session_state["query_result"] = None
                st.session_state["awaiting_confirmation"] = False
                st.info(f"Your LLM cost so far: ${st.session_state['llm_cost']:.6f}")
                st.rerun()
            except Exception as e:
                st.error(f"An error occurred: {e}")
    with col2:
        if st.button("Try Again"):
            st.session_state["generated_sql"] = None
            st.session_state["awaiting_confirmation"] = False
            st.session_state["reviewed_sql"] = None
            st.session_state["compliance_report"] = None
            st.session_state["query_result"] = None
            st.session_state["regenerate_sql"] = True
            st.rerun()
    with col3:
        if st.button("Abort"):
            st.session_state.clear()
            st.rerun()

# After review, only show reviewed SQL, compliance, and result
elif st.session_state.get("reviewed_sql"):
    st.subheader("Reviewed SQL")
    formatted_sql = sqlparse.format(st.session_state["reviewed_sql"], reindent=True, keyword_case='upper')
    st.code(formatted_sql, language="sql")
    st.subheader("Compliance Report")
    st.markdown(st.session_state["compliance_report"])
    if st.session_state.get("query_result"):
        st.subheader("Query Result")
        st.code(st.session_state["query_result"])
    # LLM cost display at the bottom
    st.info(f"Your LLM cost so far: ${st.session_state['llm_cost']:.6f}")

Here is a quick demo of the app in action. I asked it to display the top products based on total sales. The assistant generated a SQL query, and I clicked “Confirm and Review”. The query was already well optimized, so the Reviewer agent returned it unchanged. Next, the Compliance Checker agent reviewed the query and confirmed it was safe to run: no risky operations or exposure of sensitive data. After passing both reviews, the query was run against the sample database and the results were displayed. The LLM usage cost for the entire process was just $0.001349.

Demo of the App — Example 1 (by Author)

Here’s another example where I ask the app to identify which products have the most returns. However, there is no information about returns in the schema, so the assistant does not generate a query and instead explains why. Up to this stage, the LLM cost was $0.00853. Since there’s no point in reviewing or executing a non-existent query, I simply clicked “Abort” to end the process gracefully.

Demo of the App — Example 2 (by Author)

CrewAI is incredibly powerful for building multi-agent systems, and by pairing it with Streamlit you can easily put a simple interactive UI on top. In this POC, I explored how to add a human-in-loop element to maintain control and transparency throughout the workflow. I also tracked how many tokens were consumed at each step, helping the user stay cost-conscious during the process. With the help of a compliance agent, I enforced some basic safety measures by blocking risky or PII-exposing queries. I tuned the model’s temperature and iteratively refined the task descriptions to improve output quality and reduce hallucinations. Is it perfect? No. There are still times when the system hallucinates, and if I implemented this at scale, LLM cost would become a bigger concern. Real-life databases are complex, and their schemas can be huge. I would have to explore RAG (Retrieval-Augmented Generation) to feed only relevant schema snippets to the LLM, optimize agent memory, and use caching to avoid redundant API calls.
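As a rough illustration of that idea (this helper is not in the repo; it uses naive keyword overlap where a real RAG setup would use embedding-based retrieval), the schema could be filtered before it ever reaches the LLM:

def filter_schema_for_prompt(structured_schema: str, user_input: str) -> str:
    """Keep only the schema lines whose table or column names overlap
    with words from the user's request."""
    keywords = set(user_input.lower().split())
    kept = ["Available tables and columns:"]
    for line in structured_schema.splitlines():
        if not line.startswith("- "):
            continue
        # A line looks like "- orders: order_id, customer_id, order_date, ..."
        tokens = set(line.lower().replace(":", " ").replace(",", " ").split())
        if keywords & tokens:
            kept.append(line)
    return "\n".join(kept)

Feeding only the surviving lines into the {db_schema} placeholder would shrink the prompt and, with it, the per-request token cost.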

Final Thoughts

This was a fun project combining the power of LLMs, the practicality of Streamlit, and the modular intelligence of CrewAI. If you’re interested in building intelligent agents for data interaction, give it a try, or fork the repo and build on it!


Before you go…

Follow me so you don’t miss any new posts I write in the future; you will find more of my articles on my profile page. You can also connect with me on LinkedIn or X!
