From Local LLM to Tool-Using Agent

Editor
13 Min Read


a local LLM. Nice.

But after the first few chats, you might be wondering: what else can I do with it?

Well, how about making the local LLM agentic with some tool use?

In this post, we’ll explore how to turn a local LLM into a tool-using agent. Specifically, we’ll use

  • Gemma 4 model (edge-friendly variants) as our local LLM
  • Ollama for serving the local LLM
  • OpenAI Agents SDK for the agent runtime
  • Tavily web search MCP as one example of the external tool

We’ll build a mini deep research agent that can search the web, gather the evidence, and synthesize an answer with citations, given a user question.

By the end of the post, you’d have a working local deep research agent and a reusable implementation pattern for turning a local model into a local AI agent.

Figure 1. The architecture of the local agent. (Image by author)

If you are interested in a local coding-agent setup, I previously covered Gemma 4 + OpenCode. In this post, we focus on the more general pattern of connecting a local model to an agent runtime and external tools.


1. Set Up the Local Agent Stack

We need to prepare 4 pieces before we write the code: Ollama, Gemma 4 (specifically the Gemma 4 E4B model), OpenAI Agents SDK, and Tavily MCP.

First, let’s install Ollama.

On Windows, you can download the installer from the official Ollama website:

https://ollama.com/download

Or use winget in PowerShell:

winget install Ollama.Ollama

On Linux, Ollama can be installed with:

"curl -fsSL https://ollama.com/install.sh | sh"

After installation, please check:

ollama --version

On Windows, remember to launch Ollama from the Start menu. Once it is running, the local API endpoint is available.

Next, we pull the local model. Here, we use Gemma 4 E4B variant:

ollama pull gemma4:e4b

Gemma 4 has several variants. The E4B model is a good fit for our purpose, as it is designed with edge/local agentic workflows in mind. My machine has an NVIDIA RTX 2000 Ada Laptop GPU with about 8 GB VRAM. If your machine is more constrained, you can try the lighter E2B variant:

ollama pull gemma4:e2b

Next, we need the agent runtime library. For that, we use OpenAI Agents SDK:

pip install openai-agents

You would also need the OpenAI-compatible client:

pip install openai

Something to note here: later, we’ll point the client to Ollama’s local endpoint, so this does not mean we are sending model calls to OpenAI.

Finally, we need a Tavily MCP endpoint. In case you have not used it before, Tavily is a search API designed for LLM applications. In this post, we use its MCP server so the agent can search the web.

You’d need to first create a Tavily account and get an API key. On the Tavily platform, you can directly generate a MCP link with the following shape:

https://mcp.tavily.com/mcp/?tavilyApiKey=<your-api-key>

Now we are ready.

Using Tavily here is not a sponsored choice; it is used here as one convenient MCP tool, the same pattern can work with other MCP-compatible tools as well.

In fact, the whole stack here is not the only option. Instead of using Ollama, you could serve the local model with LM Studio or llama.cpp. Instead of Gemma 4 models, you can also try with other models from, e.g., Qwen family. For agent framework, we also have options from Google or Anthropic. You could also connect different MCP tools instead of Tavily. I use this combination simply because I am familiar with that stack. But the important takeaway in this case study is the general local agentic pattern.


2. Configure the Local Research Agent

With OpenAI Agents SDK, this is the final Agent object we need to compose:

from agents import Agent

agent = Agent(
    name="Local Research Agent",
    instructions=RESEARCH_AGENT_INSTRUCTIONS,
    model=model,
    mcp_servers=[tavily_server],
    mcp_config={"include_server_in_tool_names": True},
)

Let’s unpack each part.

2.1 The Model

First, the model.

from openai import AsyncOpenAI
from agents import OpenAIChatCompletionsModel

MODEL_NAME = "gemma4:e4b"
OLLAMA_BASE_URL = "http://localhost:11434/v1"

client = AsyncOpenAI(
    api_key="ollama",
    base_url=OLLAMA_BASE_URL,
)

model = OpenAIChatCompletionsModel(
    model=MODEL_NAME,
    openai_client=client,
)

We start by creating a client that points at Ollama’s local OpenAI-compatible endpoint.

Then, we use OpenAIChatCompletionsModel to wrap the Gemma model into a model object. This allows the Agents SDK to use that model inside the agent loop.

Note that the api_key="ollama" value is just a placeholder. Ollama doesn’t really need a real OpenAI API key. We use it because the client expects this field.

2.2 The Instruction

Next, we define the instruction for the agent with the desired research behavior:

from datetime import datetime

CURRENT_DATE = datetime.now().strftime("%B %d, %Y")

# Note that this instruction is iterated with AI
RESEARCH_AGENT_INSTRUCTIONS = f"""
[Role]
You are a concise research assistant.

[Task]
Answer the user's question by turning it into a small web research task. 
Use the current date when interpreting time-sensitive questions: {CURRENT_DATE}.

[Research behavior]
Start with one targeted search query.
For recommendation or comparison questions, complete this research loop before answering: 
first identify the main options, then search for comparison context, then synthesize a recommendation.

Use follow-up searches when the first results are insufficient, conflicting, or only cover part of the question.

Prefer relevant and credible sources, and track which source supports each important claim.

Before answering, check whether the gathered evidence is enough to support the conclusion.

[Expected output]
Give a direct answer first, then briefly explain the evidence behind it. 
Include source links for key factual claims.

[Rules]
Do not rely on memory for facts that may have changed.
Do not invent missing details.
Keep the answer concise.
""".strip()

2.3 The Tools

Now we equip the agent with the web search tool. In this case, we use the Tavily search engine through MCP:

from agents import Agent, Runner
from agents.mcp import MCPServerStreamableHttp

TAVILY_MCP_URL = "YOUR_TAVILY_MCP_URL"

async with MCPServerStreamableHttp(
    name="tavily",
    params={"url": TAVILY_MCP_URL},
) as tavily_server:
    tools = await tavily_server.list_tools()

    print("Available Tavily tools:")
    for tool in tools:
        description = (tool.description or "").replace("\n", " ")
        print(f"- {tool.name}: {description[:120]}")

    agent = Agent(
        name="Local Research Agent",
        instructions=RESEARCH_AGENT_INSTRUCTIONS,
        model=model,
        mcp_servers=[tavily_server],
        mcp_config={"include_server_in_tool_names": True},
    )

    result = await Runner.run(agent, RESEARCH_QUESTION, max_turns=MAX_TURNS)

This code block does three things:

  1. It opens a connection to Tavily’s MCP server with async with MCPServerStreamableHttp(...) as tavily_server: Once connected, Tavily would expose its available tools to the Agents SDK.
  2. We create the Agent object inside the MCP context. Note that we have mcp_servers=[tavily_server], which attaches Tavily’s MCP tools to the agent.
  3. We finally run the agent with result = await Runner.run(agent, RESEARCH_QUESTION, max_turns=MAX_TURNS). The context manager matters here because the MCP connection is only active inside the async with block.

mcp_config={"include_server_in_tool_names": True} is mainly for readability in the trace. Without it, the tool name will only appear as tavily_search. With it, the tool name will show as mcp_tavily__tavily_search. This makes it clearer that the tool call came through the Tavily MCP server.


3. Run a Research Question

Now that the agent is configured, let’s test it with one concrete question:

“Which June 23, 2026 World Cup match had the biggest group-stage stakes, and why?”

To inspect what happened, I print a compact trace:

def compact(value: object, limit: int = 220) -> str:
    text = str(value).replace("\n", " ")
    return text if len(text) <= limit else text[:limit] + "..."


for step, item in enumerate(result.new_items, start=1):
    raw_item = getattr(item, "raw_item", None)
    raw_type = getattr(raw_item, "type", "")
    raw_name = getattr(raw_item, "name", "")
    raw_output = getattr(raw_item, "output", "")

    print(
        f"{step:02d} | {type(item).__name__} | "
        f"{raw_type or raw_name} | {compact(raw_output or raw_item)}"
    )

In my run, the trace looked like this:

01 | ToolCallItem | function_call | ResponseFunctionToolCall(arguments='{"query":"World Cup 2026 group stage matches June 23, 2026 stakes"}', name='mcp_tavily__tavily_search', ...)
02 | ToolCallOutputItem |  | {'call_id': ..., 'output': ...}
03 | MessageOutputItem | message | ResponseOutputMessage(... final answer ...)

This allows us to see the agentic behavior directly. In this run, the local Gemma model decided to call the Tavily search tool, the Agents SDK executed that tool call, and passed the results back to the model. Then, the model produced the final answer.

To see the final response, we can print:

print(result.final_output)

This is what the agent produced:

The match with the biggest group-stage stakes on June 23, 2026, was Colombia vs. DR Congo.

Why:
According to FIFA reporting, this specific match was highlighted as a critical moment where Colombia advanced into the knockout phase of the tournament. 
The article notes that Daniel Munoz scored the first goal for Colombia during this Group K fixture, which directly contributed to their progression in the competition.

Evidence
- FIFA: An article titled "Colombia v Congo DR Group K FIFA World Cup 2026" specifically reports on a key moment from this match, stating that Munoz's goal helped fire Colombia into the knockout phase.
  Source: https://digitalhub.fifa.com/transform/450614d3-72d7-4c1f-85ff-ea0fbee6f28d/Colombia-v-Congo-DR-Group-K-FIFA-World-Cup-2026?focuspoint=0.51

- Yahoo Sports: Confirms the fixture and result for that date: Colombia defeated DR Congo.
  Source: https://sports.yahoo.com/soccer/article/2026-world-cup-results-standings-and-schedule-live-scores-group-stage-updates-and-how-to-watch-050724193.html

Notice that the agent only made one search round in this run, as the search results already contained enough evidence for the model to answer. For more complex questions, multiple rounds of search and reasoning would be necessary, and our current framework naturally supports that.


4. Wrapping Up

A local LLM doesn’t have to stay as a chat model.

In this post, we deployed a Gemma 4 E4B model locally through Ollama, then we put the model inside an agent runtime provided by OpenAI Agents SDK, and we gave the agent a web search tool so that it can find information online to answer users’ questions.

From here, you can easily extend this pattern with stronger research instructions or build a more explicit planning-reflection workflow, if you want to keep working in the direction of deep research, or you can connect the agent to more MCP tools for many other use cases.

Happy building!


Reference

Ollama: https://ollama.com/

Gemma model family: https://ai.google.dev/gemma

OpenAI Agents SDK: https://openai.github.io/openai-agents-python/

Agents SDK MCP docs: https://openai.github.io/openai-agents-python/mcp/

Tavily MCP docs: https://docs.tavily.com/documentation/mcp

Share this Article
Please enter CoinGecko Free Api Key to get this plugin works.