Why AI Still Can’t Solve Your Real Mathematical Optimization Problem

Contents

The Promise of AI-Powered Optimization
Where Existing Tools Break Down
Introducing ORPilot
Stage 1: Interview Agent
Stage 2: Data Collection Agent
Stage 3: Parameter Computation Agent
Stage 4: Code Generation Agent
Stage 5: Reporter Agent
Why This Order Matters
What This Looks Like at Scale
Getting Started

If you’ve ever tried to use AI to build a mathematical optimization model for a real business problem, you’ve probably run into the same wall: the AI works beautifully on textbook examples and falls apart the moment you hand it your actual data and your actual problem.

That gap isn’t a coincidence. It follows from how these tools are designed, and it’s the reason I built ORPilot.

The Promise of AI-Powered Optimization

Operations Research (OR) has been quietly powering some of the most impactful decisions in business for decades — routing delivery trucks, scheduling factory production, designing supply chains, allocating cargo to carriers. The math is mature and the solvers are excellent. The bottleneck has always been the human expertise required to translate a business problem into a mathematical model.

Large Language Models (LLMs) seemed like the perfect solution. A growing body of research, including the OptiMUS series, OR-LLM, and others, has shown that state-of-the-art LLMs can generate correct solver code for well-specified linear programming (LP) and mixed integer programming (MIP) problems. The results looked impressive. The demos were compelling.

Then you try to use one of these tools on a real problem, and the cracks appear immediately.

Where Existing Tools Break Down

Almost every LLM-for-OR tool built to date shares a hidden assumption: the problem description is complete, unambiguous, and handed to the AI in a single, well-formatted prompt with all the data neatly embedded inline.

That is not how real OR problems work. Not even close.

Consider what actually happens when a supply chain team wants to build an optimization model:

The problem description is incomplete and ambiguous. A business analyst will say “we want to minimize transportation costs” and forget to mention that each distribution center has a throughput limit, that some routes don’t exist, or that opening a facility incurs a one-time fixed cost. These omissions aren’t carelessness. They’re assumptions the analyst considers obvious, which is exactly why they’re dangerous. An AI system that starts modeling before these details are nailed down produces a model that is technically correct but practically wrong.
The data is too large to fit in a prompt. A real supply chain problem might involve hundreds of production sites, distribution centers, customers, and thousands of products over multiple periods. The demand table alone might have millions of entries. You cannot embed that in a prompt. Even if you could, flooding the context window with raw data dramatically increases the risk of hallucinations.
The data you have is not the data the model needs. The model might need a distance matrix between all pairs of locations. What you have is a table of GPS coordinates. The model might need aggregate demand by product and period. What you have is a transaction ledger with one row per order. Bridging this gap, namely computing derived parameters from raw data, is a significant engineering step that no existing LLM-for-OR tool handles automatically.
Once you have a working model, portability and reproducibility matter. If you want to re-run the model on updated data, switch from Gurobi to an open-source solver, or hand the model off to a colleague on a different machine, you’re back to square one unless the tool produces a durable, solver-agnostic artifact. Most tools produce solver-specific code and nothing else.

These aren’t edge cases. They are the standard conditions for any real-world OR deployment. Existing LLM-for-OR tools were built for a different world, a textbook world, and they show their seams the moment they leave it.

Introducing ORPilot

ORPilot is an open-source AI agent built from the ground up for production conditions. It is, to my knowledge, the first LLM-based OR tool designed explicitly for the messy, large-scale, data-heavy reality of industrial optimization.

Most AI tools for optimization jump straight to writing code the moment you describe your problem. ORPilot does something different: it asks questions first.

That design decision, prioritizing understanding over speed, reflects a single guiding principle: an AI agent should work the same way a skilled human OR consultant would.

A good consultant doesn’t walk into a client meeting and start writing a mathematical model on the whiteboard. They ask questions. They listen carefully. They push back when something is ambiguous. They make sure the data is in the right shape before the modeling begins. Only after all of that do they pick up the pen.

ORPilot’s pipeline reflects this discipline through five sequentially connected stages.

Stage 1: Interview Agent

The interview agent is the entry point. It receives your initial description of the business problem, which can be vague, incomplete, or even self-contradictory, and engages you in a structured dialog to fill in the gaps. The key design principle is that no modeling begins until the interview is complete.

The agent is prompted to identify information gaps in the current description, ask at most one targeted clarifying question per turn (to avoid overwhelming you), and terminate once the objective function, decision variables, constraints, and data requirements are all unambiguously specified.

In practice, this means conversations like:

ORPilot: “Once a facility is opened, does it remain open for all subsequent periods, or can it be closed later?”

ORPilot: “Does this model handle a single product type or multiple products?”

ORPilot: “You mentioned a transportation cost. Is this cost per unit shipped, per shipment regardless of quantity, or something else?”

Before ending the interview, the agent presents a full structured summary (objective function, decision variables, constraints, parameters, and indices) and gives you the chance to correct anything before that summary is passed downstream. This is the guard against the most common failure mode in LLM-for-OR tools: modeling the wrong problem.
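To make the stopping rule concrete, here is a minimal sketch of an interview loop that asks one question per turn and stops only when nothing is missing. It is not ORPilot's actual code: the field names, the `ProblemSpec` class, and the `ask_user` callback are assumptions made for illustration, and the placeholder question wording stands in for what an LLM would generate.

```python
from dataclasses import dataclass, field

# Hypothetical field names for the four things that must be pinned down.
REQUIRED_FIELDS = ("objective", "decision_variables", "constraints", "data_requirements")


@dataclass
class ProblemSpec:
    """Illustrative container for what the interview has established so far."""
    answers: dict = field(default_factory=dict)

    def missing(self):
        # A field counts as a gap until it holds a non-empty answer.
        return [name for name in REQUIRED_FIELDS if not self.answers.get(name)]

    def is_complete(self):
        return not self.missing()


def next_question(spec):
    """Return exactly one clarifying question, or None when nothing is missing."""
    gaps = spec.missing()
    if not gaps:
        return None
    # Placeholder wording; a real agent would have the LLM phrase the question.
    return f"Can you describe the {gaps[0].replace('_', ' ')} of your problem?"


def run_interview(spec, ask_user, max_turns=20):
    """Ask one question per turn until the spec is complete or turns run out."""
    for _ in range(max_turns):
        question = next_question(spec)
        if question is None:
            break
        gap = spec.missing()[0]
        spec.answers[gap] = ask_user(question)
    return spec


# Canned answers stand in for a real user in this demo.
canned = iter(["minimize cost", "shipment quantities", "capacity limits", "demand table"])
spec = run_interview(ProblemSpec(), lambda question: next(canned))
print(spec.is_complete(), spec.answers)
```

Only a spec for which `is_complete()` is true would be handed to the next stage; an empty answer simply leaves the gap open, so the same question comes back on the following turn.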

Stage 2: Data Collection Agent

This stage has no counterpart in most existing LLM-for-OR tools. It is one of the most important structural innovations in ORPilot.

Most existing LLM-for-OR tools assume the data is embedded in the problem text and small enough to fit in a prompt. For textbook problems, this works. For real problems, it breaks down in two ways. First, real datasets are too large. For example, a 500-customer, 500-product, 12-period supply chain problem would have 500 × 500 × 12 = 3,000,000 demand entries. Second, embedding data in the prompt inflates hallucination risk and burns through the context window unnecessarily.

ORPilot’s answer is to treat data as separate from the prompt entirely. Data lives in CSV files. The AI accesses it only by writing and executing code. The data collection agent’s job is to figure out exactly what those CSV files need to look like.
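One way to picture "data outside the prompt" is a helper that reads a CSV and returns only a compact summary (column names, row count, a few sample rows), so that a description of the table reaches the model rather than the table itself. This is a sketch under my own assumptions, not ORPilot's implementation; the function name `summarize_csv` and the demo column names are hypothetical.

```python
import csv
import io


def summarize_csv(text_stream, sample_rows=3):
    """Return column names, row count, and a few sample rows from a CSV stream."""
    reader = csv.reader(text_stream)
    header = next(reader)
    samples = []
    row_count = 0
    for row in reader:
        row_count += 1
        if len(samples) < sample_rows:
            samples.append(row)
    return {"columns": header, "row_count": row_count, "sample": samples}


# Tiny in-memory stand-in for a demand file (hypothetical column names).
demo = io.StringIO(
    "customer,product,period,demand\n"
    "C1,P1,1,40\n"
    "C1,P2,1,15\n"
    "C2,P1,1,30\n"
    "C2,P2,2,5\n"
)
summary = summarize_csv(demo, sample_rows=2)
print(summary)
```

The summary stays the same size whether the file has four rows or three million, which is the point: the full table is only ever touched by executed code.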

Based on the problem specification from the interview agent, the data collection agent determines:

Which entities (sets) exist in the model
What attributes (parameters) each entity needs
The precise schema for each required table: column names, types, semantics

It presents this specification to you and waits until you’ve supplied all the files in the correct format. It validates completeness before proceeding.
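A schema check of this kind can be sketched in a few lines. The schema format below (table name mapped to column names and parser types) and the table and column names are assumptions for illustration; the post does not describe how ORPilot represents schemas internally.

```python
import csv
import io

# Hypothetical schema: table name -> {column name: type used to parse each value}.
SCHEMA = {
    "customers": {"customer_id": str, "latitude": float, "longitude": float},
    "demand": {"customer_id": str, "period": int, "quantity": float},
}


def validate_table(name, text_stream, schema=SCHEMA):
    """Return a list of problems found in one CSV table (empty list means valid)."""
    expected = schema[name]
    reader = csv.DictReader(text_stream)
    missing = [col for col in expected if col not in (reader.fieldnames or [])]
    if missing:
        return [f"{name}: missing columns {missing}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for column, caster in expected.items():
            try:
                caster(row[column])
            except (TypeError, ValueError):
                problems.append(f"{name} line {line_no}: bad {column}={row[column]!r}")
    return problems


good = io.StringIO("customer_id,period,quantity\nC1,1,40\nC2,1,12.5\n")
bad = io.StringIO("customer_id,period,quantity\nC1,first,40\n")
print(validate_table("demand", good))
print(validate_table("demand", bad))
```

Returning a list of problems rather than raising on the first one lets the agent report everything that needs fixing in a single round.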

Crucially, the agent is flexible: if you don’t have a particular piece of model-ready data (say, the model needs a distance matrix but you only have GPS coordinates), you tell the agent what you actually have, and it updates the schema accordingly — passing the gap to the next stage to handle.

Stage 3: Parameter Computation Agent

Almost every existing LLM-for-OR tool assumes the numerical quantities needed by the model appear directly in the user-supplied data. In practice, this is almost never true. Two examples that come up constantly in real OR problems:

A vehicle routing model needs a pairwise distance matrix. The user has GPS coordinates. Computing Euclidean or geographic distances is a transformation entirely outside the scope of LP/MIP formulation.
A multi-period production model needs aggregate demand per period. The user has a transaction ledger with one row per order. The model parameter is a sum-aggregation that has to be computed from the raw data.
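Both transformations are short once written down. The sketch below uses the haversine formula for great-circle distance and a plain sum for the demand aggregation; the function names and input shapes are my own assumptions, and a generated script could just as well use Euclidean distance or a dataframe library.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_matrix(coords):
    """coords: {location_id: (lat, lon)} -> {(i, j): km} for all ordered pairs."""
    return {
        (i, j): haversine_km(*coords[i], *coords[j])
        for i in coords
        for j in coords
    }


def aggregate_demand(orders):
    """orders: iterable of (product, period, quantity) -> {(product, period): total}."""
    totals = defaultdict(float)
    for product, period, quantity in orders:
        totals[(product, period)] += quantity
    return dict(totals)


coords = {"A": (0.0, 0.0), "B": (0.0, 1.0)}
print(distance_matrix(coords)[("A", "B")])
print(aggregate_demand([("P1", 1, 10), ("P1", 1, 5), ("P2", 1, 3)]))
```

Neither function involves any optimization modeling, which is why it makes sense to treat this work as its own step.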

The parameter computation agent bridges this gap automatically. It receives the problem specification and the raw CSV files, then:

Identifies which model parameters cannot be read directly from the raw tables
Generates a Python script to compute those derived parameters
Executes the script in a sandboxed environment
Writes the results as additional CSV files, passed to the modeling step

This ensures that by the time the modeling agent sees the data, it is clean, correctly typed, correctly indexed, and model-ready. In our experiments, this step substantially reduced code generation failures and retry counts.

Another common use for the parameter computation agent is computing big-M values. In some of my experiments with ORPilot, the agent computed a big-M value needed for constraints linking continuous shipment variables to binary facility-opening decisions. This is a derived parameter that would be impractical to ask the user to provide directly.
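As an illustration of what such a derived parameter can look like, consider a linking constraint of the form x[i, j] <= M[i, j] * y[i], where x[i, j] is the shipment from facility i to customer j and y[i] is the binary open decision. A common tight choice is M[i, j] = min(capacity[i], demand[j]). The sketch below assumes the model also caps shipments at facility capacity and never ships more than a customer's demand; it is one standard choice, not necessarily the value ORPilot computed in the experiments.

```python
def big_m_values(capacity, demand):
    """Per-arc bound M[i, j] = min(capacity[i], demand[j]).

    Intended for linking constraints x[i, j] <= M[i, j] * y[i], under the
    assumption that no arc ever needs to carry more than the facility's
    capacity or the customer's demand.
    """
    return {
        (i, j): min(cap, dem)
        for i, cap in capacity.items()
        for j, dem in demand.items()
    }


# Hypothetical data for two facilities and two customers.
capacity = {"DC1": 100.0, "DC2": 40.0}
demand = {"C1": 30.0, "C2": 70.0}
M = big_m_values(capacity, demand)
print(M)
```

A bound this tight generally gives a stronger LP relaxation than a single large constant such as 1e6, which is one reason to compute it from the data instead of asking the user for a number.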

Stage 4: Code Generation Agent

With a complete problem specification, raw data, and derived parameters all in hand, the code generation agent produces a complete Python solver script for your chosen backend. ORPilot currently supports five backends: Gurobi, CPLEX, PuLP, Pyomo, and OR-Tools.

The generated code is immediately executed in a sandbox. If anything goes wrong (a syntax error, a runtime exception, or an infeasible or unbounded solver result), the full error message and traceback are fed back to the LLM along with the previously generated code. The agent retries, up to a user-configurable maximum number of attempts.
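The shape of such a generate, execute, and repair loop can be sketched as follows. Everything here is an assumption made for illustration: `generate` stands in for the LLM call, a plain subprocess stands in for the sandbox (it provides no real isolation), and a non-zero exit code is treated as failure, so an infeasible or unbounded result would only be caught if the generated script exits with an error in that case.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script(code, timeout=60):
    """Run code in a separate Python process; return (ok, combined output)."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "model.py"
        script.write_text(code)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout, cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out after {timeout} seconds"
    return result.returncode == 0, result.stdout + result.stderr


def generate_and_repair(generate, max_attempts=3):
    """Call generate(previous_code, error) until the script runs or attempts run out."""
    code, error = None, None
    for attempt in range(1, max_attempts + 1):
        code = generate(code, error)
        ok, output = run_script(code)
        if ok:
            return code, output, attempt
        error = output  # the traceback is fed back on the next attempt
    raise RuntimeError(f"no working script after {max_attempts} attempts:\n{error}")


def fake_generate(previous_code, error):
    # Stand-in for the LLM: the first draft has a bug, the retry fixes it.
    if error is None:
        return "print(1 / 0)"
    return "print('objective: 42')"


code, output, attempts = generate_and_repair(fake_generate)
print(attempts, output.strip())
```

Passing both the previous code and the error text into `generate` mirrors the description above: the model sees what it wrote and exactly how it failed.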

In practice, the majority of failures are resolved within one or two retries. The key reason ORPilot’s retry loop is effective is that the upstream stages have already done the hard work: the problem is correctly specified, the data is model-ready, and the agent only needs to fix a code-level mistake rather than rethink the entire model structure.

Stage 5: Reporter Agent

After a successful solve, a reporter agent translates the numerical results into plain English. It explains which facilities to open, which routes to use, and what quantities to produce, in the domain language of the original business problem, so that a business user rather than an OR expert can act on them.

Why This Order Matters

The pipeline is deliberately sequential. Each stage is gated on the previous one completing successfully. The interview must finish before data collection begins. Data must be validated before parameter computation runs. Parameters must be ready before code is generated.
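The gating itself is simple to express. In this sketch each stage is a function that receives everything produced so far and returns its own output, with `None` meaning "did not complete". The stage names follow the post, while the `run_pipeline` function, the `StageFailed` exception, and the placeholder stage bodies are hypothetical.

```python
class StageFailed(Exception):
    """Raised when a stage does not complete, which blocks every later stage."""


def run_pipeline(stages, state):
    """Run (name, function) stages in order; each sees the state built so far."""
    for name, stage in stages:
        result = stage(state)
        if result is None:
            raise StageFailed(f"{name} did not complete; later stages were not run")
        state = {**state, name: result}
    return state


# Placeholder stage bodies; real stages would call the agents described above.
stages = [
    ("interview", lambda s: {"objective": "minimize cost"}),
    ("data_collection", lambda s: ["demand.csv"]),
    ("parameter_computation", lambda s: ["distance.csv"]),
    ("code_generation", lambda s: "solver script"),
    ("reporter", lambda s: f"Report for: {s['interview']['objective']}"),
]
final = run_pipeline(stages, {})
print(final["reporter"])
```

Because a failed stage raises before the loop advances, code generation can never run on an unvalidated specification or on missing parameters.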

This sequencing prevents the most common failure mode in LLM-based OR tools: cascading errors where an ambiguous problem description propagates through the pipeline and produces code that is syntactically valid but models the wrong objective.

What This Looks Like at Scale

I tested ORPilot on a few OR problems, one of which is a supply chain network design problem with 50 production sites, 50 distribution centers, 500 customers, 500 products, and 12 periods. The resulting model had more than 9.7 million decision variables and 963,000 constraints. ORPilot successfully handled the full pipeline end to end, from the initial conversation through data collection, parameter computation, code generation, and solution reporting, and produced an optimal solution with Gurobi. Results for more test problems are in my paper: https://arxiv.org/abs/2605.02728

Getting Started

ORPilot is open source and available now:

GitHub: https://github.com/GuangruiXieVT/ORPilot
Paper: https://arxiv.org/abs/2605.02728

Installation takes a few minutes. ORPilot supports OpenAI, Anthropic, Google, and DeepSeek as LLM providers, and Gurobi, CPLEX, PuLP, Pyomo, and OR-Tools as solver backends.

In the next post in this series, we’ll take a deep dive on the Intermediate Representation (IR) — the solver-agnostic JSON artifact that makes ORPilot’s results reproducible and portable across backends without ever calling the LLM again. Stay tuned!