on a feature where I had to transform 100 messy compliance PDFs into structured JSON rules.
The brute force approach was obvious: give the agent the source text, explain the task, provide examples, and ask it to generate the rules. Since it was the lowest-hanging fruit, I tried it first.
At a glance, the output looked fine. The output JSON was valid and matched what I expected.
But as I manually sampled the results to check for accuracy, the cracks appeared. Some rules were too broad, and others were missed entirely. Some failed to preserve the nuances of the original text. I tried using another agent to catch and fix the errors, but with such a large corpus, it was impossible to verify the output with confidence.
That was the frustrating part. The errors were not obvious. This implementation was far too fragile to scale.
Though I cannot share the exact implementation details, what I can share are the architectural lessons I learnt and how I eventually implemented it. Hopefully, these insights will be useful if you’re building AI systems that need to scale, stay reliable, and deal with messy data. And if you have better ways of doing things, do reach out to chat!
Okay, let’s get to it.
The problem
The 100 PDFs I worked with had already been parsed and chunked before they reached me, but the raw content was still messy. There were bullet points, tables, OCR artefacts, translated sections, semi-structured headings, footers, headers, inconsistent formatting, and document-specific quirks.
I chose to use an agent because deciding what mattered required semantic judgement. The documents did not follow one consistent pattern, so relevance could not be determined through simple rules alone.
You had to understand the surrounding context. None of this was difficult when done on a small chunk of data. The challenge was performing this reliably at scale.
The generated rules were then passed to a downstream system that evaluated them deterministically.
What eventually worked
After a few experiments, I realised the biggest improvement did not come from a better prompt, a new tool, an MCP server, or a more sophisticated agent harness.
It came from changing the shape of the problem.
Instead of trying to make the agent smarter, I made the agent’s job smaller.
The first change was to prepare the source data upfront. Instead of asking the agent to query a database, retrieve records, decide whether it had the right inputs, and then perform the extraction, I gave it a more controlled starting point.
In my case, that meant temporarily storing the relevant raw data locally.
This may not always be practical. But the underlying principle is to reduce the amount of retrieval uncertainty the agent has to handle. If the agent’s job is to reason over content, do not also make it responsible for figuring out whether it has found the right content.
Another option would be to prepare the query upfront, so the agent runs a fixed query instead of composing its own.
I also used a script to strip away unnecessary metadata and fields before passing the raw content to the agent. Less irrelevant context meant fewer distractions, fewer chances for the agent to latch onto the wrong details and a cleaner reasoning task overall.
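A minimal sketch of what such a stripping step could look like. The field names (`chunk_id`, `doc_id`, `heading`, `text`) and the sample chunk are hypothetical, since the real chunk format is not shared here; the idea is simply an allowlist of fields the agent needs.

```python
from typing import Any

# Hypothetical allowlist: the fields actually kept in the real pipeline are not known.
KEEP_FIELDS = ("chunk_id", "doc_id", "heading", "text")


def strip_chunk(raw: dict[str, Any], keep: tuple[str, ...] = KEEP_FIELDS) -> dict[str, Any]:
    """Return a copy of a parsed chunk containing only the allowlisted fields."""
    return {key: raw[key] for key in keep if key in raw}


# Illustrative input only: parser metadata the agent does not need is dropped.
raw_chunk = {
    "chunk_id": "doc-001:0007",
    "doc_id": "doc-001",
    "heading": "4.2 Reporting",
    "text": "Firms must report incidents within 24 hours.",
    "embedding": [0.12, 0.87],
    "parser_version": "1.4.2",
    "page_bbox": [72, 90, 540, 130],
}
agent_input = strip_chunk(raw_chunk)
```

An allowlist is used rather than a blocklist so that any new field added upstream is excluded by default until someone decides the agent needs it.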
But the most important change was the unit of work.
Instead of processing everything at once, I processed one document at a time.
That made each job smaller, easier to inspect, easier to retry, and easier to audit. I spun up five subagents to process documents in parallel, with each agent logging its progress to a file.
If one document failed, I could retry only that document. If one output had formatting issues, I could fix that specific case without rerunning the whole batch. If the pipeline stopped halfway, the cached progress meant it could resume from the last successful checkpoint.
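The retry and resume behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: `process_document` is a hypothetical stand-in for the agent call, each finished document is checkpointed as its own JSON file, and a thread pool of five workers stands in for the five subagents.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def run_batch(
    doc_ids: list[str],
    process_document: Callable[[str], dict],  # hypothetical agent call
    out_dir: Path,
    max_workers: int = 5,
) -> dict[str, str]:
    """Process each document independently, skipping ones already checkpointed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    log_path = out_dir / "progress.log"
    status: dict[str, str] = {}

    def job(doc_id: str) -> str:
        out_path = out_dir / f"{doc_id}.json"
        if out_path.exists():  # checkpoint from an earlier run
            return "skipped"
        result = process_document(doc_id)
        # Write to a temp file first so a crash never leaves a half-written checkpoint.
        tmp_path = out_dir / f"{doc_id}.json.tmp"
        tmp_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
        tmp_path.replace(out_path)
        return "done"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(job, doc_id): doc_id for doc_id in doc_ids}
        for future in as_completed(futures):
            doc_id = futures[future]
            try:
                status[doc_id] = future.result()
            except Exception as exc:  # one bad document must not stop the batch
                status[doc_id] = f"failed: {exc}"
            # The log is appended from the main thread only, so lines never interleave.
            with log_path.open("a", encoding="utf-8") as log:
                log.write(f"{doc_id}\t{status[doc_id]}\n")
    return status
```

Rerunning `run_batch` with the same `out_dir` retries only the documents that have no output file yet, which is what makes a failed or interrupted run cheap to recover from.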
This was also where the separation of responsibilities became clearer.
The agent handled the semantic work: understanding the content, identifying the relevant parts and writing the JSON output.
The surrounding code handled the mechanical parts: parallelising jobs, enforcing the schema, generating IDs, writing files, caching progress, validating references, and checking whether the output could be traced back to the original source.
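To make the split concrete, here is a rough sketch of two of those mechanical steps: enforcing a schema and generating IDs in code rather than asking the agent to do it. The rule shape (`rule_text`, `source_chunk_id`, `source_quote`) and the ID format are assumptions for illustration; the real schema is not shared here.

```python
import hashlib

# Hypothetical rule shape: every field is a required, non-empty string.
REQUIRED_FIELDS = {"rule_text": str, "source_chunk_id": str, "source_quote": str}


def validate_rule(rule: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the rule passes."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in rule:
            problems.append(f"missing field: {field}")
        elif not isinstance(rule[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
        elif not rule[field].strip():
            problems.append(f"{field} is empty")
    unexpected = set(rule) - set(REQUIRED_FIELDS)
    if unexpected:
        problems.append(f"unexpected fields: {sorted(unexpected)}")
    return problems


def assign_rule_id(doc_id: str, rule: dict) -> dict:
    """Return a copy of the rule with a deterministic ID derived from its content."""
    key = f"{doc_id}|{rule['source_chunk_id']}|{rule['rule_text']}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]
    return {"rule_id": f"{doc_id}-{digest}", **rule}
```

Deriving the ID from the content means a rerun of the same document yields the same IDs, so outputs from different runs can be compared directly. The agent never has to invent or keep track of identifiers.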
I also had an orchestrator watch over the progress of the script.
Making the output auditable
A useful design decision was adding reference IDs to every generated rule. This meant that each output item pointed back to a specific source.
This made the output easier to audit. Instead of asking, “Does this generated rule look right?”, I could ask more precise questions such as: does the referenced source chunk exist? Is the quoted source text actually present in that chunk?
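Those two questions translate into a small deterministic check. The sketch below assumes each rule carries a `source_chunk_id` and a `source_quote` field and that chunks are available as a mapping from ID to text; these names are hypothetical.

```python
def audit_rule(rule: dict, chunks_by_id: dict[str, str]) -> list[str]:
    """Check that a rule's reference resolves and its quote appears in the source."""
    chunk_id = rule.get("source_chunk_id")
    if chunk_id not in chunks_by_id:
        return [f"unknown source chunk: {chunk_id!r}"]

    def normalise(text: str) -> str:
        # Compare on collapsed whitespace and case so line wraps do not cause false alarms.
        return " ".join(text.split()).casefold()

    quote = normalise(rule.get("source_quote", ""))
    if not quote:
        return ["empty source quote"]
    if quote not in normalise(chunks_by_id[chunk_id]):
        return [f"quote not found in chunk {chunk_id}"]
    return []
```

A check like this cannot tell whether a rule interprets the source correctly, but it does catch dangling references and quotes that do not appear in the source, and it costs nothing to run on every rule.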
I could also get another agent to selectively run audits on larger and more complex documents to ensure that important nuances were preserved.
On top of that, I did a lightweight version of evals. I ran a small batch of raw documents through the workflow and manually reviewed the results for coverage and accuracy. A full golden dataset was not practical for the scope of this task, but I still needed a way to prove to myself that the workflow was working.
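One way to keep such a small review batch repeatable is to pick it with a fixed seed. This is only a sketch; `pick_review_sample` and its parameters are hypothetical and not taken from the actual workflow.

```python
import random


def pick_review_sample(doc_ids: list[str], sample_size: int, seed: int = 0) -> list[str]:
    """Pick a reproducible random subset of documents for manual review."""
    population = sorted(doc_ids)  # sort so input order cannot change the sample
    rng = random.Random(seed)
    return sorted(rng.sample(population, min(sample_size, len(population))))
```

With the same seed, the same documents come back after every change to the workflow, so before and after results can be compared on identical inputs.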
My goal was not to build a perfect benchmark but to make the system auditable enough that I could inspect the outputs, catch failures, and iterate toward a higher accuracy bar.
If you’ve got ideas on how I could have done this better, let me know!
My biggest takeaway
The pattern that worked was to stop treating the LLM as the whole system.
The system became more reliable not because the agent became perfect, but because the workflow made its outputs easier to trace, validate, and recover from.
Coincidentally, I was building this shortly before attending the inaugural AI Engineer Singapore conference, held from 15 to 17 May 2026.
On the last day, JJ Geewax, Director of Applied AI at Google DeepMind, shared a framing that captured what I had been learning the hard way: we need to stop using LLMs like giant problem solvers.
That resonated with me because it is such an easy trap to fall into. It is easy to just give the model the data, schema, business rules, edge cases, and the responsibility to verify itself, and then get frustrated when the result is inconsistent.
But for reliable production systems, the better pattern is usually a hybrid. Let the agent handle the parts that require semantic judgement, and let code handle the parts that require structure, validation, and control.
I’ll be sharing more reflections from AI Engineer Singapore and the workshops I attended. The YouTube snippet of JJ’s speech is here.
That’s all from me. I hope this helped, and see you in the next article 🙂