Ten Lessons of Building LLM Applications for Engineers



Large language models (LLMs) are rapidly finding their way into workflows across industries, and traditional engineering domains are no exception.

In the past two years, I’ve been building LLM-powered tools with engineering domain experts: process engineers, reliability engineers, cybersecurity analysts, and the like, people who spend most of their day in logs, specs, schematics, and reports, working on tasks such as troubleshooting, failure mode analysis, test planning, and compliance checks.

The promise is compelling: thanks to their extensive pre-trained knowledge, LLMs can, in theory, reason like domain experts, accelerate the tedious, pattern-matching parts of engineering work, and free up experts for higher-order decisions.

The practice, however, is messier. “Just add a chatbox” rarely translates into useful engineering tools. There is still quite a large gap between an impressive demo and a system that engineers actually trust and use.

Closing that gap has everything to do with how you frame the problem, how you structure the workflow, and how you integrate the tool into the engineer’s real environment.

In this post, I’d like to share 10 lessons I learned from my past projects. They are a collection of “field notes” rather than a comprehensive checklist. But if you plan to build, or are currently building, LLM applications for domain experts, I hope these lessons help you avoid a few painful dead ends.

Our roadmap. (Image by author)

I organize the lessons into three phases, which exactly align with the stages of a typical LLM project:

  • Before you start: frame the right problem and set the right expectation.
  • During the project: design clear workflows and enforce structure everywhere.
  • After you have built: integrate where engineers work and evaluate with real cases.

With that in mind, let’s get started.


Phase 1: Before You Start

What you do before even writing a single line of code largely shapes whether an LLM project will succeed or fail.

That means if you are chasing the wrong problem or failing to set the right expectation upfront, your application will struggle to gain traction later, no matter how technically sound you make it.

In the following, I’d like to share some lessons on laying the right foundation.

Lesson 1: Not every problem can or should be addressed by LLMs

When I look at a new use case from engineers, I’d always try very hard to challenge my “LLM-first” reflex and really ask myself: can I solve the problem without using LLMs?

For the core reasoning logic, that is, the decision-making bottleneck you want to automate, there are usually at least three classes of methods you can consider:

  • Rule-based and analytical methods
  • Data-driven ML models
  • LLMs

Rule-based and analytical methods are cheap, transparent, and easy to test. However, they can be inflexible and offer only limited power in the face of messy reality.

Classic ML models, even a simple regression or classification, can often give you fast, reliable, and easily scalable decisions. However, they require historical data (and quite often, also the labels) to learn the patterns.

LLMs, on the other hand, shine if the core challenge is about understanding, synthesizing, or generating language across messy artifacts. Think skimming through 50 incident reports to surface likely relevant ones, or turning free-text logs into labeled, structured events. But LLMs are expensive, slow, and usually don’t behave as deterministically as you might want.

Before deciding to use an LLM for a given problem, ask yourself:

  1. Could 80% of the problem be solved with a rule engine, an analytical model, or a classic model? If yes, simply start there. You can always layer an LLM on top later if needed.
  2. Does this task require precise, reproducible numerical results? If so, then keep the computation in analytical code or ML models, and use LLMs only for explanation or contextualization.
  3. Will there be no human in the loop to review and approve the output? If that’s the case, then an LLM might not be a good choice as it rarely provides strong guarantees.
  4. At our expected speed and volume, would LLM calls be too expensive or too slow? If you need to process thousands of log lines or alerts per minute, relying on LLM alone will quickly make you hit a wall on both cost and latency.

If your answers are mostly “no”, you’ve probably found a good candidate to explore with LLMs.


Lesson 2: Set the right mindset from day one

Once I’m convinced that an LLM-based solution is appropriate for a specific use case, the next thing I’d do is to align on the right mindset with the domain experts.

One thing I find extremely crucial is the positioning of the tool. A framing I usually adopt that works very well in practice is this: the goal of our LLM tool is for augmentation, not automation. The LLM only helps you (i.e., domain experts) analyze faster, triage faster, and explore more, but you remain the decision-maker.

That difference matters a lot.

When you position the LLM tool as augmentation, engineers tend to engage with it enthusiastically, because they see it as something that could make their work faster and less tedious.

On the other hand, if they sense that the new tool is something that may threaten their role or autonomy, they will distance themselves from the project and give you very limited support.

From a developer’s point of view (which is you and me), setting this “amplify instead of replace” mindset also reduces anxiety. Why? Because it makes it much easier to talk about mistakes! When the LLM gets something wrong (and it will), the conversation won’t simply be “your AI failed”; it’s more like “the suggestion wasn’t quite right, but it’s still insightful and gives me some ideas.” That’s a very different dynamic.

Next time, when you are building LLM Apps for domain experts, try to emphasize:

  • LLMs are, at best, junior assistants. They are fast, work around the clock, but not always right.
  • Experts are the reviewers and ultimate decision-makers. You are experienced, cautious, and accountable.

Once this mindset is in place, you’ll see engineers start to evaluate your solution through the lens of “Does this help me?” rather than “Can this replace me?” That matters a lot in building trust and enhancing adoption.


Lesson 3: Co-design with experts and define what “better” means

Once we’ve agreed that LLMs are appropriate for the task at hand and the goal is augmentation not automation, the next critical point I’ll try to figure out is:

“What does better actually mean for this task?”

To get a really good understanding of that, you need to bring the domain experts into the design loop as early as possible.

Concretely, you should spend time sitting down with the domain experts, walking through how they solve the problem today, and taking notes on which tools they use and which docs/specs they refer to. Ask them to point out where the pain really is, and make sure you understand what is OK to be “approximate” and which types of mistakes are annoying or unacceptable.

A concrete outcome of these conversations is a shared definition of “better” in the experts’ own language. These are the metrics you are optimizing for, which could be the amount of triage time saved, the number of false leads avoided, or the number of manual steps skipped.

Once the metric(s) are defined, you automatically have a realistic baseline (i.e., whatever the current manual process takes) to benchmark your solution against later.

Besides the technical effects, I’d say the mental effects are just as important: by involving experts early, you’re showing them that you’re genuinely trying to learn how their world works. That alone goes a long way in earning trust.


Phase 2: During The Project

After setting up the stage, you’re now ready to build. Exciting stuff!

In my experience, there are a couple of important decisions you need to make to ensure your hard work actually earns trust and gets adopted. Let’s talk about those decision points.

Lesson 4: It’s Co-pilot, not Auto-pilot

A temptation I see a lot (also in myself) is the desire to build something “fully autonomous”. As a data scientist, who can really resist building an AI system that gives the user the final answer with just one button push?

Well, the reality is less flashy but far more effective. In practice, this “autopilot” mindset rarely works well with domain experts, as it fundamentally goes against the fact that engineers are used to systems where they understand the logic and the failure modes.

If your LLM app simply does everything in the background and only presents a final result, two things usually happen:

  • Engineers don’t trust the results because they can’t see how it got there.
  • They can’t correct it, even if they see something obviously off.

Therefore, instead of defaulting to an “autopilot” mode, I prefer to intentionally design the system with multiple control points where experts can influence the LLM’s behavior. For example, instead of having the LLM auto-classify all 500 alarms and create tickets, we can design the system to first group the alarms into, say, 5 candidate incident threads, pause, and show the expert the grouping rationale and key log lines for each thread. The expert can then merge or split groups, and only after they approve the grouping does the LLM proceed to generate draft tickets.
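Here is a minimal sketch of that checkpoint pattern in plain Python. The prompts, the model name, and the console-based review step are illustrative assumptions; in a real tool, the review step would live in your UI rather than on the command line.

import json
from openai import OpenAI

client = OpenAI()

def propose_incident_threads(alarms: list[str]) -> list[dict]:
    # Step 1: the LLM only proposes a grouping; it creates nothing yet.
    prompt = (
        "Group these alarms into candidate incident threads. Respond in JSON as "
        '{"threads": [{"title": str, "rationale": str, "alarm_indices": [int]}]}.\n'
        + json.dumps(list(enumerate(alarms)))
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["threads"]

def expert_review(threads: list[dict]) -> list[dict]:
    # Control point: the engineer sees the rationale and can stop the run.
    for t in threads:
        print(f"- {t['title']}: {t['rationale']}")
    if input("Accept this grouping? [y/n] ").strip().lower() != "y":
        raise SystemExit("Grouping rejected; adjust the inputs or prompt and re-run.")
    return threads

def draft_ticket(thread: dict) -> str:
    # Step 2 runs only after the expert has approved the grouping.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Draft a ticket for this incident thread: " + json.dumps(thread)}],
    )
    return resp.choices[0].message.content

alarms = ["Pump P-102 pressure low", "Valve V-17 position mismatch", "Pump P-102 vibration high"]
tickets = [draft_ticket(t) for t in expert_review(propose_incident_threads(alarms))]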

Yes, from a UI perspective, this adds a bit of work, as you have to implement human-input mechanisms, expose intermediate reasoning traces and results clearly, and so on. But the payoff is real: your experts will actually trust and use your system because it gives them the sense that they are in control.


Lesson 5: Focus on workflow, roles, and data flow before picking a framework

Once we get into the implementation phase, a common question many developers (including myself in the past) tend to ask first is:

“Which LLM App framework should I use? LangGraph? CrewAI? AutoGen? Or something else?”

This instinct is totally understandable. After all, there are so many shiny frameworks out there, and it does feel like choosing the “right” one is the first big decision. But for prototyping with engineering domain experts, I’d argue that this is usually not the right place to start.

In my own experience, for the first version, you can go a long way with the good old from openai import OpenAI or from google import genai (or whichever LLM provider you favor).

Why? Because at this stage, the most pressing question is not which framework to build upon, but:

“Does an LLM actually help with this specific domain task?”

And you need to verify it as quickly as possible.

To do that, I’d like to focus on three pillars instead of frameworks:

  • Pipeline design: How do we decompose the problem into clear steps?
  • Role design: How should we instruct the LLMs at each step?
  • Data flow & context design: What goes in and out of each step?

If you treat each LLM call as a pure function, like this:

inputs → LLM reasoning → output

Then, you can wire these “functions” together with just normal control flow, e.g., if/else conditions, for/while loops, retries, etc., which are already natural to you as a developer.

This applies to tool calling, too. If the LLM decides it needs to call a tool, it can simply output the function name and the associated parameters, and your regular code can execute the actual function and feed the result back into the next LLM call.
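To make this concrete, here is a minimal sketch of both ideas with a raw OpenAI client. The JSON-based “tool request” protocol and the lookup_spec helper are assumptions for illustration, not a prescribed pattern (most SDKs also offer native tool/function calling you can switch to later).

import json
from openai import OpenAI

client = OpenAI()

def llm_step(prompt: str) -> dict:
    # One LLM call treated as a pure function: text in, parsed JSON out.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def lookup_spec(component: str) -> str:
    # Hypothetical tool; in a real project this might query a spec database.
    return f"(spec text for {component})"

TOOLS = {"lookup_spec": lookup_spec}

def answer_with_tools(question: str) -> str:
    # Step 1: the model either answers or requests a tool, expressed as JSON.
    step = llm_step(
        "Answer the question or request one of these tools: " + json.dumps(list(TOOLS))
        + '. Respond in JSON as {"action": "answer" or "tool", "tool": str or null, '
        '"argument": str or null, "text": str or null}. Question: ' + question
    )
    if step.get("action") == "tool" and step.get("tool") in TOOLS:
        # Ordinary control flow executes the tool and feeds the result back.
        result = TOOLS[step["tool"]](step.get("argument") or "")
        step = llm_step(
            "Using this tool result, answer the question in JSON as "
            '{"action": "answer", "text": str}. '
            f"Question: {question}\nTool result: {result}"
        )
    return step.get("text", "")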

You really don’t need frameworks just to express the pipeline.

Of course, I’m not saying that you should avoid using frameworks. They are quite helpful in production as they provide observability, concurrency, state management, etc., out of the box. But for the early stage, I think it’s a good strategy to just keep things simple, so that you can iterate faster with domain experts.

Once you have verified your key assumptions with your experts, it’s not going to be difficult to migrate your pipeline/role/data design to a more production-ready framework.

In my opinion, this is lean development in action.


Lesson 6: Try workflows before jumping to agents

Recently, there has been quite a lot of discussion around workflows vs. agents. Every major player in the field seems eager to emphasize that they are “building agents,” instead of just “running predefined workflows.”

As developers, it’s very easy to feel the temptation:

Yeah, we definitely should build autonomous agents that figure things out on their own, right?

No.

On paper, AI agents sound super attractive. But in practice, especially in engineering domains, I’d argue that a well-orchestrated workflow with domain-specific logic can already solve a large fraction of the real problems.

And here is the thing: it does so with far less randomness.

In most cases, engineers already follow a certain workflow to solve that specific problem. Instead of letting LLM agents “rediscover” that workflow, it’s far better if you translate that “domain knowledge” directly into a deterministic, staged workflow. This immediately gives you a couple of benefits:

  • Workflows are way easier to debug. If your system starts to behave strangely, you can easily spot which step is causing the issue.
  • Domain experts can easily understand what you are building, because a workflow maps naturally to their mental model.
  • Workflows naturally invite human feedback. They can easily be paused, accept new inputs, and then resume.
  • You get much more consistent behavior. The same input would lead to a similar path or outcome, and that matters a ton in engineering problem-solving.

Again, I’m not saying that AI agents are useless. There are certainly many situations where more flexible, agentic-like behavior is justified. But I’d say always start with a clear, deterministic workflow that explicitly encodes domain knowledge, and validate with experts that it’s actually helpful. You can introduce more agentic behavior if you hit limitations that a simple workflow cannot solve.

Yes, it might sound boring. But your ultimate goal is to solve the problem in a predictable and explainable way that brings business value, not to ship a fancy agentic demo. It’s good to always keep that in mind.


Lesson 7: Structure everything you can – inputs, outputs, and knowledge

A common perception of LLMs is that they are good at handling free-form texts. So the natural instinct is: let’s just feed reports and logs in and ask the model to reason, right?

No.

In my experience, especially in engineering domains, that’s leaving a lot of performance on the table. In fact, LLMs tend to behave much better when you give them structured input and ask them to produce structured output.

Engineering artifacts often come in semi-structured form already. Instead of dumping entire raw documents into the prompt, I find it very helpful to extract and structure the key information first. Free-text incident reports, for example, can be parsed into a consistent JSON record.
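A minimal illustrative record might look like the following; the field names here are placeholders rather than a fixed schema, and in practice you would align them with whatever your reports actually contain:

{
  "incident_id": "INC-2041",
  "timestamp": "2024-03-12T08:47:00Z",
  "system": "cooling loop B",
  "symptom": "pressure drop below alarm threshold",
  "actions_taken": ["restarted pump P-102", "isolated valve V-17"],
  "outcome": "pressure recovered after 40 minutes",
  "source": "report_2041.txt, lines 12-58"
}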

That structuring step can be done in various ways: we can resort to classic regexes, or develop small helper scripts. We can even employ a separate LLM whose only job is to normalize the free text into a consistent schema.

This way, you give the main reasoning LLM a clean view of what happened. As a bonus, with this structure in place, you can ask the LLM to cite specific facts when reaching its conclusions, which saves you quite some time in debugging.

If you’re doing RAG, this structured layer is also what you should retrieve over, instead of the raw PDFs or logs. You’d get better precision and more reliable citations when retrieving over clean, structured artifacts.

Now, on the output side, structure is basically mandatory if you want to plug the LLM into a larger workflow. Concretely, this means instead of asking:

“Explain what happened and what we should do next.”

I prefer something like:

“Fill this JSON schema with your analysis.”

{
  "likely_causes": [
    {"name": "...", "confidence": "low|medium|high"}
  ],
  "recommended_next_steps": [
    {"description": "...", "priority": 1}
  ],
  "summary": "short free-text summary for the human"
}

Usually, this is defined as a Pydantic model, and you can leverage the “Structured Outputs” feature to explicitly instruct the LLM to produce output that conforms to it.
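As a sketch, and assuming a recent version of the openai Python SDK (the parse helper and the model name are the parts you may need to adapt to your provider), the schema above could be expressed and enforced like this:

from pydantic import BaseModel, Field
from openai import OpenAI

class Cause(BaseModel):
    name: str
    confidence: str = Field(description="low, medium, or high")

class NextStep(BaseModel):
    description: str
    priority: int

class Analysis(BaseModel):
    likely_causes: list[Cause]
    recommended_next_steps: list[NextStep]
    summary: str

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # example model name
    messages=[
        {"role": "system", "content": "You analyze incident records for a reliability engineer."},
        {"role": "user", "content": "Analyze this incident: ..."},
    ],
    response_format=Analysis,  # the SDK turns the Pydantic model into a JSON schema
)
analysis = completion.choices[0].message.parsed  # an Analysis instance, not free text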

I used to see LLMs as “text in, text out”. But now I see it more as “structure in, structure out”, and this is especially true in engineering domains where we need precision and robustness.


Lesson 8: Don’t forget about analytical AI

I know we are building LLM-based solutions. But as we learned in Lesson 1, LLMs are not the only tool in your toolbox. We also have the “old school” analytical AI models.

In many engineering domains, there is a long track record of applying classic analytical AI/ML methods to address various aspects of the problems, e.g., anomaly detection, time-series forecasting, clustering, classification, you name it.

These methods are still incredibly valuable, and in many cases, they should be doing the heavy lifting instead of being thrown away.

To effectively solve the problem at hand, it is often worth considering a hybrid approach of analytical AI + GenAI: analytical ML handles the heavy lifting of pattern matching and detection, and LLMs operate on top to reason, explain, and recommend next steps.

For example, say you have thousands of incident events per week. You could start by using classical clustering algorithms to group similar events into patterns, and maybe also compute some aggregate stats for each cluster. The workflow can then feed those cluster-level results into an LLM and ask it to label each pattern, describe what it means, and suggest what to check first. Finally, engineers review and refine the labels.
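A toy version of that hybrid split might look like the following; the events, the vectorizer, and the number of clusters are all placeholders you would tune for your own data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from openai import OpenAI

# Toy stand-ins for the thousands of weekly events you would load from your log store.
events = [
    "Pump P-102 pressure low during startup",
    "Pump P-102 vibration high after restart",
    "Valve V-17 position mismatch on cooling loop B",
    "Valve V-17 failed to close during shutdown",
    "Sensor T-9 intermittent dropout in telemetry",
    "Sensor T-9 reporting frozen values for 2 hours",
]

# Analytical AI does the heavy lifting: cheap, deterministic grouping.
X = TfidfVectorizer(stop_words="english").fit_transform(events)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# The LLM operates on top: label each pattern and suggest what to check first.
client = OpenAI()
for k in sorted(set(labels)):
    members = [e for e, lab in zip(events, labels) if lab == k]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content":
            "These events were grouped by a clustering algorithm. Give the pattern a short "
            "label, describe it, and suggest what to check first:\n" + "\n".join(members)}],
    )
    print(f"Cluster {k}:", resp.choices[0].message.content)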

So why does this matter? Because analytical methods give you speed, reliability, and precision on structured data. They’re deterministic, they scale to millions of data points, and they don’t hallucinate. LLMs, on the other hand, excel at synthesis, context, and communication. Use each for what it’s best at.


Phase 3: After You Have Built

You’ve built a system that works technically. Now comes the hardest part: getting it adopted. No matter how smart your implementation is, a tool that is put on a shelf is a tool that brings zero value.

In this section, I’d like to share two final lessons on integration and evaluation. You want to make sure your system lands in the real world and earns trust through evidence, right?

Lesson 9: Integrate where engineers actually work

A separate UI, such as a simple web app or a notebook, works perfectly fine for exploration and getting first-hand feedback. But for real adoption, you should think beyond what your app does and focus on where your app shows up.

Engineers already have a suite of tools they rely on every day. Now, if your LLM tool presents itself as “yet another web app with a login and a chat box”, you can already see that it will struggle to become part of the engineers’ routine. People will try it once or twice, then when things get busy, they just fall back to whatever they are used to.

So, how to address this issue?

I’d ask myself this question at this point:

“Where in the existing workflow would this app actually be used, and what would it look like there?”

In practice, what does this imply?

The most powerful integration is often UI-level embedding. That basically means you embed LLM capabilities directly into the tools engineers already use. For example, in a standard log viewer, besides the usual dashboard plots, you can add a side panel with buttons like “summarize the selected events” or “suggest next diagnostic steps”. This empowers the engineers with the LLM intelligence without interrupting their usual workflow.

One caveat worth mentioning, though: UI-level embedding often requires buy-in from the team that owns that tool. If possible, start building those relationships early.

Then, instead of a generic chat window, I’d focus on buttons with concrete verbs that match how engineers think about their tasks, be it summarize, group, explain, or compare. A chat interface (or something similar) can still exist if engineers have follow-up questions, need clarifications, or wish to input free-form feedback after the LLM produces its initial output. But the primary interaction here should be task-specific actions, not open-ended conversation.

Also important: make the LLM’s context dynamic and adaptive. If the system already knows which incident or time window the expert is looking at, pass that context directly into the LLM calls. Don’t make them copy-paste IDs, logs, or descriptions into yet another UI.
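For instance, a side-panel button could call something like the sketch below, where the incident ID, time window, and selected log lines come from the host tool’s current view rather than from the engineer’s clipboard. The function and field names are assumptions for illustration:

from openai import OpenAI

client = OpenAI()

def summarize_selection(incident_id: str, time_window: str, log_lines: list[str]) -> str:
    # Build the context from what the engineer is already looking at,
    # so nothing has to be copy-pasted into another UI.
    context = (
        f"Incident: {incident_id}\n"
        f"Time window: {time_window}\n"
        "Selected log lines:\n" + "\n".join(log_lines)
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {"role": "system", "content": "Summarize the selected events for a reliability engineer."},
            {"role": "user", "content": context},
        ],
    )
    return resp.choices[0].message.content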

If this integration is done well, the barrier to trying it (and ultimately adopting it) would become much lower. And for you as a developer, it’s much easier to get richer and more honest feedback as it’s tested under real conditions.


Lesson 10: Evaluation, evaluation, evaluation

Once you have shipped the first version, you might think your work is done. Well, the truth is, in practice, that is exactly the point where the real work starts.

It’s the beginning of the evaluation.

There are two things I want to discuss here:

  • Make the system show its work in a way that engineers can inspect.
  • Sit down with experts and walk through real cases together.

Let’s discuss them in turn.

First, make the system show its work. When I say “show its work”, I don’t just mean a final answer. I want the system to expose, at a reasonable level of detail, three concrete things: what it looked at, what steps it took, and how confident it is.

  • What it looked at: this is essentially the evidence the LLM used. It’s a good practice to always instruct the LLM to cite specific evidence when it produces a conclusion or recommendation. That evidence can be the specific log lines, incident IDs, or spec sections that support the claim. Remember the structured input we discussed in Lesson 7? It comes in handy here for citation management and verification.
  • What steps it took: this is the reasoning trace produced along the pipeline. Here, I’d expose the output of the key intermediate steps. If you’re adopting a multi-step workflow (Lessons 5 & 6), you already have these steps as separate LLM calls or functions, and if you’re enforcing structured output (Lesson 7), surfacing them in the UI becomes easy.
  • How confident it is: finally, I almost always ask the LLM to output a confidence level (low/medium/high), plus a short rationale for why it assigned that level. In practice, what you obtain is something like: “The LLM said A, based on B and C, with medium confidence because of D and E assumptions.” Engineers are much more comfortable with that kind of statement, and again, this is a crucial step towards building trust. (A small schema that captures all three elements is sketched right after this list.)
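To tie the three together, a response schema along these lines makes “show its work” something you can render directly in the UI. The field names are illustrative, not a fixed design:

from pydantic import BaseModel

class Finding(BaseModel):
    claim: str
    evidence: list[str]            # e.g. log line IDs, incident IDs, spec sections
    confidence: str                # "low", "medium", or "high"
    confidence_rationale: str      # why the model assigned that level

class Diagnosis(BaseModel):
    steps_taken: list[str]         # brief trace of the pipeline's key intermediate steps
    findings: list[Finding]
    summary: str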

Now, let’s go to the second point: evaluate with experts using real cases.

My suggestion is, once the system can properly show its work, you should schedule dedicated evaluation sessions with domain experts.

It’s like doing user testing.

A typical session could look like this: you and the expert pick a set of real cases. These can be a mix of typical ones, edge cases, and a few historical cases with known outcomes. You run them through the tool together. During the process, ask the expert to think aloud: What do you expect the tool to do here? Is this summary accurate? Are these suggested next steps reasonable? Would you agree that the cited evidence actually supports the conclusion? Meanwhile, remember to take detailed notes on things like where the tool clearly saves time, where it still fails, and what important context is currently missing.

After a couple of sessions with the experts, you can tie the results back to the “better” we defined earlier (Lesson 3). This doesn’t have to be a “formal” quantitative evaluation, but trust me, even a handful of concrete before/after comparisons can be eye-opening, and give you a solid foundation to keep iterating your solution.


Conclusion

Now, looking back at those ten lessons, what recurring themes do you see?

Here is what I see:

First, respect the domain expertise. Start from how domain engineers actually work and genuinely learn their pain points and wishes. Position your tool as something that helps them, not something that replaces them. Always let experts stay in control.

Second, engineer the system. Start with simple SDK calls, deterministic workflows, structured inputs/outputs, and mix traditional analytics with the LLM if that makes sense. Remember, LLMs are just one component in a larger system, not the entire solution.

Third, treat deployment as the beginning, not the end. The moment you deliver the first working version is when you can finally start having meaningful conversations with experts: walking through real cases together, collecting their feedback, and iterating.

Of course, these lessons are just my current reflections on what seems to work when building LLM applications for engineers, and they are certainly not the only way to go. Still, they’ve served me well, and I hope they can spark some ideas for you, too.

Happy building!
