Why I Don’t Trust LLMs to Decide When the Weather Changed



Weather apps have a simple problem: they show you the forecast, but they don’t tell you when it actually changed.

That might sound trivial. It isn’t.

Modern numerical weather prediction (NWP) systems — like ECMWF IFS — produce remarkably accurate forecasts at ~9 km resolution, updated every few hours. The data is already very good.

The problem is not the forecast.

The problem is attention: knowing when a change in that data is actually meaningful.

I didn’t learn that from software engineering. I learned it years earlier, studying chaos theory at the Instituto Balseiro. It was there, working through dynamical systems, that I first encountered a slightly unsettling idea:

A system can be completely deterministic and still be practically unpredictable.

That idea stayed with me. And years later, when I started building AI systems, I realized that many of them were ignoring it.


The problem with “vibe-based” deltas

When I started seeing how developers were building weather agents, I noticed a pattern:

  1. Fetch forecast data
  2. Feed it into an LLM
  3. Ask: “Did the weather change significantly?”

At first glance, this seems reasonable. From a physics perspective, it is problematic, at least for problems where the decision boundary is already well-defined, because it replaces an explicit threshold with a probabilistic interpretation.

In a chaotic system, significance is not a linguistic judgment — it is a threshold defined on variables like temperature, precipitation, or wind speed. It depends on magnitudes, context, and time horizons.

An LLM is a stochastic process. It is very good at generating language, but it is not designed to enforce deterministic boundaries on physical systems.

When you ask an LLM whether a forecast “changed significantly,” you’re asking a probabilistic model to approximate a deterministic rule that you could have defined explicitly. That introduces variability exactly where you want consistency.

The failure modes are subtle:

  • Trends inferred from phrasing rather than data
  • Inconsistent decisions across similar inputs
  • Outputs that cannot be tested or reproduced

In many applications, that might be acceptable. In agriculture, energy, and logistics — where a 3°C drop can mean a phase transition for a crop, a non-linear spike in energy demand, or an operational disruption — it is not. These decisions need to be stable and explainable.

Which led me to a simple rule:

If you can write an assert statement for it, you probably shouldn’t be using a prompt.
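
To make that concrete, here is the whole decision written as code. The function name and the 20pp value are illustrative defaults of mine, not a library API:

# Illustrative only: the threshold is an assumed default, not a standard.
RAIN_DELTA_THRESHOLD_PP = 20.0  # percentage points

def rain_change_is_significant(prev_prob: float, curr_prob: float) -> bool:
    return abs(curr_prob - prev_prob) > RAIN_DELTA_THRESHOLD_PP

assert rain_change_is_significant(10.0, 50.0)       # a 40pp jump: significant
assert not rain_change_is_significant(10.0, 15.0)   # a 5pp drift: noise

The prompt version of this question gets a different answer on a bad day. The code version never does.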


My path to this problem

My career has looked less like a straight line and more like a trajectory in phase space. A Marie Curie PhD in climate dynamics, five years directing R&D at Uruguay’s national meteorology institute — forest fire prevention, seasonal forecasting, climate adaptation — then a shift to production ML at Microsoft and Mercado Libre.

That arc gave me something specific: I already understood the physics of the data, the skill horizons of the models, and what “significant change” actually means in a physical system. Not as a software abstraction — as a measurable delta on a variable with known uncertainty bounds.

When I started building AI systems, the instinct was immediate: this is a threshold problem. Thresholds belong in code, not in prompts.

Skygent is one expression of that perspective — an agent designed not to display forecasts, but to detect meaningful changes in them.

The system runs continuously on real forecast data for user-defined events, evaluating changes every few hours and only triggering alerts when predefined conditions are met. In practice, most evaluation cycles result in no alert — only a small fraction of changes cross the significance threshold. That’s the point: signal, not noise.


The architecture

Skygent follows a clean separation across five layers:

[Pipeline diagram: Skygent’s five-layer architecture]

Only one layer calls the LLM.

The Deterministic Gatekeeper

At the core is a Python evaluator. It doesn’t interpret — it calculates. It:

  • Compares consecutive Pydantic-validated forecast snapshots
  • Evaluates deltas against configurable thresholds
  • Incorporates context: event type, variable sensitivity
  • Accounts for forecast horizon using established NWP skill limits — a change in a 24-hour forecast does not carry the same reliability as a change in a 10-day forecast

This is where decisions are made. Every alert has a traceable path: which variable changed, by how much, which threshold was crossed. In a corporate or government environment, being able to explain why an alert fired — without saying “the model felt like it” — is not optional.
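
Here is a minimal sketch of that gatekeeper pattern. The names, thresholds, and confidence buckets are illustrative assumptions, not Skygent’s actual implementation:

from pydantic import BaseModel

class ForecastSnapshot(BaseModel):
    precipitation_probability_max: float  # percent
    temperature_max: float                # °C
    wind_speed_max: float                 # km/h
    horizon_days: float                   # forecast lead time

class Change(BaseModel):
    variable: str
    from_value: float
    to_value: float
    delta: float
    confidence: str

# Per-variable significance thresholds on absolute deltas (assumed defaults).
THRESHOLDS = {
    "precipitation_probability_max": 20.0,  # percentage points
    "temperature_max": 3.0,                 # °C
    "wind_speed_max": 15.0,                 # km/h
}

def confidence_for(horizon_days: float) -> str:
    """Crude skill-horizon bucketing: nearer forecasts are more reliable."""
    if horizon_days <= 3:
        return "high"
    if horizon_days <= 7:
        return "medium"
    return "low"

def evaluate(prev: ForecastSnapshot, curr: ForecastSnapshot) -> list[Change]:
    """Return only the changes that cross their significance threshold."""
    changes = []
    for variable, threshold in THRESHOLDS.items():
        from_value = getattr(prev, variable)
        to_value = getattr(curr, variable)
        delta = to_value - from_value
        if abs(delta) > threshold:
            changes.append(Change(
                variable=variable,
                from_value=from_value,
                to_value=to_value,
                delta=delta,
                confidence=confidence_for(curr.horizon_days),
            ))
    return changes

Everything that matters for auditability is visible in that function: the variable, the delta, the threshold, the horizon.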

The Trigger

An alert fires only if a threshold is crossed. If the delta doesn’t cross the boundary, nothing happens. This is a binary, testable condition — not a judgment call.

The Narrator

Only after the decision is made does the LLM enter the pipeline. Its role is strictly limited: take structured JSON data, translate it into natural language.

# Structured payload sent to GPT-4o-mini
{
    "event_name": "Ana's Wedding",
    "variable": "precipitation_probability_max",
    "from_value": 10.0,
    "to_value": 50.0,
    "delta": 40.0,
    "horizon_days": 5.2,
    "confidence": "medium"
}

Output:

“Rain probability increased from 10% to 50% for your event window. Confidence is medium due to the 5-day forecast horizon.”

The LLM is not deciding anything. It is explaining.
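
The narrator boundary can be as thin as one function. A sketch using the OpenAI Python SDK; the prompt and wiring here are my illustration, not Skygent’s exact code:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def narrate(payload: dict) -> str:
    """Turn an already-made decision into a short natural-language alert."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "You write short weather-change alerts. Describe only the "
                "change in the JSON payload; add no judgments or advice."
            )},
            {"role": "user", "content": json.dumps(payload)},
        ],
    )
    return response.choices[0].message.content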


Why this architecture is testable

It is practically impossible to reach 100% test coverage on a pure LLM agent — you cannot write deterministic assertions on probabilistic outputs.

The hybrid approach changes this. The decision logic is pure Python with Pydantic-validated inputs: 204 unit tests, zero LLM dependencies in the test suite. The LLM handles only the narrative tone — the one thing that genuinely benefits from natural language generation.

This is not just a testing convenience. It means every decision the
system makes can be explained, reproduced, and verified independently of the LLM.
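
That claim is easy to demonstrate. Against the evaluator sketched above (hypothetical names again), the tests are ordinary deterministic asserts:

# Example of the kind of test the hybrid design allows: no LLM, no mocks.
def test_rain_jump_triggers_alert():
    prev = ForecastSnapshot(precipitation_probability_max=10.0,
                            temperature_max=22.0, wind_speed_max=15.0,
                            horizon_days=5.2)
    curr = prev.model_copy(update={"precipitation_probability_max": 50.0})
    changes = evaluate(prev, curr)
    assert [c.variable for c in changes] == ["precipitation_probability_max"]
    assert changes[0].delta == 40.0

def test_small_drift_stays_silent():
    prev = ForecastSnapshot(precipitation_probability_max=10.0,
                            temperature_max=22.0, wind_speed_max=15.0,
                            horizon_days=5.2)
    curr = prev.model_copy(update={"temperature_max": 21.4})
    assert evaluate(prev, curr) == []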


Event-Driven LLM Invocation

A naive agent calls the LLM on every polling cycle. This one doesn’t.

Skygent evaluates every 6 hours. It only calls the model when a threshold is crossed — roughly once or twice per week per monitored event, compared to ~28 calls per week for a naive agent that narrates every polling cycle.

At gpt-4o-mini pricing (~$0.0001 per narrative), cost is negligible. More importantly, cost is proportional to actual information: you pay for an LLM call only when something worth communicating happened.
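
The gating logic fits in a few lines. This sketch reuses the hypothetical evaluate and narrate functions from above; fetch_snapshot and send_alert stand in for the data and delivery layers:

import time

POLL_INTERVAL_S = 6 * 60 * 60  # one evaluation cycle every 6 hours

def run(fetch_snapshot, send_alert):
    """Poll forever; the LLM is invoked only when a threshold was crossed."""
    previous = None
    while True:
        current = fetch_snapshot()
        if previous is not None:
            for change in evaluate(previous, current):
                # The only LLM call in the loop lives behind this branch.
                send_alert(narrate(change.model_dump()))
        previous = current
        time.sleep(POLL_INTERVAL_S)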


A concrete example

Previous snapshot: Rain probability 10%, Max temp 22°C, Wind 15 km/h

Current snapshot: Rain probability 50%, Max temp 21.4°C, Wind 18 km/h

Threshold: Alert if rain probability Δ > 20pp

Evaluation frequency: Every 6 hours

Result: Alert triggered → GPT-4o-mini generates narrative → Telegram delivery
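
Fed through the evaluator sketched earlier, that snapshot pair produces exactly one significant change:

prev = ForecastSnapshot(precipitation_probability_max=10.0, temperature_max=22.0,
                        wind_speed_max=15.0, horizon_days=5.2)
curr = ForecastSnapshot(precipitation_probability_max=50.0, temperature_max=21.4,
                        wind_speed_max=18.0, horizon_days=5.2)

changes = evaluate(prev, curr)
# Only the 40pp precipitation jump crosses its threshold; the 0.6°C and
# 3 km/h drifts stay below theirs, so exactly one alert gets narrated.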

[Screenshot: example Skygent alert delivered via Telegram]

When this pattern breaks

This approach doesn’t apply everywhere. It breaks down when:

  • Inputs are unstructured or ambiguous
  • Decision boundaries cannot be codified as thresholds
  • Reasoning is open-ended

In those cases, LLM-first architectures — ReAct, Plan-and-Execute — make more sense.

One honest caveat: the thresholds in Skygent are configurable defaults — reasonable starting points informed by meteorological practice, but not calibrated against historical forecast errors for specific use cases. Calibration against real outcomes is the natural next step for any vertical deployment. The pattern is sound; the parameters are a starting point.
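
In code, that caveat is just a config surface. One illustrative shape for per-vertical defaults (not Skygent’s actual schema):

# Illustrative profiles; a real deployment would calibrate these values
# against historical forecast errors for its own variables and stakes.
THRESHOLD_PROFILES = {
    "default":         {"precipitation_probability_max": 20.0, "temperature_max": 3.0},
    "frost_sensitive": {"precipitation_probability_max": 30.0, "temperature_max": 1.0},
}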


Closing

The most important decision I made building this system was not choosing a model or a framework.

It was deciding where not to use an LLM.

There is a tendency right now to delegate more and more to language models — to let them figure things out. But some problems already have structure. Some decisions already have boundaries.

When they do, approximating them with language is the wrong move. Encoding them explicitly is better.

In practice, this often comes down to a simple distinction: use LLMs to explain decisions, not to replace well-defined ones.

The full implementation — significance evaluator, LangGraph pipeline, Telegram bot — is available at: github.com/ferariz/skygent


Fernando Arizmendi builds production AI systems at the intersection of rigorous scientific method and applied AI engineering. He is a physicist (B.Sc. & M.Sc.) from Instituto Balseiro, former Marie Curie fellow (Ph.D. studying Climate Dynamics & Complex Systems), and previously directed R&D at Uruguay’s national meteorology institute.

LinkedIn · GitHub

All images by the author. Pipeline diagram generated with Claude (Anthropic).
