Across 64 English authorities and six 2026 scenarios, even the strongest scenario shock was only 13% of the median uncertainty band.
In plain English: the model’s assumptions moved the result less than historical forecast error did. The most aggressive challenger surge I could parameterise sits inside the noise the model has produced in past elections. That is not a defect. It is the result.
I built this scenario model expecting clean separation between assumptions. I expected S3, the challenger surge, to dominate. I expected rankings I could defend. What I got was an envelope where the strongest shock sits inside calibrated uncertainty, and where rankings dissolve when intervals are plotted on top of them.
This is the second instalment of a project on English local electoral data. Part 1 corrected a categorical-normalisation bug that reversed the original headline. Part 2 picks up where the corrected baseline ends and asks a different question: given the historical churn we now measure correctly, what 2026 scenarios are worth modelling, and how should we read them when uncertainty is wider than the shocks?
What was modelled
The 2026 English local elections are scheduled for Thursday 7 May 2026. This project covers 64 active authorities holding elections that day: 32 London boroughs, 27 metropolitan boroughs, and 5 West Yorkshire authorities. Six scenarios apply different assumptions to the same historical baseline. Four metrics are computed for each scenario × authority combination: volatility_score, delta_fi, swing_concentration, and turnout_delta. The model produces 1,536 output rows (64 authorities × 6 scenarios × 4 metrics), each with a point estimate plus calibrated P10, P50, and P90 values from 2,000 draws of the empirical error distribution.
| Scenario | Question | Main assumption |
|---|---|---|
| S0 | What if no new swing is applied? | Historical uncertainty only |
| S1 | What if 2018-2022 challenger patterns continue? | Continuation of recent challenger churn |
| S2 | What if major parties partially recover? | Establishment recovers half lost share |
| S3 | What if challengers surge harder? | Stress test: +4pp challenger surge |
| S4 | What if deprivation-linked turnout rises? | +3pp turnout in IMD deciles 1-3 |
| S5 | What if London volatility is capped by history? | London P90 upper-tail cap |
Each scenario is a controlled perturbation. Labels describe assumptions, not outcomes. The full interactive dashboard is on Tableau Public.
Two definitions to carry through the rest of the article: scenario shock is the movement in the scenario point estimate relative to the baseline. Uncertainty width is the P10-to-P90 interval calibrated from historical forecast error. The 13% headline is the first divided by the second.
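To make that arithmetic concrete, here is a minimal sketch with made-up numbers for a single authority; the real figures come from the frozen model output, not from this snippet.

```python
# Illustrative numbers only -- not the frozen model output.
# Scenario shock: movement of the scenario point estimate relative to the S0 baseline.
baseline_estimate = 10.0   # hypothetical S0 volatility_score for one authority
scenario_estimate = 11.3   # hypothetical S3 value for the same authority
shock = abs(scenario_estimate - baseline_estimate)

# Uncertainty width: the calibrated P10-to-P90 interval for the same metric.
p10, p90 = 5.2, 15.2       # hypothetical calibrated band
width = p90 - p10

# The headline figure is this ratio, taken at the median band width across authorities.
ratio = shock / width
print(f"shock is {ratio:.0%} of the uncertainty width")  # -> shock is 13% of the uncertainty width
```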
Method: backtest errors as the empirical uncertainty distribution
Backtest errors are not just a scorecard. They can become the empirical uncertainty distribution for future scenario analysis.
The standard use of a backtest is pass/fail. Did the predictions match held-out reality? That answers whether the model worked, but it leaves the residuals on the floor.
A second use treats those residuals as a distribution. How wrong has the model been across boroughs and cycles, in what direction, with what spread? The answer becomes the empirical sample from which future uncertainty bands are drawn. Predictive bands stop being parametric assumptions about how errors should behave. They are bootstrapped from how errors actually have behaved.
This model uses backtests in the second sense. Tier-level, mean-centered historical errors from the 2014→2018 training window and the 2018→2022 backtest are pooled into the bootstrap distribution from which 2026 uncertainty bands are sampled. In practical terms: the model is asking how much movement would count as genuinely unusual relative to the noise it has produced before.
Two design choices shape the calibration.
Errors are pooled at the tier level, not at the borough level. Each borough has 1-2 prior observations, which is too noisy to characterise a residual distribution. Pooling at the tier level (London, Metropolitan, West Yorkshire) keeps a sample large enough to be informative while preserving the structural distinction between geographies that have historically behaved differently.
Errors are mean-centered before sampling. This separates historical bias from uncertainty dispersion. Without centering, S0’s P50 would drift away from zero because of historical mean error, mixing the model’s track record of being slightly off into the median of the band. After centering, the band represents dispersion around the scenario assumption rather than dispersion around the model’s bias.
One nuance worth flagging: mean-centering removes average historical bias but does not force the bootstrap median to equal the point estimate. When residual pools are skewed or bounded (swing_concentration has a lower bound of 1.0), the P50 can still sit slightly off the assumption. Reporting P10/P50/P90 separately, rather than mean ± standard deviation, keeps that asymmetry visible.
The 2,000 draws produce stable percentile estimates while keeping the full output under 10,000 rows for clean Tableau ingestion.
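In outline, the calibration for one authority × scenario × metric cell looks like the sketch below. The residual pool, bound, and function name are illustrative, not the repository code; the actual implementation lives in src/civic_lens/scenario_model.py.

```python
import numpy as np

rng = np.random.default_rng(2026)   # seeded, as the frozen model is
N_DRAWS = 2_000

def calibrate_band(point_estimate, tier_errors, lower_bound=None):
    """Bootstrap a P10/P50/P90 band around one scenario point estimate.

    tier_errors: pooled training + backtest residuals for the authority's tier
    (London, Metropolitan, or West Yorkshire) for one metric. Illustrative only.
    """
    # Mean-centre: remove average historical bias, keep the dispersion.
    centred = tier_errors - tier_errors.mean()

    # Resample the centred residuals and add them to the scenario assumption.
    draws = point_estimate + rng.choice(centred, size=N_DRAWS, replace=True)

    # Respect metric bounds (e.g. swing_concentration has a floor of 1.0);
    # skewed or bounded pools are why the bootstrap P50 can sit slightly
    # off the point estimate even after centring.
    if lower_bound is not None:
        draws = np.clip(draws, lower_bound, None)

    p10, p50, p90 = np.percentile(draws, [10, 50, 90])
    return p10, p50, p90

# Hypothetical usage for one cell of the output grid:
london_errors = np.array([-3.1, 0.4, 2.2, -1.5, 4.8, 0.9])   # illustrative pool
print(calibrate_band(point_estimate=12.0, tier_errors=london_errors))
```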
Data science takeaway: Backtest errors are not just a scorecard. They can become the empirical uncertainty distribution for future scenario analysis, calibrating bands that reflect how the model has actually been wrong.
The result: shocks smaller than uncertainty
Three numbers carry the finding:
- S3 challenger surge: 13% of the median volatility interval.
- S1 volatility continuation: 6%.
- S2 establishment recovery: 5%.
Each number is the scenario shock divided by the median P10-to-P90 band width across the 64 active authorities. The strongest shock, a +4pp challenger surge, moves the central estimate by about one-eighth of the historical noise the model has produced in past cycles.
The result I least expected is the most important one: the scenarios are less separated than the uncertainty bands. If this were a forecast dashboard, that would be disappointing. For a scenario analysis, it is the point.
How to read the chart: each horizontal bar is one authority’s calibrated uncertainty interval. The white dot inside it is the calibrated median. The bar’s colour is geographic, not analytical (teal = London, amber = Metropolitan, slate = West Yorkshire). The amber rings showing each scenario’s point estimate are visible on the rankings panel (Figure 2b); in Figure 1 they are summarised in the inset percentages.
Across 64 authorities and the three active scenarios, the point estimate nearly always sits inside the bar. The shock perturbs the model less than the model has historically perturbed itself.
Part 1 reported that the correlation between turnout change and volatility was statistically null (r = -0.12, p = 0.35). Part 2 finds the same shape: scenario shocks are smaller than the uncertainty around them. The pattern is consistent: when the magnitude of an effect is comparable to or smaller than the noise, ranking the effects creates false precision. The effect-vs-uncertainty comparison is what determines whether a result should be read as signal or as context.
The dashboard does not say “S3 wins.” It says S3 moves the envelope most while still sitting inside broad empirical uncertainty. “Wins” implies the model has chosen between scenarios. It has not. One scenario perturbs the central estimate slightly more than the others; the band around all three remains wide enough to absorb the difference.
Data science takeaway: Always compare effect size to uncertainty width. A scenario shock that looks large in isolation may be small relative to historical error.
Reading the dashboard: geography and rankings
Two views translate the headline into geographic and ranked context.
The map shows the uncertainty footprint for one scenario at a time. Colour encodes P50 under the selected scenario; size encodes interval width. The widest bands are not exclusively in London. Metropolitan boroughs in the North East, North West, and West Yorkshire show interval widths comparable to the densest London cluster.

The rankings view is where the effect-vs-uncertainty comparison becomes hardest to ignore. Each row shows three marks: the bar (P10-P90), the white dot (P50), and the amber ring (scenario point estimate). The amber ring nearly always sits inside the bar, which means the scenario shock is smaller than the historical uncertainty even for the authorities ranked at the top.

Rankings of uncertain estimates need their intervals shown alongside them. A ranked list without uncertainty invites false precision: the reader sees Authority A above Authority B and assumes the model is confident about the order. When the bands overlap, as they do at every level of these rankings, that confidence is unwarranted.
Two asymmetric scenarios, two design lessons
Two of the six scenarios behave differently from the rest. S4 and S5 do not run on the same vote-share-perturbation logic as S1, S2, and S3, and the difference makes them useful design demonstrations beyond the election context.
S4 lesson: isolate one mechanism at a time.
S4 tests a hypothesis from UK turnout literature: that elections in more deprived authorities can show turnout shifts when local salience changes. It applies a +3 percentage point turnout shock to authorities falling in IMD deciles 1-3 under the LAD-level Index of Multiple Deprivation (IMD 2019) overlay. 41 of the 64 active authorities receive the shock; 23 do not. The tier split: 13 of 32 London boroughs, 23 of 27 metropolitan boroughs, all 5 West Yorkshire authorities. Within this scenario scope, the shock concentrates among Metropolitan and West Yorkshire authorities more than among London boroughs.

Vote-share metrics (fragmentation, volatility, swing concentration) are copied from S0 unchanged under S4. The scenario is turnout-only by construction.
That construction is the design lesson. By keeping S4 to a single perturbation channel, the assumption is falsifiable on its own terms. If observed 2026 turnout shifts in IMD-1-to-3 authorities are not in the +3pp range, the assumption fails without dragging the vote-share story with it. A scenario that perturbs three mechanisms simultaneously is harder to learn from when reality disagrees with it. You cannot tell which assumption broke.
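A minimal sketch of that single-channel construction, assuming a pandas frame of baseline (S0) outputs and a LAD-level IMD decile series; the column names are placeholders, not the repository schema.

```python
import pandas as pd

def apply_s4(s0: pd.DataFrame, imd_decile: pd.Series) -> pd.DataFrame:
    """Hypothetical sketch of the S4 construction: perturb turnout only.

    s0         : baseline outputs, one row per authority (assumed columns)
    imd_decile : LAD-level IMD 2019 decile, indexed like s0
    """
    s4 = s0.copy()

    # Single perturbation channel: +3pp turnout in IMD deciles 1-3 only.
    in_scope = imd_decile <= 3
    s4.loc[in_scope, "turnout_delta"] = s4.loc[in_scope, "turnout_delta"] + 3.0

    # Vote-share metrics (fragmentation, volatility, swing concentration) are
    # copied from S0 unchanged -- nothing else is touched, so the scenario
    # stays turnout-only by construction and fails or holds on that one assumption.
    return s4
```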
S5 lesson: log guardrails even when they do not bind.
S5 caps the upper tail of London volatility_score at 39.45. The cap is the empirical 90th percentile of historical London borough volatility across the training and backtest windows: 64 London borough observations (32 from training, 32 from backtest, City of London excluded because it sits outside the 32-borough London electoral scope). The cap is one-sided, applies only to London, and constrains the P90 only.
In the frozen run, the maximum London S5 P90 is 16.70. That is 42% of the cap, with 22.75 units of headroom. The cap binds zero times.
S5 is a guardrail, not an adjustment. It would have constrained the upper tail of London volatility if any borough had exceeded historical levels. None did. The value lies in being logged. A stress test that does not bind is still useful provenance: it shows the analyst considered the failure mode, parameterised the constraint from data, and reported that the constraint was inactive. Removing the cap from the documentation because it did not fire would erase the analytical decision that was made.
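What the guardrail does, sketched with assumed column names; the point is that the clipping and the logging happen whether or not the cap ever binds.

```python
import pandas as pd

LONDON_VOLATILITY_CAP = 39.45   # empirical 90th percentile of historical London volatility

def apply_s5_cap(bands: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the S5 guardrail: a one-sided cap on London P90 only.

    bands: calibrated rows with 'tier', 'metric', and 'p90' columns (assumed names).
    """
    capped = bands.copy()
    mask = (capped["tier"] == "London") & (capped["metric"] == "volatility_score")

    # One-sided: only the upper tail is constrained; P10 and P50 are untouched.
    capped.loc[mask, "p90"] = capped.loc[mask, "p90"].clip(upper=LONDON_VOLATILITY_CAP)

    # Log whether the guardrail bound anything, even when it does not.
    n_bound = int((bands.loc[mask, "p90"] > LONDON_VOLATILITY_CAP).sum())
    print(f"S5 cap bound {n_bound} London rows")   # frozen run: 0
    return capped
```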
Reproducibility and limitations
The model is frozen, seeded, hashed, and reproducible from the repository. Re-running src/civic_lens/scenario_model.py against the locked commit reproduces the output bit-for-bit.
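One way to make the bit-for-bit claim checkable is a content hash of the frozen outputs, compared against the hash published on the provenance dashboard. A minimal sketch; the file path and published-hash variable are placeholders, not the repository's actual provenance tooling.

```python
import hashlib

def file_sha256(path: str) -> str:
    """Hash a frozen output file so any re-run can be verified bit-for-bit."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage: compare a re-run against the published provenance hash.
# assert file_sha256("outputs/scenario_model_2026.csv") == PUBLISHED_HASH
```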

One known limitation is documented on the dashboard alongside the result. The training window predates Reform UK’s 2025-2026 expansion, so right-wing challenger volatility may be understated under a hypothesis where Reform behaves differently from prior insurgent parties at scale.
All underlying data is openly licensed: election results from the DCLEAPIL v1.0 dataset (Leman 2025, CC BY-SA 4.0); turnout and 2022 cross-checks from the Commons Library local elections dataset (Open Parliament Licence v3.0); deprivation and geography from ONS / MHCLG (OGL v3). The pipeline code in the Civic Lens repository is MIT-licensed; derived data are published with source attribution and remain subject to upstream licences.
Data science takeaway: A model is more trustworthy when its outputs are frozen, hashed, and reproducible. Provenance is part of the analysis. Limitations should be visible on the same screen as the headline number.
What scenario analysis teaches us
The transferable skill is not election modelling. It is building scenario systems where assumptions are visible, uncertainty is calibrated against historical error, and effect sizes are reported alongside the noise that surrounds them. The same pattern shows up in demand forecasts under price-change scenarios, public health policy stress tests, and risk models where regulator-imposed shocks are smaller than realised market volatility. Rank scenarios without showing the uncertainty around them and you produce false precision. That is the trap.
The model does not say what will happen in May 2026. It says what would be surprising relative to calibrated uncertainty. Three things to watch on results night and the days after:
- Whether challenger surges exceed the S3 envelope. If realised volatility in challenger-active boroughs exceeds the S3 P90 bands shown on the dashboard, the calibrated band has been breached and the model needs retraining. This is the most likely place for the model to break, because Reform UK’s post-2024 trajectory is unprecedented in the training window.
- Whether London volatility breaches the historical upper-tail cap. The S5 cap of 39.45 is the empirical 90th percentile across 64 historical London observations. A single 2026 borough exceeding it would clear the historical upper-tail threshold. Two or more would be a meaningful break with the historical distribution.
- Whether deprivation-linked turnout shifts materialise in the direction S4 assumes. A clean test of one isolated mechanism, with vote-share metrics held constant. If turnout in IMD-1-to-3 authorities does not move in the +3pp range, the S4 hypothesis fails on its own terms.
What happens after May 7
The model is already frozen. The hashes, RNG seed, and code commit shown on the provenance dashboard cannot change between now and election night. Whatever the calibrated bands say today is what they will say when realised results land.
Part 3 of this series will be a public accuracy audit. Frozen scenario outputs will be tested against actual 2026 borough-level results. Coverage rates (did P10-P90 contain the realised value?), mean absolute error, ranking quality, and any systematic misses will all be reported, including the failures. The methodology caveat about Reform UK is the most likely failure mode; we will see whether the bands held.
That is what the freeze enables. The “three things to watch” above are not rhetorical. They are the falsification criteria for an uncertainty model published before its data existed.
The most honest result is not a prediction. It is a warning about precision. The scenarios move the envelope, but historical uncertainty is still wider than the shocks.
For data scientists, that may be the main lesson: scenario analysis is most useful when it resists becoming a forecast.
The full interactive dashboard is published on Tableau Public. The pipeline, scenario model code, calculated fields, and Tableau build guide are open-source at github.com/Wisabi-Analytics/civic-lens.
Obinna Iheanachor is a Senior AI/Data Engineer and founder of Wisabi Analytics, a UK-based data engineering and AI consultancy. He creates content around production AI systems, data pipelines, and applied analytics at @DataSenseiObi on X and Wisabi Analytics on YouTube. Civic Lens is an open-source political data project at github.com/Wisabi-Analytics/civic-lens.