gridfloor

interactive walk-through · real EIA-930 demand + real model outputs · 2025 hold-out · hover the charts

1 · What the data looks like

Each balancing authority (BA) reports its electricity demand every hour. The job: given the past week, predict the next day. Below are two real weeks (July 2025) of hourly demand. Pick a BA and notice that every one has a strong daily cycle (overnight trough, daytime peak) and a weekly cycle (weekends are lower).
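The "past week in, next day out" framing can be sketched as a sliding-window split. This is a minimal illustration, not the project's actual data loader: the function name `make_windows` is hypothetical, the 168-hour context and 24-hour horizon follow the description above, and the one-day stride is an assumption.

```python
def make_windows(series, context=168, horizon=24, stride=24):
    """Split an hourly series into (past week, next day) training pairs."""
    pairs = []
    last_start = len(series) - context - horizon
    for start in range(0, last_start + 1, stride):
        past = series[start:start + context]
        future = series[start + context:start + context + horizon]
        pairs.append((past, future))
    return pairs

# Two weeks of hourly data (336 points), like the chart above.
demo = list(range(336))
windows = make_windows(demo)
print(len(windows), "day-ahead examples")
```

Each pair holds 168 hours of history and the 24 hours that immediately follow it.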

What to notice. The shape is extremely regular — which is exactly why this is hard to improve. A model that just repeats last week is already ~90% right. The whole game is the last few percent: weather-driven shifts in the height and timing of the daily peak.

2 · What the model predicts

Now overlay the model's forecast (green) and its 80% prediction interval (shaded). The forecast tracks the actual so tightly the lines nearly coincide — that's what ~1% MAPE looks like. Toggle the interval and the error trace; hover any hour to read exact values.
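One way to sanity-check a shaded band like this is empirical coverage: an 80% interval is well calibrated if roughly 80% of the actual values fall inside it. The sketch below assumes the interval is available as per-hour lower and upper bounds; the function name and the toy numbers are illustrative, not model outputs.

```python
def interval_coverage(actual, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    inside = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return inside / len(actual)

# Toy values only: three of the five hours sit inside the band.
actual = [100.0, 105.0, 98.0, 120.0, 110.0]
lower = [95.0, 100.0, 99.0, 110.0, 100.0]
upper = [105.0, 110.0, 104.0, 118.0, 115.0]
coverage = interval_coverage(actual, lower, upper)
```

A coverage well below the nominal level means the band is too narrow; well above means it is too wide.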

What to notice. The error trace (×10 so you can see it) is mostly flat noise — except around the daily peaks, where it spikes. The model's residual error lives almost entirely in peak height & timing. On NEVP you can also see two spikes to zero: those are EIA-930 telemetry dropouts (the meter, not the model, failed) — exactly the artifacts our LDWP cleaning rule masks.
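Because the residual concentrates at the daily peak, it can help to score peak height and peak timing separately from the hourly error. The following is a rough sketch under the assumption that each day is a list of 24 hourly values; `peak_errors` is a hypothetical helper, not part of the project's code.

```python
def peak_errors(actual_day, forecast_day):
    """Return (peak height error in %, peak timing error in hours) for one day."""
    a_hour = max(range(len(actual_day)), key=lambda h: actual_day[h])
    f_hour = max(range(len(forecast_day)), key=lambda h: forecast_day[h])
    height_pct = 100.0 * (forecast_day[f_hour] - actual_day[a_hour]) / actual_day[a_hour]
    return height_pct, f_hour - a_hour

# Toy day: the actual peak is 100 at hour 17; the forecast peaks at 98 one hour early.
actual_day = [50.0] * 24
actual_day[17] = 100.0
forecast_day = [50.0] * 24
forecast_day[16] = 98.0
height_pct, timing_hours = peak_errors(actual_day, forecast_day)
```

A negative height error means the forecast undershoots the peak; a negative timing error means it places the peak too early.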

3 · What does "1.03% MAPE" actually mean?

MAPE = mean absolute percentage error. 1.03% on the held-out 2025 BAs means: on a typical 100 GW grid, the average hourly miss is ~1 GW, roughly one large power plant. For comparison, naively repeating yesterday (persistence-24h) scores 6.2% MAPE, and repeating last week scores 9.9%.

1.03% · SOTA (this model)

6.22% · repeat yesterday

9.87% · repeat last week

~0.85% · scaling-law floor E∞
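The metric and the two naive baselines quoted above are simple enough to write down. This is a minimal sketch with toy numbers; the helper names are illustrative, and a short lag of 4 stands in for the real lags of 24 hours (yesterday) and 168 hours (last week).

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (assumes no zero actuals)."""
    errors = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

def persistence(series, lag):
    """Forecast each hour as the value `lag` hours earlier; aligns with series[lag:]."""
    return series[:-lag]

# Toy series with a period of 4; the real baselines use lag=24 and lag=168.
series = [100.0, 110.0, 120.0, 110.0, 100.0, 110.0, 121.0, 110.0]
lag = 4
score = mape(series[lag:], persistence(series, lag))

# The "1 GW" reading: 1.03% of a 100 GW load.
typical_miss_gw = 0.0103 * 100.0
```

Only one of the four scored hours differs here (121 vs 120), so the toy score is a fraction of a percent.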

How to think about it. We're ~0.2pp above the achievable floor for this architecture (1.03% vs ~0.85%). The fitted scaling law says no amount of extra parameters or epochs reaches lower; only more in-distribution data does, and that's nearly exhausted. So "can we get to 0.9%?" is a data-acquisition question, not a modeling one.
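To make the idea of a floor concrete, here is a sketch of a saturating power law. The functional form is an assumption (the page does not state which form was fitted), and `A` and `ALPHA` are hypothetical values chosen only for illustration; only the ~0.85% floor comes from the text above.

```python
def scaling_law(n, e_inf, a, alpha):
    """Assumed saturating power law: error = floor + a * n ** (-alpha)."""
    return e_inf + a * n ** (-alpha)

E_INF = 0.85          # floor quoted above, in % MAPE
A, ALPHA = 50.0, 0.4  # hypothetical values, not the project's fitted parameters

for n in (1e5, 1e6, 1e7, 1e9):
    print(f"{n:.0e} params -> {scaling_law(n, E_INF, A, ALPHA):.3f}% MAPE")
```

Under this form the error keeps shrinking as `n` grows but never drops below `E_INF`, which is why extra capacity alone cannot reach the floor.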

3½ · vs the real incumbent — same eval, head to head

"Better than baseline" only means something if the comparison is fair. The strongest prior SOTA isn't a paper — it's each BA's own operational day-ahead forecast, which EIA-930 publishes alongside actuals. We score it on the identical protocol: same 7 of 8 holdout BAs (LDWP excluded for telemetry artifacts), same 2025-Feb–Dec window, same MAPE-on-(obs>0) definition, same one-shot split. Our model wins on every BA.

Why this is the right comparison. Citing a paper's MAPE is apples-to-oranges — different BAs, horizon, test year, and even MAPE definition (some use sMAPE/WAPE). The incumbent operational forecast sidesteps all of that: it predicts the same series we do, for the same hours, and is what a utility actually dispatches against today. On its well-behaved BAs (BPAT, DUK, TVA at ~2–2.5%) we're 3–4× better; its high values (PSEI 29%, AZPS 11%) partly reflect gaps in the native forecast feed itself.
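The scoring protocol described above can be sketched as follows: MAPE is computed only over hours where the observation is positive, LDWP is excluded, and both forecasts are scored against the same actuals. The data layout (dicts of per-BA hourly lists), the function names, and the numbers are assumptions for illustration, not the project's evaluation code or results.

```python
def masked_mape(actual, predicted):
    """MAPE in percent over hours where the observation is > 0."""
    errors = [abs(p - a) / a for a, p in zip(actual, predicted) if a > 0]
    return 100.0 * sum(errors) / len(errors) if errors else float("nan")

def head_to_head(actuals, ours, incumbent, exclude=("LDWP",)):
    """Per-BA (our MAPE, incumbent MAPE) on identical hours and the same metric."""
    rows = {}
    for ba, obs in actuals.items():
        if ba in exclude:
            continue
        rows[ba] = (masked_mape(obs, ours[ba]), masked_mape(obs, incumbent[ba]))
    return rows

# Toy values only. The 0.0 mimics a telemetry dropout and is skipped by the mask.
actuals = {"TVA": [100.0, 0.0, 200.0], "LDWP": [50.0, 50.0, 50.0]}
ours = {"TVA": [101.0, 90.0, 198.0], "LDWP": [50.0, 50.0, 50.0]}
incumbent = {"TVA": [104.0, 95.0, 190.0], "LDWP": [50.0, 50.0, 50.0]}
table = head_to_head(actuals, ours, incumbent)
```

Masking on the observation rather than on either forecast keeps the scored hours identical for both contenders.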

4 · Does the accuracy actually matter? (battery dispatch)

A natural worry: maybe 1% vs 6% is academic — if a battery operator just needs the daily shape, persistence already gives that. We tested it. A 4-hour battery charges in the cheapest hours and discharges in the priciest, scheduled by each signal; value is measured against true prices. Accuracy pays.
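A stripped-down version of that experiment looks like this. It is a sketch under strong simplifying assumptions (no round-trip losses, no requirement that charging precede discharging, one cycle per day) and is not either of the project's dispatch models; `dispatch_value` and the price numbers are illustrative.

```python
def dispatch_value(signal, true_price, hours=4):
    """Charge in the `hours` lowest-signal hours, discharge in the highest,
    then settle the schedule at the true prices."""
    order = sorted(range(len(signal)), key=lambda h: signal[h])
    charge, discharge = order[:hours], order[-hours:]
    return sum(true_price[h] for h in discharge) - sum(true_price[h] for h in charge)

# Toy day: cheap at hours 2-5, expensive at hours 17-20.
true_price = [20.0] * 24
for h in range(2, 6):
    true_price[h] = 10.0
for h in range(17, 21):
    true_price[h] = 60.0

# A signal that finds the trough but places the peak two hours early.
early_signal = [20.0] * 24
for h in range(2, 6):
    early_signal[h] = 10.0
for h in range(15, 19):
    early_signal[h] = 60.0

oracle_value = dispatch_value(true_price, true_price)
early_value = dispatch_value(early_signal, true_price)
```

In this toy case a two-hour error in peak timing costs 40% of the attainable value even though the daily shape is otherwise right.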

The reframing. An earlier (uncommitted) analysis claimed the forecast added only 0.8% of dispatch value over persistence — i.e. "MAPE is useless." A clean recompute refuted it: under both dispatch models the forecast captures 86–89% of the value persistence leaves on the table. Getting the exact daily peak hour right — which weather moves day to day — is worth a lot.
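One plausible reading of "the value persistence leaves on the table" is the gap between a perfect-foresight schedule and the persistence schedule, with the forecast's share of that gap as the headline number. That definition is an assumption, as are the function name and the toy values below.

```python
def captured_fraction(v_forecast, v_persistence, v_perfect):
    """Share of the (perfect - persistence) value gap recovered by the forecast."""
    gap = v_perfect - v_persistence
    if gap <= 0:
        raise ValueError("persistence already matches or beats perfect foresight")
    return (v_forecast - v_persistence) / gap

# Toy dispatch values in arbitrary currency units, not project results.
fraction = captured_fraction(v_forecast=180.0, v_persistence=120.0, v_perfect=200.0)
```

Normalizing by the gap rather than by total value is what separates a figure like this from the earlier 0.8% claim, which compared against the full dispatch value.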

5 · Why don't foundation models win?

Time-series foundation models (Chronos-2, Sundial, TabPFN-TS) are the obvious thing to try. They can't compete — even fine-tuned on this exact data they plateau near 3–4%.

Why. Foundation TS models are univariate — one series at a time. Our model ingests all 42 BAs jointly as cross-attention tokens, so it exploits the cross-BA correlation (when the Southwest bakes, neighbors move together). That inductive bias is the entire game; a 200M-param univariate model can't recover what a 2.2M-param multivariate one gets for free.

Here's the structural difference, drawn. Left: our permutation-invariant iTransformer — every BA is a token, and cross-attention lets each one read every other (hover a token to see its links). Right: a univariate foundation model sees one series in isolation.
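The left-hand picture can be approximated in a few lines: treat each BA as one token and let every token attend to all the others. This is a deliberately simplified sketch (single head, identity projections, 2-dimensional toy tokens) rather than the model's actual layer, but it shows the property that matters: reordering the BAs reorders the outputs without changing any BA's result.

```python
import math

def attend(tokens):
    """Single-head dot-product attention across tokens, identity projections."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [sum(w / total * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        out.append(mixed)
    return out

# Three toy "BA tokens"; in the real model each token embeds a BA's demand history.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attend(tokens)

# Shuffle the BA order and attend again.
perm = [2, 0, 1]
out_perm = attend([tokens[i] for i in perm])
```

Here `out_perm[j]` matches `out[perm[j]]` up to floating-point rounding, so no BA depends on its position in the input.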

perm-iTransformer (ours) · 2.2M · 1.03%

foundation model · 200M · 3.3–4.0%

6 · The graveyard — 13 things that didn't work

The single most useful artifact of the project: a documented list of what fails, since negatives rarely get published. The throughline — the model already extracts everything from demand history + cross-BA structure; redundant signal doesn't help, and mismatched data hurts.