day-ahead electricity-demand forecasting on US balancing authorities · pushed to its scaling-law floor · with honest comparisons against what it replaces
We forecast hourly electricity demand a day ahead for US balancing authorities (BAs), the entities that must physically match generation to load in real time. The headline number is 1.02% MAPE on seven BAs the model never trained on, scored once on 2025. Each BA's own operational forecast (what utilities run today, published in EIA-930) scores 9.04% on the identical window. Independently, in a controlled head-to-head against the strongest recent published benchmark [Hong & Lee 2026] on its own protocol, GridFloor comes in best of three, with small but CI-separated margins.
| configuration | MAPE % | what it is |
|---|---|---|
| GridFloor (ours, multi-seed) | 1.03 | L × 80ep × XBRL, 5-seed mean ±0.01 |
| incumbent (EIA-930 native) | 9.04 | each BA's own published day-ahead forecast, same eval |
| fitted scaling-law floor E∞ | 0.85 | where the curve flattens (~0.2pp below us) |
| documented dead ends | 13 (count, not MAPE) | interventions that didn't move the metric |
New to this? One-page primer — the problem, the result, the glossary (BA, EIA-930, MAPE, foundation models), why it matters.
22 runs with dates, motivation, what broke. The progress chart + every contender on one identical eval. Anti-pollution guarantees.
How GridFloor sits against ~25 external / operational results, tagged by what's actually comparable. Includes the controlled head-to-head.
Interactive: real demand data, forecast + intervals, dispatch-value bars, the architecture diagram, the system DAG.
Pipeline as a node graph. Click any node for a drawer with what it is, what the number means, the script behind it.
Compact one-screen TL;DR — SOTA, scaling law, three findings, the graveyard.
The specific human calls that steered the autonomous agents — and which ones changed the result.
15-page writeup: intro, related work, controlled head-to-head, three findings, system diagram, discussion, 24 citations.
It beats what utilities run today. Grid operators commit generation a day ahead against a load forecast, and both under- and over-forecasting cost money (peakers vs. held spinning reserves). The 1.02% vs 9.04% gap is roughly 9× overall, and GridFloor is still 3–4× better even on the incumbent's cleanest BAs (BPAT 0.55 vs 2.0, DUK 0.64 vs 2.5, TVA 0.68 vs 2.2). Lower error means fewer reserve plants kept "just in case," lower cost, lower emissions.
The accuracy converts to operational value. A natural worry is that 1% vs 6% MAPE is academic. We tested it with a 4-hour battery: under both peak-shaving and price-arbitrage dispatch, the forecast captures 86–89% of the perfect-foresight value that 24-hour persistence leaves on the table. Getting the daily peak hour right — which weather shifts day to day — is worth real money.
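The "share of perfect-foresight value that persistence leaves on the table" can be read as a gap-closure ratio: (forecast value − persistence value) / (perfect value − persistence value). The sketch below illustrates that ratio on a toy peak-shaving problem. Everything in it is an assumption for illustration: the function names, the 8-step load profile, the 2-step discharge window, and the idealized battery are hypothetical and are not the dispatch model behind the 86–89% figure.

```python
def best_window_start(series, hours):
    """Index where the `hours`-long window with the largest total load begins."""
    sums = [sum(series[i:i + hours]) for i in range(len(series) - hours + 1)]
    return max(range(len(sums)), key=sums.__getitem__)


def shaving_value(actual, plan_from, hours, power):
    """Peak reduction on `actual` when the discharge window is chosen from `plan_from`."""
    start = best_window_start(plan_from, hours)
    shaved = list(actual)
    for h in range(start, start + hours):
        shaved[h] -= power  # idealized battery: constant discharge, no losses
    return max(actual) - max(shaved)


def gap_captured(v_forecast, v_persistence, v_perfect):
    """Share of the persistence-to-perfect-foresight gap that the forecast closes."""
    return (v_forecast - v_persistence) / (v_perfect - v_persistence)


# Toy 8-step day (hypothetical numbers, not GridFloor data).
actual = [5, 5, 5, 6, 9, 10, 6, 5]
yesterday = [5, 6, 9, 10, 6, 5, 5, 5]   # persistence: replay the previous day
forecast = [5, 5, 5, 6, 7, 10, 9, 5]    # right peak step, window one step late

v_perfect = shaving_value(actual, actual, hours=2, power=2)
v_persist = shaving_value(actual, yesterday, hours=2, power=2)
v_fcst = shaving_value(actual, forecast, hours=2, power=2)
share = gap_captured(v_fcst, v_persist, v_perfect)
```

In this toy case persistence discharges at yesterday's peak and shaves nothing off today's, while the forecast places the window close enough to recover part of the achievable reduction.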
Finding the right inductive bias. A 2.2M-parameter purpose-built multivariate transformer beats 200M+ time-series foundation models (Chronos-2, Sundial, TabPFN-TS) by 2–3pp. Load forecasting is fundamentally cross-region: neighboring BAs co-move under shared weather and economic cycles, and the permutation-invariant iTransformer treats every BA as a token and attends across BAs. Univariate foundation models can't encode that, and no amount of scale recovers it.
Quantifying the ceiling. A Chinchilla-style fit over a 4×4 size×data sweep puts the floor at E∞ ≈ 0.85% and shows the model-size axis is dead while the data axis still has slope (β ≈ 0.5). That turns "should we keep trying?" into a number, and says the remaining gain is a data-acquisition question, not a modeling one.
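For readers unfamiliar with the form, a Chinchilla-style law models error as E + A/N^α + B/D^β, where N is model size, D is data size, and E is the irreducible floor. The sketch below is a simplified, one-axis illustration: it assumes the model-size term can be dropped (in the spirit of "the model-size axis is dead") and fits err(D) = E + B·D^(−β) by scanning candidate floors and running a log-linear regression for each. The function name, the grid, and the synthetic points (generated from E = 0.85, B = 40, β = 0.5) are assumptions for illustration, not the site's fitting script or its sweep data.

```python
import math


def fit_floor(data_sizes, errors, floor_grid):
    """Fit err(D) = E + B * D**(-beta): scan E, regress log(err - E) on log(D)."""
    xs = [math.log(d) for d in data_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    best = None
    for floor in floor_grid:
        if floor >= min(errors):
            continue  # log(err - E) is undefined at or above the smallest error
        ys = [math.log(e - floor) for e in errors]
        mean_y = sum(ys) / n
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, floor, math.exp(intercept), -slope)
    _, floor, scale, beta = best
    return floor, scale, beta


# Synthetic, noise-free points (illustrative values only).
data_sizes = [1e3, 4e3, 1.6e4, 6.4e4]
errors = [0.85 + 40.0 * d ** -0.5 for d in data_sizes]
floor_grid = [i / 100 for i in range(0, 120)]  # candidate floors 0.00 .. 1.19

E_hat, B_hat, beta_hat = fit_floor(data_sizes, errors, floor_grid)
```

A real fit over a size×data grid would keep the A/N^α term and use a nonlinear optimizer; the grid scan here just makes the role of the floor E explicit.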
A discipline that refuted itself. The most exciting interim finding — "MAPE is operationally worthless" — died when we required a committed recompute before publishing. Under two dispatch models the forecast clearly pays. The rule "no result counts until it reproduces from a committed script" is what makes every number on this site defensible.
I set direction in plain language, and an orchestrating agent decomposed each call into briefs for specialized sub-agents. Those sub-agents scraped data, wrote training and eval scripts, dispatched GPU jobs, ran paired-bootstrap tests, and committed results with SHIP/KEEP/DROP verdicts. The directions page traces the specific calls that shaped the outcome: multi-seed everything, recompute the dispatch finding from a committed script, and match the eval to prior SOTA exactly.
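As a rough illustration of what a paired-bootstrap test involves, the sketch below resamples per-item error differences between two models (with replacement, keeping each pair together) and reports a percentile confidence interval on the mean difference. The function name, the resample count, the 95% level, and the toy per-BA errors are all assumptions; the project's actual test script may pair and resample differently.

```python
import random


def paired_bootstrap_ci(err_a, err_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(err_b - err_a) under paired resampling."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(err_a, err_b)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample pairs with replacement so each draw keeps A and B aligned.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical per-BA MAPE values for two models (not results from this project).
model_a = [0.55, 0.64, 0.68, 1.10, 1.25, 1.40, 1.52]
model_b = [0.70, 0.75, 0.90, 1.30, 1.35, 1.70, 1.60]

ci_low, ci_high = paired_bootstrap_ci(model_a, model_b)
```

If the whole interval sits above zero, model A's advantage is separated from zero at the chosen level, which is the sense of "CI-separated" used here.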
All numbers reproduce from committed scripts. The eval protocol was held fixed throughout: 8 holdout BAs never trained on, Sep–Oct 2024 validation, a one-shot 2025 test, seed 42 with multi-seed confirmation for headline claims, and MAPE computed only over hours with observed demand > 0.
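A minimal sketch of the metric as described, assuming plain Python lists of hourly values: MAPE in percent, skipping any hour whose observed demand is not positive so the division is always defined. The function name and the sample values are hypothetical.

```python
def mape_percent(observed, predicted):
    """Mean absolute percentage error, in percent, over hours with observed > 0."""
    pairs = [(o, p) for o, p in zip(observed, predicted) if o > 0]
    if not pairs:
        raise ValueError("no hours with observed demand > 0")
    return 100.0 * sum(abs(p - o) / o for o, p in pairs) / len(pairs)


# The middle hour has observed == 0 and is excluded from the average.
observed = [100.0, 0.0, 200.0]
predicted = [101.0, 5.0, 198.0]
score = mape_percent(observed, predicted)  # about 1.0 (each kept hour is 1% off)
```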