gridfloor

day-ahead electricity-demand forecasting on US balancing authorities · pushed to its scaling-law floor · with honest comparisons against what it replaces

We forecast hourly electricity demand a day ahead for US balancing authorities (BAs), the entities that must physically match generation to load in real time. The headline number is 1.02% MAPE on seven BAs the model never trained on, scored once on 2025. Each BA's own operational forecast (what utilities run today, published in EIA-930) scores 9.04% on the identical window. Separately, in a controlled head-to-head run on the protocol of the strongest recent published benchmark [Hong & Lee 2026], GridFloor ranks best of three, with small but CI-separated margins.

configuration	MAPE %	what it is
GridFloor (ours, multi-seed)	1.03	L × 80ep × XBRL, 5-seed mean ±0.01
incumbent (EIA-930 native)	9.04	each BA's own published day-ahead forecast, same eval
fitted scaling-law floor E∞	0.85	where the curve flattens (~0.2pp below us)
documented dead ends	13	interventions that didn't move the metric

The result that matters most isn't the 1%. It's that the model beats each balancing authority's own operational day-ahead forecast on every BA, scored identically, and that the accuracy translates into 86–89% of perfect-foresight battery-dispatch value above the persistence floor.

Pages

cheatsheet

New to this? One-page primer covering the problem, the result, the glossary (BA, EIA-930, MAPE, foundation models), and why it matters.

experiments

22 runs with dates, motivation, what broke. The progress chart + every contender on one identical eval. Anti-pollution guarantees.

literature

How GridFloor sits against 32 external / operational results, tagged by what's actually comparable. Includes the controlled head-to-head.

explorer

Interactive: real demand data, forecast + intervals, dispatch-value bars, the architecture diagram, the system DAG.

flowchart

Pipeline as a node graph. Click any node for a drawer with what it is, what the number means, the script behind it.

scoreboard

Compact one-screen TL;DR: SOTA, scaling law, three findings, the graveyard.

directions

The specific human calls that steered the autonomous agents, and which ones changed the result.

paper (PDF) ↗

16-page writeup: intro, related work, controlled head-to-head (incumbent + 7-FM table), three findings, dispatch dollars at 2/4/8h durations, system diagram, discussion, 26 citations.

Why this is useful in the real world

It beats what utilities run today. Grid operators commit generation a day ahead against a load forecast; under- and over-forecasting both cost money (peakers vs. held spinning reserves). Overall, 1.02% against 9.04% is roughly a 9× reduction in error, and the model is still 3–4× better on the incumbent's cleanest BAs (BPAT 0.55 vs 2.0, DUK 0.64 vs 2.5, TVA 0.68 vs 2.2). Lower error means fewer reserve plants kept "just in case," lower cost, lower emissions.

The accuracy converts to operational value. A natural worry is that 1% vs 9% MAPE is academic. We tested it with a 4-hour battery: under both peak-shaving and price-arbitrage dispatch, the forecast captures 86–89% of the perfect-foresight value that 24-hour persistence leaves on the table. Getting the daily peak hour right, which weather shifts day to day, is worth real money.
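The "share of perfect-foresight value above the persistence floor" can be made concrete with a toy peak-shaving calculation. This is a minimal sketch under stated assumptions, not the project's dispatch model: the function names, the 1-unit battery power, the rule "discharge in the top-N planned hours", and the synthetic load shapes are all hypothetical, and state of charge, efficiency, and prices are ignored.

```python
import numpy as np

def peak_shave_value(actual, plan_signal, hours=4, power=1.0):
    """Reduction in the realized daily peak when a battery discharges `power`
    during the `hours` highest hours of `plan_signal` (the series used to plan).
    Toy rule (assumption): no state-of-charge or efficiency modeling."""
    actual = np.asarray(actual, dtype=float)
    plan_signal = np.asarray(plan_signal, dtype=float)
    discharge = np.zeros_like(actual)
    discharge[np.argsort(plan_signal)[-hours:]] = power
    return actual.max() - (actual - discharge).max()

def value_capture(v_forecast, v_persistence, v_perfect):
    """Fraction of the gap between persistence and perfect foresight
    that the forecast-driven dispatch recovers."""
    gap = v_perfect - v_persistence
    if gap <= 0:
        return float("nan")  # persistence already matches perfect foresight
    return (v_forecast - v_persistence) / gap

# Synthetic day (illustrative only): today's peak is at hours 16-19,
# yesterday's was at hours 12-15, and the forecast places it correctly.
actual = np.full(24, 10.0)
actual[16:20] = [14.0, 15.0, 15.0, 14.0]
yesterday = np.full(24, 10.0)
yesterday[12:16] = [14.0, 15.0, 15.0, 14.0]
forecast = np.full(24, 10.0)
forecast[16:20] = [13.8, 15.1, 14.9, 14.2]

v_perfect = peak_shave_value(actual, actual)         # plan with the truth
v_persist = peak_shave_value(actual, yesterday)      # plan with 24-hour persistence
v_forecast = peak_shave_value(actual, forecast)      # plan with the forecast
capture = value_capture(v_forecast, v_persist, v_perfect)
```

In this toy day persistence discharges at the wrong hours and shaves nothing, while the forecast recovers the full perfect-foresight value. The point of the normalization is that a forecast is only credited for value beyond what persistence already delivers.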

What was technically hard

Finding the right inductive bias. A 2.2M-parameter purpose-built multivariate transformer beats 200M+ time-series foundation models (Chronos-2, Sundial, TabPFN-TS) by 2–3pp. Load forecasting is fundamentally cross-region: neighboring BAs co-move under shared weather and economic cycles, and the permutation-invariant iTransformer treats every BA as a token and attends across them. Univariate foundation models can't encode that, and no amount of scale recovers it.
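To illustrate the "one token per BA" idea, here is a minimal single-head sketch in NumPy. It is not the GridFloor architecture: the weights are random, the shapes (5 BAs, 168-hour lookback, 16-dim tokens) are hypothetical, and there are no feed-forward layers, normalization, or forecast head. It shows only that attention across BA tokens has no notion of BA order, so reordering the inputs just reorders the outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def variate_token_attention(x, W_embed, W_q, W_k, W_v):
    """x has shape (n_ba, lookback). Each BA's whole history becomes one token,
    then every BA token attends to every other BA token."""
    tokens = x @ W_embed                               # (n_ba, d_model)
    q, k, v = tokens @ W_q, tokens @ W_k, tokens @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (n_ba, n_ba) BA-to-BA weights
    return softmax(scores, axis=-1) @ v                # (n_ba, d_model)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_ba, lookback, d_model = 5, 168, 16
x = rng.normal(size=(n_ba, lookback))
W_embed = rng.normal(size=(lookback, d_model)) / np.sqrt(lookback)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3))

out = variate_token_attention(x, W_embed, W_q, W_k, W_v)
perm = rng.permutation(n_ba)
out_perm = variate_token_attention(x[perm], W_embed, W_q, W_k, W_v)
order_free = np.allclose(out_perm, out[perm])  # shuffling BAs only shuffles the rows
```

A univariate model sees each row of `x` in isolation, so the `scores` matrix, which is where co-movement between neighboring BAs can be expressed, has no counterpart.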

Quantifying the ceiling. A Chinchilla-style fit over a 4×4 size×data sweep puts the floor at E∞ ≈ 0.85% and shows the model-size axis is dead while the data axis still has slope (β ≈ 0.5). That turns "should we keep trying?" into a number, and says the remaining gain is a data-acquisition question, not a modeling one.
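A rough sketch of how a floor like E∞ can be estimated, simplified to the data axis only (a reasonable reduction if the model-size term is flat, but an assumption here). The functional form E(D) = E∞ + B·D^(−β), the grid-search fitting procedure, and the numbers below are illustrative: the data points are synthetic, generated from E∞ = 0.85 and β = 0.5, and are not the project's sweep.

```python
import numpy as np

def fit_data_floor(D, err, n_grid=2000):
    """Fit err = E_inf + B * D**(-beta) by scanning candidate floors.
    For a fixed E_inf, log(err - E_inf) is linear in log(D), so each
    candidate reduces to an ordinary least-squares line fit."""
    D = np.asarray(D, dtype=float)
    err = np.asarray(err, dtype=float)
    log_d = np.log(D)
    best = None
    # The floor must sit strictly below the smallest observed error.
    for e_inf in np.linspace(0.0, err.min() * 0.999, n_grid):
        y = np.log(err - e_inf)
        slope, intercept = np.polyfit(log_d, y, 1)
        resid = y - (slope * log_d + intercept)
        sse = float(resid @ resid)
        if best is None or sse < best[0]:
            best = (sse, e_inf, float(np.exp(intercept)), float(-slope))
    _, e_inf, B, beta = best
    return e_inf, B, beta

# Synthetic, noise-free points (NOT measured results): four data sizes,
# errors generated from E_inf = 0.85, B = 20, beta = 0.5.
D = np.array([1e4, 4e4, 1.6e5, 6.4e5])
err = 0.85 + 20.0 * D ** -0.5

e_inf, B, beta = fit_data_floor(D, err)
```

With only four points per axis the fitted floor is sensitive to noise, so in practice a fit like this would be paired with a confidence interval before being read as a hard limit.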

A discipline that refuted itself. The most exciting interim finding, "MAPE is operationally worthless", died when we required a committed recompute before publishing. Under two dispatch models the forecast clearly pays. The rule "no result counts until it reproduces from a committed script" is what makes every number on this site defensible.

Method

I set direction in plain language, and an orchestrating agent decomposed each call into briefs for specialized sub-agents that scraped data, wrote training and eval scripts, dispatched GPU jobs, ran paired-bootstrap tests, and committed results with SHIP/KEEP/DROP verdicts. The directions page traces the specific calls that shaped the outcome (multi-seed everything, recompute the dispatch finding from a committed script, match the eval to prior SOTA exactly).
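For readers unfamiliar with the paired-bootstrap tests mentioned above, a minimal sketch follows. The function name, resample count, and synthetic errors are assumptions for illustration. It resamples hours independently, which ignores autocorrelation in hourly errors; a block bootstrap over days would be the more careful choice, and nothing here claims to match the project's committed scripts.

```python
import numpy as np

def paired_bootstrap_ci(ape_a, ape_b, n_boot=2000, alpha=0.05, seed=0):
    """Confidence interval for mean(APE_a - APE_b) over the same hours.
    Pairing means each resample keeps both models' errors for an hour together."""
    ape_a = np.asarray(ape_a, dtype=float)
    ape_b = np.asarray(ape_b, dtype=float)
    diff = ape_a - ape_b                      # per-hour paired differences
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot_means = diff[idx].mean(axis=1)       # one mean difference per resample
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diff.mean()), float(lo), float(hi)

# Synthetic per-hour absolute percentage errors (illustrative, not real results):
# model B is worse than model A by about 0.5pp on every hour.
rng = np.random.default_rng(1)
ape_a = np.abs(rng.normal(1.0, 0.3, size=500))
ape_b = ape_a + 0.5 + rng.normal(0.0, 0.1, size=500)

mean_diff, lo, hi = paired_bootstrap_ci(ape_a, ape_b)
separated = hi < 0  # whole interval below zero: A's advantage is CI-separated
```

An interval that excludes zero is what "CI-separated" means for a margin between two models scored on the same window.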

All numbers reproduce from committed scripts. Eval protocol held fixed throughout: 8 holdout BAs never trained on, Sep–Oct 2024 validation, one-shot 2025 test, seed 42 with multi-seed confirmation for headline claims, and MAPE computed only on hours where observed demand is > 0.
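The metric convention in the protocol can be written down in a few lines. This is a sketch of MAPE restricted to hours with positive observed demand; the function name and example values are hypothetical, and the project's committed eval script may differ in detail (for instance in how it aggregates across BAs).

```python
import numpy as np

def masked_mape(observed, forecast):
    """MAPE in percent, computed only where observed > 0.
    Zero or missing (NaN) observations are excluded, since dividing by
    them would make the percentage error undefined."""
    observed = np.asarray(observed, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    mask = observed > 0                       # NaN > 0 is False, so NaNs drop out too
    if not mask.any():
        return float("nan")
    ape = np.abs(forecast[mask] - observed[mask]) / observed[mask]
    return float(100.0 * ape.mean())

# Illustrative values: the second hour has observed == 0 and is skipped.
observed = np.array([100.0, 0.0, 200.0, 50.0])
forecast = np.array([101.0, 5.0, 198.0, 50.0])
score = masked_mape(observed, forecast)       # mean of 1%, 1%, 0% over three hours
```

Masking matters on EIA-930-style data because reporting gaps can appear as zeros, and a single zero denominator would otherwise dominate or break the average.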

How low can a load forecast go, and does it matter?