gridfloor

22 experiments over 2026-05-18 → 05-26 · all on the identical eval

The problem & the eval

The US grid is split into ~60 balancing authorities (BAs): independent control areas (CAISO, PJM, ERCOT, …) that must each match generation to demand in real time. To schedule generators economically, an operator needs tomorrow's hourly demand today: a day-ahead load forecast. Errors are expensive in both directions. Under-forecast and you scramble for peaking plants or shed load; over-forecast and you hold costly spinning reserves that burn fuel for nothing. So even a fraction of a percent of MAPE compounds into millions of dollars and megatons of CO₂ across a year.

task

Given 168h of history (+ static BA features), predict the next 24h of hourly demand, per BA. Quantile outputs (q10/q50/q90) for calibrated intervals.
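As a hedged illustration of what "calibrated intervals" implies for q10/q50/q90 outputs, the sketch below scores quantile forecasts with the standard pinball (quantile) loss and checks how often the observation falls inside the q10–q90 band (nominally 80%). The source does not state which loss or calibration check the project uses; the function names here are hypothetical.

```python
def pinball_loss(quantile, observed, predicted):
    """Standard pinball loss for one forecast at the given quantile level."""
    error = observed - predicted
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return max(quantile * error, (quantile - 1.0) * error)


def interval_coverage(observed, q10, q90):
    """Fraction of hours whose observed demand lies inside [q10, q90]."""
    if not observed:
        raise ValueError("need at least one hour")
    inside = sum(1 for y, lo, hi in zip(observed, q10, q90) if lo <= y <= hi)
    return inside / len(observed)


# Illustrative numbers only (MW), not project data.
observed = [100.0, 105.0, 98.0, 120.0]
q10 = [95.0, 99.0, 97.0, 100.0]
q90 = [106.0, 110.0, 104.0, 115.0]
coverage = interval_coverage(observed, q10, q90)  # 3 of 4 hours inside
```

A well-calibrated q10–q90 band should cover roughly 80% of held-out hours; coverage far above or below that suggests intervals that are too wide or too narrow.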

data

EIA-930 hourly demand, 42 training BAs × 2018–2025 (4.4M rows) + FERC-714 utility hourly (2010–24). Public, federal.

eval slice

One-shot test: Nov 16–30 2024 + all of 2025-Feb–Dec, on 7 of 8 held-out BAs the model never trained on (LDWP excluded from the headline due to EIA-930 telemetry artifacts; see guarantee ① below). The hardest setting: out-of-distribution grids, forward in time.

metric

MAPE (mean absolute percentage error), computed only on hours with observed demand > 0 and macro-averaged over BAs. 1% MAPE ≈ a typical miss of 1 GW on a 100 GW grid, roughly one large power plant.
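A minimal sketch of the metric as described: absolute percentage error on hours with observed > 0, averaged within each BA, then averaged across BAs with equal weight. The row layout and the function name `macro_mape` are assumptions for illustration, not the project's actual code.

```python
from collections import defaultdict


def macro_mape(rows):
    """rows: iterable of (ba, observed_mw, predicted_mw) tuples.

    Returns (macro MAPE in percent, per-BA MAPE dict). Hours with
    observed <= 0 are skipped, since percentage error is undefined there.
    """
    abs_pct_errors = defaultdict(list)
    for ba, observed, predicted in rows:
        if observed > 0:
            abs_pct_errors[ba].append(abs(predicted - observed) / observed)
    if not abs_pct_errors:
        raise ValueError("no rows with observed > 0")
    per_ba = {ba: 100.0 * sum(errs) / len(errs)
              for ba, errs in abs_pct_errors.items()}
    # Macro average: every BA counts equally, regardless of size or row count.
    return sum(per_ba.values()) / len(per_ba), per_ba


# Illustrative numbers only (MW), not project data.
rows = [
    ("TVA", 100.0, 101.0),
    ("TVA", 200.0, 198.0),
    ("TVA", 0.0, 5.0),      # dropped: observed is not > 0
    ("BPAT", 50.0, 51.5),
]
overall, per_ba = macro_mape(rows)
```

Macro-averaging keeps a large BA from dominating the headline: a small grid with a poor forecast moves the number as much as a large one.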

Progress over time

The production baseline is each BA's own operational day-ahead forecast: 9.04% MAPE on the 2025 hold-out. The chart shows where the modeling work landed against that, on the identical metric. Most of the drop came from data + compute scaling; the curve then flattens at the fitted floor.

log-scale MAPE · ● = SOTA milestone · ★ = headline · red dashed = fitted floor E∞ ≈ 0.85%

Who else was scored on this exact slice

"State of the art" only means something on a shared eval. The literature reports BA / zonal / system load MAPE across wildly different test sets, horizons, years, and even metric definitions (sMAPE, WAPE), none directly comparable. So rather than cite incompatible numbers, we ran (or directly measured) every contender on the identical protocol above. The incumbent operational forecast is the one that matters: it predicts the same series, same hours, that utilities dispatch against today.

all on hold-ex-LDWP 2025-Feb–Dec, same MAPE-on-(obs>0), one-shot · external published work uses different slices and is not plotted

Why the 1.02% isn't a data leak

A headline MAPE is only meaningful if the test set is genuinely held out from training and tuning. Six guarantees, enforced in code on every run:

held-out entities, not just held-out hours

8 BAs (SWPP, DUK, BPAT, TVA, AZPS, NEVP, PSEI, LDWP) are entirely absent from training: the model has never seen these grids at all, at any time. We test on 7 of them. This is OOD generalization, not interpolation.

runtime guard in every loader

Every training script and per-branch test suite runs an assertion that fails the build if any holdout BA appears in the training partition. The check lives in the loader, not only in the tests, so a leak is caught at the data boundary before any gradients flow.
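A loader-side guard of this kind can be sketched as follows. The holdout list comes from the text above; `HoldoutLeakError`, `assert_no_holdout_leak`, and `load_training_rows` are hypothetical names, and the row format is an assumption.

```python
HOLDOUT_BAS = frozenset(
    {"SWPP", "DUK", "BPAT", "TVA", "AZPS", "NEVP", "PSEI", "LDWP"}
)


class HoldoutLeakError(RuntimeError):
    """Raised when a held-out BA shows up in the training partition."""


def assert_no_holdout_leak(train_ba_codes):
    """Fail loudly if any holdout BA code is present in the training data."""
    leaked = HOLDOUT_BAS.intersection(train_ba_codes)
    if leaked:
        raise HoldoutLeakError(
            f"holdout BAs in training partition: {sorted(leaked)}"
        )


def load_training_rows(rows):
    """Hypothetical loader: rows are dicts with a 'ba' key."""
    rows = list(rows)
    # Guard runs inside the loader, before any row reaches the trainer.
    assert_no_holdout_leak(row["ba"] for row in rows)
    return rows
```

Raising an explicit exception rather than using a bare `assert` statement is a deliberate choice in this sketch: `assert` is stripped when Python runs with `-O`, which would silently disable the guard.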

one-shot temporal split

Validation = Sep–Oct 2024. Test = Nov 16–30 2024 + 2025-Feb–Dec, touched once. No hyperparameter was tuned against the test window.

pretest-only preprocessing

Every normalizer (per-BA/channel z-scores) and the LDWP cleaning thresholds are fit only on ≤2024-11-15 data, verified by a dedicated test suite. No test statistic leaks back into the inputs.
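The idea of pretest-only fitting can be sketched like this: z-score statistics are computed per (BA, channel) from rows dated on or before the cutoff, then applied unchanged to later data. The tuple layout, function names, and the reading of "≤2024-11-15" as inclusive of that whole day are assumptions.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import fmean, pstdev

PRETEST_CUTOFF = date(2024, 11, 15)  # assumed inclusive of the full day


def fit_normalizers(rows, cutoff=PRETEST_CUTOFF):
    """rows: iterable of (ba, channel, timestamp, value) tuples.

    Returns {(ba, channel): (mean, std)} fit only on rows dated <= cutoff.
    """
    values_by_key = defaultdict(list)
    for ba, channel, timestamp, value in rows:
        if timestamp.date() <= cutoff:  # test-window rows never enter the fit
            values_by_key[(ba, channel)].append(value)
    stats = {}
    for key, values in values_by_key.items():
        std = pstdev(values)
        # Fall back to 1.0 for a constant channel to avoid dividing by zero.
        stats[key] = (fmean(values), std if std > 0 else 1.0)
    return stats


def normalize(stats, ba, channel, value):
    """Apply the frozen pretest statistics to any value, including test data."""
    mean, std = stats[(ba, channel)]
    return (value - mean) / std
```

Because the statistics are frozen at the cutoff, an unusual test-window value shows up as a large z-score instead of shifting the mean and scale it is measured against.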

multi-seed confirmation

Headline gains are confirmed across seeds {42, 43, 44, 7, 13} with paired-bootstrap CIs, so a lucky single-seed draw can't masquerade as SOTA. This is what caught BA-mixup.
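One way to compute a paired-bootstrap CI is sketched below: the baseline and the candidate are scored on the same evaluation units, the per-unit differences are resampled with replacement, and a percentile interval is taken over the resampled means. The unit of pairing, the resample count, and the function name are assumptions; the source does not specify them.

```python
import random


def paired_bootstrap_ci(errors_baseline, errors_candidate,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(candidate - baseline) over paired units.

    Both lists hold per-unit errors (e.g. per-BA MAPE) in the same order.
    A CI entirely below zero means the candidate is reliably better.
    """
    if len(errors_baseline) != len(errors_candidate) or not errors_baseline:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(errors_baseline)
    # Pairing: work with per-unit differences so units are resampled jointly.
    diffs = [c - b for b, c in zip(errors_baseline, errors_candidate)]
    means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int(alpha / 2 * n_resamples)
    hi_idx = n_resamples - 1 - lo_idx
    return means[lo_idx], means[hi_idx]
```

To mirror the multi-seed check, this would be run once per training seed, and a gain would count as confirmed only if the interval excludes zero for every seed.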

recompute-and-commit

No number is reported until it reproduces from a committed script. This caught a false "MAPE-is-useless" finding that existed only in an agent transcript.

The full ledger

Dates are commit dates on the project's git branches. Verdicts: SHIP = adopted · KEEP = real but minor · PARTIAL = improved but below gate · DROP = negative / noise. Every row links to a committed branch (see the flowchart node drawers for exact scripts + SHAs).
