
Experiments ledger — every run, why, and what broke

22 experiments over 2026-05-18 → 2026-05-26 · all on the identical eval

The problem & the eval

The US grid is split into ~60 balancing authorities (BAs) — independent control areas (CAISO, PJM, ERCOT, …) that must each match generation to demand in real time. To schedule generators economically, an operator needs tomorrow's hourly demand today: a day-ahead load forecast. Errors are expensive in both directions — under-forecast and you scramble for peaking plants or shed load; over-forecast and you hold costly spinning reserves that burn fuel for nothing. So even a fraction of a percent of MAPE compounds into millions of dollars and megatons of CO₂ across a year.

task

Given 168h of history (+ static BA features), predict the next 24h of hourly demand, per BA. Quantile outputs (q10/q50/q90) for calibrated intervals.
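The windowing implied by this task can be sketched as below. This is a minimal illustration, not the project's loader: the function name `make_windows` and its `stride` parameter are hypothetical, and static BA features and the quantile heads are left out.

```python
from typing import List, Sequence, Tuple

HISTORY_H = 168  # hours of context, per the task definition
HORIZON_H = 24   # hours to predict, per the task definition


def make_windows(
    demand: Sequence[float], stride: int = 24
) -> List[Tuple[List[float], List[float]]]:
    """Slice one BA's hourly demand into (history, target) pairs.

    `stride` (hypothetical) is the gap in hours between window starts.
    """
    pairs = []
    last_start = len(demand) - HISTORY_H - HORIZON_H
    for start in range(0, last_start + 1, stride):
        split = start + HISTORY_H
        history = list(demand[start:split])
        target = list(demand[split:split + HORIZON_H])
        pairs.append((history, target))
    return pairs


# Toy series: 10 days of hourly values 0..239 (not real demand data).
series = [float(h) for h in range(240)]
windows = make_windows(series)
```

With a stride of 24 each window starts one day after the previous one, so targets never overlap; a smaller stride would yield more, overlapping, training examples.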

data

EIA-930 hourly demand, 42 training BAs × 2018–2025 (4.4M rows) + FERC-714 utility hourly (2010–24). Public, federal.

eval slice

One-shot test: Nov 16–30 2024 + all of 2025-Feb–Dec, on 7 of 8 held-out BAs the model never trained on (LDWP excluded from the headline due to EIA-930 telemetry artifacts; see guarantee ① below). The hardest setting: out-of-distribution grids, forward in time.

metric

MAPE on observed>0 (mean abs % error), macro-averaged over BAs. ~1% ≈ a 1 GW typical miss on a 100 GW grid — one large power plant.
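The metric as described can be written out in a few lines. This is a sketch under assumptions: the row layout `(ba, observed_mw, predicted_mw)` and the function name `macro_mape` are illustrative, not taken from the project's scripts.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def macro_mape(rows: Iterable[Tuple[str, float, float]]) -> float:
    """MAPE in percent, computed per BA on observed > 0, then averaged over BAs.

    rows: (ba, observed_mw, predicted_mw) tuples (hypothetical layout).
    """
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for ba, obs, pred in rows:
        if obs <= 0:  # the metric is defined only where observed > 0
            continue
        sums[ba] += abs(pred - obs) / obs
        counts[ba] += 1
    if not counts:
        raise ValueError("no rows with observed > 0")
    per_ba = [100.0 * sums[ba] / counts[ba] for ba in counts]
    return sum(per_ba) / len(per_ba)  # macro average: each BA weighs the same


# Toy rows, not real data.
toy_rows = [
    ("A", 100.0, 101.0),  # 1% error
    ("A", 0.0, 5.0),      # skipped: observed is not > 0
    ("A", 200.0, 196.0),  # 2% error
    ("B", 50.0, 52.0),    # 4% error
]
score = macro_mape(toy_rows)
```

Macro-averaging means a small BA counts as much as a large one; pooling all hours before averaging would instead let the BAs with the most valid rows dominate.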

Progress over time

The production baseline is each BA's own operational day-ahead forecast: 9.04% MAPE on the 2025 hold-out. The chart shows where the modeling work landed against that, on the identical metric. Most of the drop came from data + compute scaling; the curve then flattens at the fitted floor.

[Chart: MAPE (log scale) over time. ● = SOTA milestone · ★ = headline · red dashed line = fitted floor E∞ ≈ 0.85%]

Who else was scored on this exact slice

"State of the art" only means something on a shared eval. The literature reports BA / zonal / system load MAPE across wildly different test sets, horizons, years, and even metric definitions (sMAPE, WAPE) — none directly comparable. So rather than cite incompatible numbers, we ran (or directly measured) every contender on the identical protocol above. The incumbent operational forecast is the one that matters: it predicts the same series, same hours, that utilities dispatch against today.

[Chart: every contender scored on the 7 held-out BAs (LDWP excluded), 2025-Feb–Dec, same MAPE on observed>0, one-shot. External published work uses different slices and is not plotted.]

How we know the test set wasn't poisoned

A "1% MAPE" is only meaningful if the test genuinely differs from training. Six guarantees, enforced in code on every run:

① held-out entities, not just hours

8 BAs (SWPP, DUK, BPAT, TVA, AZPS, NEVP, PSEI, LDWP) are entirely absent from training — the model has never seen these grids at all, at any time. We test on 7 of them. This is OOD generalization, not interpolation.

② assert gate in every loader
assert not train.ba.isin(HOLDOUT_BAS).any()

runs inside each training script and each per-branch test suite. A holdout BA in training data fails the build.
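A self-contained version of that gate might look like the following. The `ba` column and `HOLDOUT_BAS` name come from the assert above, and the eight codes are those listed under guarantee ①; the helper name `assert_no_holdout`, the `demand_mw` column, and the toy frames are assumptions for illustration.

```python
import pandas as pd

# The eight held-out BAs listed under guarantee ①.
HOLDOUT_BAS = frozenset(
    {"SWPP", "DUK", "BPAT", "TVA", "AZPS", "NEVP", "PSEI", "LDWP"}
)


def assert_no_holdout(train: pd.DataFrame) -> None:
    """Fail fast if any held-out BA appears in the training frame."""
    leaked = sorted(set(train["ba"]) & HOLDOUT_BAS)
    assert not leaked, f"held-out BAs in training data: {leaked}"


# Toy frames, not real data.
clean = pd.DataFrame({"ba": ["CISO", "PJM"], "demand_mw": [25000.0, 90000.0]})
leaky = pd.DataFrame({"ba": ["CISO", "TVA"], "demand_mw": [25000.0, 20000.0]})

assert_no_holdout(clean)  # passes silently
try:
    assert_no_holdout(leaky)
    gate_fired = False
except AssertionError:
    gate_fired = True
```

One caveat: Python skips `assert` statements when run with `-O`, so a gate that must also hold in optimized runs would need an explicit `raise` instead.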

③ one-shot temporal split

Validation = Sep–Oct 2024. Test = Nov 16–30 2024 + 2025-Feb–Dec, touched once. No hyper-parameter was tuned against the test window.
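The split boundaries above can be encoded as half-open date ranges. This is a sketch only: the function name `split_of` is hypothetical, timestamps are treated as naive datetimes, and hours outside the stated validation and test windows are labeled `"other"` because the source does not say how each of them is used.

```python
from datetime import datetime

# Validation: September and October 2024 (end bound exclusive).
VAL_START, VAL_END = datetime(2024, 9, 1), datetime(2024, 11, 1)

# Test: Nov 16 to Nov 30 2024, plus February through December 2025.
TEST_WINDOWS = [
    (datetime(2024, 11, 16), datetime(2024, 12, 1)),
    (datetime(2025, 2, 1), datetime(2026, 1, 1)),
]


def split_of(ts: datetime) -> str:
    """Return 'val', 'test', or 'other' for one hourly timestamp."""
    if VAL_START <= ts < VAL_END:
        return "val"
    if any(lo <= ts < hi for lo, hi in TEST_WINDOWS):
        return "test"
    return "other"


labels = [
    split_of(datetime(2024, 10, 31, 23)),  # last validation hour
    split_of(datetime(2024, 11, 30, 23)),  # last hour of the 2024 test window
    split_of(datetime(2025, 1, 15)),       # gap between the two test windows
]
```

Half-open ranges keep the boundary hours unambiguous: midnight on 2024-11-01 falls outside validation, and midnight on 2026-01-01 falls outside test.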

④ pretest-only preprocessing

Every normalizer (per-BA/channel z-scores) and the LDWP cleaning thresholds are fit only on ≤2024-11-15 data — verified by test_*_pretest_only. No test statistic leaks into the inputs.
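The idea of fitting statistics on pre-test data and freezing them can be sketched as follows. The column names (`ba`, `ts`, `demand_mw`) and both helper names are assumptions, and only a single demand channel is shown; this is not the project's normalizer.

```python
import pandas as pd

# Exclusive cutoff, i.e. statistics use data through 2024-11-15.
PRETEST_CUTOFF = pd.Timestamp("2024-11-16")


def fit_zscore_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Per-BA mean and std of `demand_mw`, using pre-test rows only."""
    pretest = df[df["ts"] < PRETEST_CUTOFF]
    return pretest.groupby("ba")["demand_mw"].agg(["mean", "std"])


def apply_zscore(df: pd.DataFrame, stats: pd.DataFrame) -> pd.Series:
    """Normalize every row, test rows included, with the frozen statistics."""
    mean = df["ba"].map(stats["mean"])
    std = df["ba"].map(stats["std"])
    return (df["demand_mw"] - mean) / std


# Toy frame, not real data: two pre-test rows and one test-period row.
toy = pd.DataFrame(
    {
        "ba": ["X", "X", "X"],
        "ts": pd.to_datetime(["2024-11-14", "2024-11-15", "2025-03-01"]),
        "demand_mw": [10.0, 20.0, 1000.0],
    }
)
stats = fit_zscore_stats(toy)
z = apply_zscore(toy, stats)
```

A check in the spirit of `test_*_pretest_only` is to perturb the test-period rows and confirm the fitted statistics do not change.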

⑤ multi-seed confirmation

Headline gains are confirmed across seeds {42,43,44,7,13} with paired-bootstrap CIs — so a lucky single-seed draw can't masquerade as SOTA (this is what caught BA-mixup).
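A paired bootstrap of the kind mentioned here can be sketched with the standard library. The source does not say what the resampling unit is, so the inputs below are generic paired per-unit errors (per BA, per window, or per seed); the function name, defaults, and toy numbers are all illustrative.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(
    baseline_err: Sequence[float],
    candidate_err: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for mean(candidate - baseline) over paired units."""
    if len(baseline_err) != len(candidate_err):
        raise ValueError("inputs must be paired and of equal length")
    diffs = [c - b for b, c in zip(baseline_err, candidate_err)]
    rng = random.Random(seed)
    n = len(diffs)
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample whole pairs with replacement, keeping the pairing intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Toy per-unit MAPE values, not real results.
baseline = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 0.8]
candidate = [0.9, 1.1, 0.85, 1.0, 1.15, 0.9, 0.75]
ci_lo, ci_hi = paired_bootstrap_ci(baseline, candidate)
```

If the whole interval sits below zero, the candidate's improvement holds up under resampling; an interval that straddles zero is consistent with noise.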

⑥ recompute-and-commit

No number is reported until it reproduces from a committed script. This caught a false "MAPE-is-useless" finding that existed only in an agent transcript.

The full ledger

Dates are commit dates on the project's git branches. Verdicts: SHIP = adopted · KEEP = real but minor · PARTIAL = improved but below gate · DROP = negative / noise. Every row links to a committed branch (see the flowchart node drawers for exact scripts + SHAs).