Scoreboard — the whole result in one screen
at-a-glance TL;DR · each block expanded elsewhere (links inline) · hold-ex-LDWP 2025-Feb–Dec · all numbers from committed branches
Headline
verified SOTA (multi-seed)
1.030%
L × 80ep × XBRL, 5-seed mean ±0.010, vs ~2.3% industry baseline.
scaling-law floor E∞
0.84–0.91%
Fitted on an 18-cell sweep. We sit ~0.1–0.2pp above the floor.
failed interventions
13
Every feature / arch / synthetic-data add: negative or within-noise.
dispatch value captured
86–89%
Of perfect-foresight value above persist24, under both dispatch models.
Three findings
1 · Forecast accuracy buys operational value · refutes "MAPE useless"
An uncommitted analysis claimed MAPE gains were operationally worthless (0.8% of dispatch value). Committed recomputes under both peak-shaving and price-arbitrage battery dispatch overturn it: the forecast captures 86–89% of the perfect-foresight value 24h-persistence leaves on the table. The "0.8%" reproduced under neither model — it survived only while uncommitted.
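The "value captured" percentage can be read as a ratio of headroom: how much of the gap between 24h-persistence and perfect foresight the forecast closes. A minimal sketch of that arithmetic, assuming dispatch value is a single scalar per run; the function name and the input numbers are hypothetical, not taken from the committed branches:

```python
def value_capture(v_forecast, v_baseline, v_perfect):
    """Fraction of the perfect-foresight value above the baseline
    that dispatching on the forecast recovers."""
    headroom = v_perfect - v_baseline
    if headroom <= 0:
        raise ValueError("perfect foresight must beat the baseline")
    return (v_forecast - v_baseline) / headroom


# Hypothetical dispatch values in arbitrary units (illustration only).
print(value_capture(v_forecast=94.0, v_baseline=50.0, v_perfect=100.0))  # 0.88
```

Under this definition a claim like "0.8% of dispatch value" and a claim like "86–89% captured" are only comparable if both use the same baseline and the same perfect-foresight reference.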
2 · Foundation models structurally can't compete · DROP
Chronos-2 (zero-shot 3.48%, fine-tuned 3.34%), Sundial (4.02%), TabPFN-TS all plateau near 3–4% MAPE. They're univariate; the cross-BA correlation structure is the entire game, and a 2.2M-param multivariate transformer encodes it where 200M-param univariate models can't. Fine-tuning moved Chronos only 0.14pp and didn't decorrelate it (r 0.24 → 0.25).
3 · Data is the only lever — and nearly exhausted
Model-size axis is saturated (L ties XL). Epoch axis plateaus at ~80. Data axis still has slope (β ≈ 0.5; XBRL +29% data delivered its predicted 0.014pp gain to within 0.001pp), but same-distribution data is scarce. Redundant signal (weather, interchange, fuel-mix) doesn't help; mismatched data (pre-2018, cross-continent, synthetic) hurts.
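To see why a +29% data increase buys so little at β ≈ 0.5, here is a rough sketch assuming the data axis follows a power law of the form E(D) = E∞ + A·D^(−β). That functional form, the function name, and the 0.12pp reducible term (the gap between 1.03% and the upper floor estimate of 0.91%) are assumptions for illustration, not the fitted parameters from the sweep:

```python
def predicted_gain_pp(reducible_pp, data_ratio, beta=0.5):
    """MAPE improvement (percentage points) from multiplying the training
    set size by data_ratio, assuming E(D) = E_inf + A * D**(-beta).

    reducible_pp is the current error above the floor, E(D) - E_inf.
    """
    if data_ratio <= 0:
        raise ValueError("data_ratio must be positive")
    return reducible_pp * (1.0 - data_ratio ** (-beta))


# Assumed inputs: 0.12pp above the floor, +29% data, beta = 0.5.
gain = predicted_gain_pp(reducible_pp=0.12, data_ratio=1.29, beta=0.5)
print(f"{gain:.4f} pp")
```

With a square-root exponent, even doubling the data removes only about 29% of the remaining reducible error, which is consistent with the "nearly exhausted" reading.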
| configuration | MAPE % | status |
| --- | --- | --- |
| Scaling-law L × 20ep | 1.080 | sweep cell |
| L × 40ep × XBRL (4-seed mean) | 1.034 | verified |
| L × 80ep × XBRL (5-seed mean) | 1.030 | verified SOTA |
| L × 80ep × XBRL (seed 42) | 1.020 | best single seed |
peak-shaving battery · forecast MAPE 1.63% · persist24 6.22% · persist168 9.87%
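For reference, persist24 and persist168 are the usual persistence baselines: the forecast for hour t is the observed load 24 or 168 hours earlier. A minimal sketch of how such baselines and MAPE can be computed on an hourly series; the helper names and the synthetic series are hypothetical, and the evaluation code on the committed branches may differ (for example in how masked hours are handled):

```python
def mape(actual, forecast):
    """Mean absolute percentage error in percent, skipping zero actuals."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actuals to score")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def persistence_mape(series, lag):
    """MAPE of the lag-hour persistence forecast: y_hat[t] = y[t - lag].

    The first `lag` hours have no forecast and are not scored.
    """
    if lag <= 0 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return mape(series[lag:], series[:-lag])


# Synthetic hourly load: a pure daily cycle plus a slow upward trend.
load = [1000 + 50 * (h % 24) + 0.5 * h for h in range(24 * 14)]
print(f"persist24:  {persistence_mape(load, 24):.2f}%")
print(f"persist168: {persistence_mape(load, 168):.2f}%")
```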
Foundation models — the univariate ceiling
| model | solo MAPE % | note |
| --- | --- | --- |
| L iTransformer (ours) | 1.03 | multivariate, 2.2M params |
| Chronos-2 zero-shot | 3.48 | univariate, 200M+ |
| Chronos-2 fine-tuned | 3.34 | full FT on our panel |
| Sundial base-128m | 4.02 | univariate |
The graveyard
13 of 16 interventions returned negative or within-noise — BA-mixup, MoE, foundation static/dynamic/fine-tuned stacks, GridLAB-D synth (v1/v2), ResStock, pre-2018 EIA (raw + normalized), FERC parser, interchange/fuel-mix channels. The throughline: the model already extracts everything from demand history + cross-BA structure; redundant signal doesn't help and mismatched data hurts. → full ledger with dates, motivation, and what broke
Shipped artifacts
SOTA checkpoint
L × 40–80ep
~1.03% multi-seed. Operating point — don't train longer.
per-BA online ACI
0.66 → 0.80
NEVP/AZPS PI80 coverage, zero-cost post-process.
LDWP cleaning rule
3.01 → 2.34%
Masks 10/7993 telemetry-artifact hours; pretest-derived.
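On the per-BA online ACI artifact: assuming "ACI" here means adaptive conformal inference in the Gibbs and Candès sense, the post-process keeps one running miscoverage level per BA and nudges it after every hour depending on whether the interval covered the actual. The sketch below is a generic single-series version under that assumption; the function name, `gamma`, `warmup`, and the synthetic data are illustrative choices, not the settings on the per-ba-aci branch:

```python
import bisect
import random


def online_aci_coverage(actuals, forecasts, target=0.80, gamma=0.01, warmup=50):
    """Run online ACI on one series and return empirical interval coverage.

    Intervals are forecast +/- a quantile of past absolute residuals; the
    quantile level 1 - alpha_t is adapted after each observation.
    """
    alpha_target = 1.0 - target
    alpha_t = alpha_target
    scores = []  # past absolute residuals, kept sorted
    hits = []
    for y, f in zip(actuals, forecasts):
        resid = abs(y - f)
        if len(scores) >= warmup:
            # Clip the level to [0, 1]; at level 1 we fall back to the
            # largest residual seen so far (a simplification of "infinite").
            level = min(max(1.0 - alpha_t, 0.0), 1.0)
            idx = min(int(level * len(scores)), len(scores) - 1)
            hit = resid <= scores[idx]
            hits.append(hit)
            err = 0.0 if hit else 1.0
            # ACI update: misses lower alpha_t (wider), hits raise it (narrower).
            alpha_t += gamma * (alpha_target - err)
        bisect.insort(scores, resid)
    return sum(hits) / len(hits) if hits else float("nan")


# Synthetic check: forecast noise triples halfway through the series.
rng = random.Random(0)
truth = [1000.0] * 2000
preds = [1000.0 + rng.gauss(0, 10 if t < 1000 else 30) for t in range(2000)]
print(f"PI80 coverage: {online_aci_coverage(truth, preds):.3f}")
```

Because the update only needs the previous interval and the realized value, it adds no training cost, which matches the "zero-cost post-process" description.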
Every number cross-referenced to a committed branch (l-80ep-multiseed-v5, scaling-law, foundation-stack, chronos-finetuned, per-ba-aci, ldwp-audit, synthesis-final). Anti-pollution: 8 holdout BAs never in training · Sep–Oct 2024 val · one-shot 2025 test · seed 42 with multi-seed confirmation for headline claims.