gridfloor

at-a-glance TL;DR · each block expanded elsewhere (links inline) · hold-ex-LDWP 2025-Feb–Dec · all numbers from committed branches

Headline

verified SOTA (multi-seed)

1.030%

L × 80ep × XBRL, 5-seed mean ±0.010, vs ~2.3% industry baseline.

scaling-law floor E∞

0.84–0.91%

Fitted on 18-cell sweep. We sit ~0.12-0.19pp above the floor.

failed interventions

Every feature / arch / synthetic-data add: negative or within-noise.

dispatch value over persist-24h

+$47k/MW-yr

4h battery, $15/kW-mo + ERCOT '24 prices. × ~41 GW US fleet (EIA EOY 2025) = ~$1.9B/yr.

Three findings

1 · Forecast accuracy buys real operational dollars · refutes "MAPE useless"
An uncommitted analysis claimed MAPE gains were operationally worthless (0.8% of dispatch value). Committed recomputes under both peak-shaving and price-arbitrage battery dispatch overturn it. In absolute dollars, GridFloor adds +$25k / $47k / $49k per MW-year over persist-24h at 2h / 4h / 8h battery durations. Multiplied across the ~41 GW US grid-scale fleet (EIA EOY 2025), the 4h figure comes to ~$1.9B/year vs persistence and ~$1.0B/year vs the best foundation model (Chronos-2 FT). The "0.8%" figure reproduced under neither dispatch model.
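The fleet-level figures are straight multiplication. A minimal sketch, assuming the per-MW-year deltas apply uniformly across the whole fleet (the same simplification the headline makes). `fleet_value_usd` is an illustrative helper, not something from the committed branches; the 4h inputs come from the dispatch table.

```python
# Illustrative arithmetic only: scales a per-MW-year dispatch delta to the US fleet.
FLEET_GW = 41.0      # ~41 GW US grid-scale fleet (EIA EOY 2025), as quoted above
MW_PER_GW = 1_000


def fleet_value_usd(delta_usd_per_mw_year, fleet_gw=FLEET_GW):
    """Return fleet-wide $/year for a given $/MW-year delta."""
    return delta_usd_per_mw_year * fleet_gw * MW_PER_GW


# 4h battery: GridFloor vs persist-24h (+$47k/MW-yr)
vs_persistence = fleet_value_usd(47_000)

# 4h battery: GridFloor ($177k) vs Chronos-2 fine-tuned ($151k)
vs_chronos_ft = fleet_value_usd(177_000 - 151_000)

print(f"vs persist-24h:   ${vs_persistence / 1e9:.2f}B/yr")
print(f"vs Chronos-2 FT:  ${vs_chronos_ft / 1e9:.2f}B/yr")
```

The extrapolation treats every MW of the fleet as a 4h battery facing the same tariff and price series, so it is a scale indicator rather than a market forecast.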

Two subtleties worth noting separately. (a) %-of-perfect-foresight grows fast with duration (89% → 95% → 97% across 2h/4h/8h) but the absolute $ lead stays flat (+$47k at 4h, +$49k at 8h). Long batteries are not forecast-insensitive; %-capture just hides the constant absolute spread. (b) Sundial breaks the MAPE→$ monotone: it has lower MAPE than persist-24h but is worse on peak-shaving $ at 2h and 4h, recovering only at 8h, once shoulder hours hide its tail-day mis-calls. Tail-day accuracy beats mean MAPE wherever timing matters.

Revenue-stack honesty: we modeled 2 of the typical 5 BESS revenue streams (arbitrage + demand charges). Capacity/RA payments, ancillary services, and ancillary-cleared energy aren't in the sim. In 2023 that would have understated the value materially: ancillaries were 84% of ERCOT BESS revenue. By 2025 ancillaries had fallen ~90% from peak and energy arbitrage (which we do model) is now the majority stream per Modo Energy Q3 2025, so our coverage is closer to the dominant streams of the modern revenue mix than to "we only got the small ones."

2 · Foundation models structurally can't compete · DROP
We ran six foundation models under controlled conditions on hold-ex-LDWP: Chronos-2 ZS (3.48%) and FT (3.34%), Sundial (4.02%), TimesFM 2.0 (4.22%), Moirai 1.1-R-large (4.32%), TabPFN-TS (5.46%, MASE 1.06, loses to seasonal-naive). All plateau at 3-5.5% MAPE despite scale (TimesFM 500M, Moirai 311M, Chronos 200M+). They're univariate; the cross-BA correlation structure is the entire game, and a 2.2M-param multivariate transformer encodes it where bigger univariate models can't. Fine-tuning Chronos moved MAPE 0.14pp and didn't decorrelate (r 0.24 → 0.25). The surprise: TimesFM and Moirai are both worse than Chronos-2 on US BA load despite beating it on general benchmarks, so the univariate ceiling is tighter than the literature suggests.

3 · Data is the only lever, and nearly exhausted
Model-size axis saturated (L ties XL). Epoch axis plateaus at ~80. Data axis still has slope (β ≈ 0.5; XBRL +29% data delivered its predicted +0.014pp gain to within 0.001pp) but same-distribution data is scarce. Redundant signal (weather, interchange, fuel-mix) doesn't help; mismatched data (pre-2018, cross-continent, synthetic) hurts.
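To make the data-axis slope concrete, here is a sketch of what one more data increment would buy. It assumes the fitted law has the common power-law form E(D) = E∞ + A·D^(−β). The source only states β ≈ 0.5 and the E∞ range, so the functional form and the helper name `predicted_gain_pp` are assumptions, not the committed fit.

```python
# Sketch under an ASSUMED power-law form: E(D) = E_inf + A * D**(-beta).
# Only beta ~ 0.5 and the E_inf range (0.84-0.91%) come from the summary above.

def predicted_gain_pp(current_mape, floor, data_multiplier, beta=0.5):
    """MAPE reduction (in pp) expected from scaling the dataset by data_multiplier.

    Under the assumed form, the reducible error (current_mape - floor)
    shrinks by a factor of data_multiplier ** (-beta).
    """
    reducible = current_mape - floor
    return reducible * (1.0 - data_multiplier ** (-beta))


CURRENT_SOTA = 1.030   # L x 80ep x XBRL, 5-seed mean
MORE_DATA = 1.29       # a hypothetical further +29% of same-distribution data

for floor in (0.84, 0.91):  # both ends of the fitted E_inf range
    gain = predicted_gain_pp(CURRENT_SOTA, floor, MORE_DATA)
    print(f"E_inf={floor:.2f}%  ->  predicted gain {gain:.3f}pp")
```

Under those assumptions, another +29% of same-distribution data starting from the 1.030% SOTA would be worth roughly 0.014 to 0.023pp, which is comparable to the ±0.010 seed spread.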

SOTA + scaling → progress + ledger

configuration	MAPE %	status
Scaling-law L × 20ep	1.080	sweep cell
L × 40ep × XBRL (4-seed mean)	1.034	verified
L × 80ep × XBRL (5-seed mean)	1.030	verified SOTA
L × 80ep × XBRL (seed 42)	1.020	best single seed

Dispatch value capture → interactive

three battery durations · $15/kW-mo demand charge + ERCOT '24 hub-bus-avg prices · 49 BAs ex-LDWP · 2025-Feb-Dec

signal	2h $/MW-yr	4h $/MW-yr	8h $/MW-yr	Δ vs persist-24h (4h)
perfect foresight	$93k	$184k	$246k	—
GridFloor (SOTA)	$84k	$177k	$242k	+$47k
Chronos-2 fine-tuned	$73k	$151k	$221k	+$22k
Sundial	$58k	$128k	$209k	−$1k
persist-24h (baseline)	$58k	$130k	$193k	—

Same %-of-perfect bars in case they're useful for sanity: persist168 48.8% · persist24 51.5% · forecast 94.5% (4h peak-shaving). Note that %-of-perfect at 8h hits 97% for GridFloor and 67% for persist-24h, but the absolute $ gap stays flat.

Foundation models, the univariate ceiling, controlled head-to-head

all 6 FMs scored on the identical hold-ex-LDWP slice, day-ahead 24h, no fine-tuning except where noted · all univariate

model	MAPE %	MASE	nMAE %	note
L iTransformer (ours)	1.02	0.19	0.98	multivariate, 2.2M params
Chronos-2 fine-tuned	3.34	—	—	200M+, full FT on our panel
Chronos-2 zero-shot	3.48	—	—	univariate
Sundial base-128m	4.02	—	—	univariate, flow-matching
TimesFM 2.0	4.22	0.80	4.18	500M, decoder-only (Google)
Moirai 1.1-R-large	4.32	0.82	4.25	311M, masked encoder (Salesforce)
TabPFN-TS	5.46	1.06	5.52	raw TabPFN + lag/calendar features · only FM that loses to seasonal-naive

The surprise: TimesFM 2.0 (500M) and Moirai (311M) both land worse than Chronos-2 ZS (200M), the opposite of what their published benchmarks would predict. On stable-climate US BA load, the univariate ceiling sits at ~3.3-4.3% regardless of parameter count or pretraining corpus.
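For reference, the MAPE and MASE columns can be computed along these lines. This is a sketch of the standard definitions, not the project's evaluation code: the 24h seasonal period for the naive scaler, the function names, and the toy numbers are all assumptions. MASE > 1 means the forecast's MAE is worse than the in-sample MAE of the seasonal-naive forecast, which is how TabPFN-TS "loses to seasonal-naive".

```python
# Standard metric definitions; season=24 (hourly data, daily cycle) is an assumption.

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


def mase(actual, forecast, history, season=24):
    """Forecast MAE scaled by the in-sample MAE of a seasonal-naive forecast."""
    naive_errors = [
        abs(history[t] - history[t - season])
        for t in range(season, len(history))
    ]
    naive_mae = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / naive_mae


# Toy, hypothetical series: a daily cycle that drifts up 0.5 units per day.
history = [100.0 + (t % 24) + 0.5 * (t // 24) for t in range(72)]
actual = [110.0, 120.0]
forecast = [111.0, 119.0]

print(f"MAPE = {mape(actual, forecast):.2f}%")
print(f"MASE = {mase(actual, forecast, history):.2f}")
```

Because MASE is scaled per series, it stays comparable across BAs of very different load magnitude, which MAPE alone does not guarantee.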

The graveyard

13 documented negatives in the graveyard: BA-mixup, MoE, foundation static/dynamic/fine-tuned stacks, GridLAB-D synth (v1/v2), ResStock, pre-2018 EIA (raw + normalized), FERC parser, interchange/fuel-mix channels. The throughline: the model already extracts everything from demand history + cross-BA structure; redundant signal doesn't help and mismatched data hurts. → full ledger with dates, motivation, and what broke

Shipped artifacts

Every number cross-referenced to a committed branch (l-80ep-multiseed-v5, scaling-law, foundation-stack, chronos-finetuned, per-ba-aci, ldwp-audit, synthesis-final). Anti-pollution: 8 holdout BAs never in training · Sep–Oct 2024 val · one-shot 2025 test · seed 42 with multi-seed confirmation for headline claims.