Cheatsheet

a short read for someone new to grid forecasting · what the problem is, what we found, and the vocabulary

The problem in one paragraph

The US grid is split into ~66 control areas called balancing authorities (CAISO, PJM, ERCOT, …). Each has to match electricity generation to electricity demand in real time. To schedule generators economically, they need to know tomorrow's hourly demand today: that prediction is the day-ahead load forecast. Errors are expensive both ways: under-forecast and you scramble for peaking plants (or shed load); over-forecast and you hold costly spinning reserves that burn fuel for nothing. Even a fraction of a percent of error compounds into millions of dollars and megatons of CO₂ across a year.

The result, in three lines

1.02% MAPE on 7 of 8 held-out BAs (hold-ex-LDWP) forward-2025, versus each BA's own published day-ahead forecast at 9.04% on the identical eval. ~8.9× better, ~0.1pp above the fitted floor.

finding	what it says
GridFloor SOTA (multi-seed)	1.03% MAPE, near the empirical floor from our scaling sweep (best observed cell 1.061%; fitted asymptote ≈ 0.85% under a Chinchilla-style additive form, directional, not a calibrated CI).
Foundation models plateau	Chronos-2, Sundial, TabPFN-TS all stuck at 3–4%. They're univariate; can't encode cross-BA structure.
Accuracy → operational value	The forecast captures 86–89% of perfect-foresight battery-dispatch value above 24h-persistence, under both peak-shaving and price arbitrage.
Only data moves it	Model size + epochs are dead levers; data is the only one with slope, and same-distribution data is nearly exhausted.

What we actually built (not an off-the-shelf model)

The model architecture is in the iTransformer family, but the substantive contributions live around it. The result number depends on every one of these:

A permutation-invariant adaptation of the iTransformer. Each BA becomes one token, and attention runs across BAs; instance-norm + quantile loss + static BA features form a recipe that generalizes to BAs never seen in training. That's what makes one-shot evaluation on 7 held-out BAs honest, and what foundation models structurally can't do.
A scaling sweep specific to this task. A 4×4 size×data grid plus 2 compute-corrected 40-epoch cells (18 cells total), fit with the Chinchilla-style additive form. Directional, not Kaplan-grade: the sweep spans ~2 orders of magnitude in N and ~1 in D, which isn't enough to calibrate α, β, or E∞ tightly. What's robust: at the sizes we tested, the parameter axis is dead and the data axis still has slope. The FERC-714 XBRL extension (+29% data → −0.014pp MAPE) hit that direction, validating "data still pays" as a finding even though the sweep can't validate β as a number.
FERC-714 XBRL data recovery. We wrote an XBRL parser to pull utility-level hourly demand from the messy FERC filings back to 2010, recovering the only piece of in-distribution data that measurably moved the metric, in the direction the scaling sweep predicted.
Anti-pollution discipline enforced in code. Assert gates in every loader, pretest-only normalization, one-shot temporal split, multi-seed paired-bootstrap on every headline gain. This protocol, not the model, is what caught BA-mixup, the L80 over-claim, and the "MAPE is operationally useless" mirage.
An operational-value framework. A battery-dispatch simulator under two objectives (peak-shaving + price arbitrage on real ERCOT 2024 hub prices) that translates MAPE into the units operators actually care about. Refuted our own most exciting interim finding.
Controlled head-to-head re-running the competitor. We re-trained PatchTST and a plain iTransformer ourselves on the identical Hong & Lee 2026 split, with a faithfulness check (our PatchTST reproduces their published 3.66 to within 0.02pp). The comparison stops depending on which published table their competitor number lived in. Against plain iTransformer (no perm-invariance, no instance-norm, no quantile loss) on this controlled ISO eval, our recipe wins 3.52 vs 3.69 macro, and the CI on the delta, [−0.196, −0.142], excludes zero. That's the closest direct ablation of "the recipe beyond an off-the-shelf iTransformer." For the headline held-out-BA task, plain iTransformer can't even run (positional embeddings tie each variate to a position; no slot for a never-seen BA at test time), so perm-invariance there is a prerequisite, not an ablation knob.
An EIA-930 telemetry-artifact audit. Caught that LDWP's "3% MAPE" was the meter, not the model. The fix is a pretest-derived cleaning mask, documented as a dataset issue, not a modeling one.
A documented graveyard. 13 negatives, including BA-mixup, MoE, foundation-model stacks, GridLAB-D synth v1/v2, pre-2018 EIA, interchange + fuel-mix channels, and weather covariates, each with a why. Negative results are the rare-to-publish artifact and the most useful single output of this project for anyone working the same problem.

The model alone is ~200 lines of PyTorch. The other 90% of the work is what makes the 1.02% defensible.
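The "order-of-BAs doesn't matter" property can be illustrated with a toy, stdlib-only sketch. This is not the project's ~200-line PyTorch model: `attention_over_variates`, the random weights, and the 5-token example are all hypothetical. It only shows that self-attention over BA tokens with no positional embedding gives each BA the same output however the BAs are ordered (strictly, the layer is permutation-equivariant).

```python
import math
import random


def attention_over_variates(tokens, w_q, w_k, w_v):
    """Single-head self-attention where each token is one BA's embedded history.

    No positional embedding is added, so nothing here depends on BA order.
    """
    def matvec(weight, vec):
        return [sum(w * x for w, x in zip(row, vec)) for row in weight]

    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(queries[0]))

    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# Hypothetical toy setup: 5 BA tokens of width 4, random projection weights.
rng = random.Random(0)
dim, n_bas = 4, 5


def rand_matrix():
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]


w_q, w_k, w_v = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bas)]

order = [3, 0, 4, 1, 2]  # an arbitrary reordering of the BAs
out = attention_over_variates(tokens, w_q, w_k, w_v)
out_shuffled = attention_over_variates([tokens[i] for i in order], w_q, w_k, w_v)

# Each BA's output should match its output under the original ordering.
max_gap = max(
    abs(a - b)
    for new_pos, old_pos in enumerate(order)
    for a, b in zip(out_shuffled[new_pos], out[old_pos])
)
print(f"largest per-BA difference after reordering: {max_gap:.2e}")
```

Because no weight is tied to a slot index, the same function also accepts a token for a BA that was never in the training set, which is the property the held-out evaluation relies on.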

How it's measured

The headline metric is MAPE on observed-positive hours, macro-averaged across the 7 held-out BAs, on a one-shot forward window (Nov 16 2024 + 2025-Feb–Dec). All 8 held-out BAs are entirely absent from training (7 are scored; LDWP is held out but excluded from the headline), so the metric measures genuine out-of-distribution generalization, not memorization of data the model has seen.
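A minimal sketch of the metric as described above, assuming plain Python lists of hourly values. The function names and the two toy BAs (`BA_A`, `BA_B`) are hypothetical, not the project's evaluation code or data.

```python
def mape_positive_hours(actual, forecast):
    """MAPE in percent over hours where observed demand is strictly positive."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a > 0]
    if not pairs:
        raise ValueError("no observed-positive hours to score")
    return 100.0 * sum(abs(f - a) / a for a, f in pairs) / len(pairs)


def macro_mape(series_by_ba):
    """Average the per-BA MAPEs so every BA counts equally, whatever its size."""
    per_ba = {
        ba: mape_positive_hours(actual, forecast)
        for ba, (actual, forecast) in series_by_ba.items()
    }
    return sum(per_ba.values()) / len(per_ba), per_ba


# Hypothetical toy data: one large BA with a zero-reading hour, one small BA.
toy = {
    "BA_A": ([1000.0, 1000.0, 0.0], [1010.0, 990.0, 5.0]),  # zero hour is skipped
    "BA_B": ([10.0, 10.0], [10.3, 9.7]),
}
headline, per_ba = macro_mape(toy)
print(per_ba, headline)
```

Macro-averaging means the small BA's 3% error moves the headline as much as the large BA's 1% error; a demand-weighted average would instead be dominated by the largest BAs.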

How we estimated the floor: a scaling sweep, not a Kaplan-grade law

We ran an 18-cell scaling sweep: 4 model sizes × 4 data scales (16 cells at 20 epochs) + 2 compute-corrected 40-epoch cells. To the resulting 18 MAPE numbers we fit the Chinchilla-style additive form MAPE(N, D) = E∞ + A·N⁻ᵅ + B·D⁻ᵝ, where N is model size and D is data scale. The fit gives E∞ ≈ 0.85% (range 0.84–0.91 across slice fits): a suggestive floor, not a calibrated one. The shape of the surface is what's load-bearing here, not the parameter values.
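To make the additive form concrete, here is a small sketch that evaluates it. Only `E_INF` (0.85) and `BETA` (0.5) come from this document; `A`, `ALPHA`, `B` and the data-scale units are placeholders chosen for illustration, not fitted values.

```python
E_INF = 0.85   # fitted asymptote quoted in the text, in percent MAPE
BETA = 0.5     # data exponent quoted in the text
A, ALPHA, B = 5.0, 0.5, 20.0  # PLACEHOLDERS, not the fitted coefficients


def model_term(n_params):
    """Error attributable to finite model size: A * N^-alpha."""
    return A * n_params ** -ALPHA


def data_term(n_data):
    """Error attributable to finite data: B * D^-beta."""
    return B * n_data ** -BETA


def additive_form(n_params, n_data):
    """MAPE(N, D) = E_inf + A*N^-alpha + B*D^-beta."""
    return E_INF + model_term(n_params) + data_term(n_data)


# With beta = 0.5, quadrupling the data halves the data term.
for d in (1e4, 4e4, 1.6e5):
    print(f"D={d:>9.0f}  data term={data_term(d):.4f}  "
          f"MAPE={additive_form(2.2e6, d):.4f}")
```

The two power-law terms can only shrink toward zero, so the form approaches `E_INF` from above; that is why the fitted asymptote reads as a floor, and why an uncertain β makes the remaining headroom uncertain too.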

Figure: scaling sweep, MAPE vs. data scale. Each line is one model size across data scales · ★ = 40-epoch (compute-corrected) cell · dashed red = fitted floor E∞ ≈ 0.85% (directional, not a calibrated CI).

Two readings the picture supports (qualitatively, not quantitatively):

Model-size axis is dead at the sizes we tested. L (2.2M) ties or beats XL (7.2M) at 100% data on 20 ep; once you train at matched compute, the bigger model buys nothing in our regime. (Honest caveat: this might just mean XL needs re-tuned LR + warmup. We did not re-tune per-N.)
Data axis still has slope. Every model size's MAPE drops monotonically as data grows. β ≈ 0.5 falls out of the fit, broadly consistent with Chinchilla-flavored time-series literature; the FERC-714 XBRL extension (+29% data → −0.014pp MAPE) hits the projected direction, but the test is too small to validate β as a number.

What this sweep is and isn't. A Kaplan/Hoffmann-grade scaling law needs ≥4 orders of magnitude in N, ≥3 orders in D, isoFLOPs compute-optimal sweeps, ≥3 seeds per cell, per-N HP tuning, and fit on NLL not MAPE. Ours covers ~2 orders in N (83K → 7.2M, 86×) and ~1 order in D (8×), with single-seed cells at most points and one shared HP schedule. Treat the surface as directional evidence, "data axis still pays, size axis doesn't at these sizes", rather than as a calibrated estimate of α, β, or E∞. The L × 40ep cell at 1.061% is the lowest we ever observed; the parametric extrapolation says there's a bit more room below it, but the headroom number is not tight.

Why we test on two dispatch modes

The "does accuracy matter" claim is much stronger if it holds under two different uses of the forecast, because they stress it in different ways. We test both:

mode	what the battery does	needs from forecast	persistence floor	forecast captures above floor
peak-shaving	discharge during the daily demand peak to chop the monthly demand-charge bill (real money for industrial customers + utilities deferring transmission upgrades)	which hour will be the peak, a single hour that moves day-to-day with weather	~51%, persistence misses shifted peaks (heat dome shifts peak from 5pm to 7pm; persistence still discharges at 5pm)	88.6% of perfect-foresight value
price arbitrage	buy electricity cheap, sell expensive, ride wholesale LMP swings (a multi-billion-dollar real industry; ERCOT alone has ~5 GW of batteries chasing it)	shape of tomorrow's whole-day hourly price curve (since price tracks load, good load forecast → good price forecast)	~75%, daily price cycle is fairly regular, even repeat-yesterday captures the shape	85.9% of perfect-foresight value

Different sensitivities: peak-shaving lives or dies on one hour's timing; arbitrage lives on the whole-day shape. If the forecast captured value under only one we'd have to caveat ("accuracy pays for this use case"). It captures 86–89% under both, so the claim "MAPE translates to operational value" is robust to which dispatch problem you actually care about.
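The "captures above floor" column can be read as (forecast value − persistence value) / (perfect-foresight value − persistence value). The toy below sketches that for peak-shaving on a single day, reusing the heat-dome example from the table (peak moves from 5pm to 7pm). The one-hour battery, the load shapes, and all function names are hypothetical simplifications, not the project's dispatch simulator, which aggregates over many BA-days.

```python
def peak_reduction(actual, signal, power_mw):
    """MW shaved off the realized daily peak when the battery discharges
    `power_mw` in the single hour that `signal` ranks as the peak."""
    hour = max(range(len(signal)), key=signal.__getitem__)
    shaved = list(actual)
    shaved[hour] -= power_mw
    return max(actual) - max(shaved)


def capture_above_floor(v_forecast, v_persist, v_perfect):
    """Share of the perfect-foresight gain over persistence that the forecast keeps."""
    return (v_forecast - v_persist) / (v_perfect - v_persist)


def load_shape(peak_hour, peak_mw, shoulder_hour, shoulder_mw, base_mw=100):
    """Hypothetical 24-hour load: flat base with one peak and one shoulder hour."""
    day = [base_mw] * 24
    day[peak_hour] = peak_mw
    day[shoulder_hour] = shoulder_mw
    return day


yesterday = load_shape(17, 130, 19, 115)   # peak at 5pm
today = load_shape(19, 130, 17, 115)       # heat dome moves the peak to 7pm
forecast = load_shape(19, 128, 17, 116)    # forecast gets the timing right

v_perfect = peak_reduction(today, today, power_mw=10)
v_persist = peak_reduction(today, yesterday, power_mw=10)  # discharges at 5pm
v_forecast = peak_reduction(today, forecast, power_mw=10)

print(v_perfect, v_persist, v_forecast,
      capture_above_floor(v_forecast, v_persist, v_perfect))
```

In this toy the forecast's 2 MW magnitude error costs nothing, while persistence's two-hour timing error costs everything; that is the sense in which peak-shaving "lives or dies on one hour's timing."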

Per-model dollar savings, three battery durations

We ran the dispatch sim across the foundation-model forecasts we'd already trained, at three battery durations, and converted to dollars per MW of installed battery per year. Tariff: $15/kW-month demand charge (midpoint of PG&E E-19 and ConEd SC9-II). Arbitrage: real ERCOT-2024 day-ahead-market hub prices. Why three durations: 2h dominates installed C&I storage today, 4h is the FERC Order 841 / CAISO RA reference, 8h is the long-duration / utility-scale regime.

signal	2h total $/MW-yr	4h total $/MW-yr	8h total $/MW-yr	Δ vs persist-24h (4h)
perfect foresight (ceiling)	$93k	$184k	$246k	—
GridFloor (ours)	$84k	$177k	$242k	+$47k
Chronos-2 fine-tuned	$73k	$151k	$221k	+$22k
Chronos-2 zero-shot	$69k	$149k	$220k	+$19k
Sundial	$58k	$128k	$209k	−$1k
persist-24h	$58k	$130k	$193k	—
persist-168h	$58k	$126k	$188k	−$4k

Bench: 49 EIA-930 BAs ex-LDWP · 2025 Feb-Dec · 15,974 BA-days × 3 durations · 10% of mean BA demand · branch feature/dispatch-bench.

The Sundial monotone break is duration-dependent. You'd expect lower MAPE → higher dollar savings. Sundial has lower MAPE than persist-24h (4.85% vs 6.22%) at every duration, but on peak-shaving dollars:

2h: Sundial $25k vs persist-24h $28k, loses by $3k (monotone broken)
4h: Sundial $72k vs persist-24h $77k, loses by $5k (monotone broken)
8h: Sundial $126k vs persist-24h $114k, wins by $12k (monotone restored)

So it's not a 4-hour quirk: the break persists wherever peak-hour timing matters, and only vanishes once the battery is long enough that shoulder hours cover up Sundial's tail-day mis-calls. Tail-day accuracy beats mean MAPE at every duration short enough to actually depend on timing. The percent-capture aggregate hides this; only the dollar conversion surfaces it.
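The monotone check itself is a few lines. The sketch below uses only the MAPE and peak-shaving dollar figures quoted above; the function and variable names are illustrative.

```python
# Figures quoted in the text: MAPE in percent, peak-shaving value in $k/MW-yr.
MAPE = {"Sundial": 4.85, "persist-24h": 6.22}
PEAK_SHAVING_K_USD = {
    "2h": {"Sundial": 25, "persist-24h": 28},
    "4h": {"Sundial": 72, "persist-24h": 77},
    "8h": {"Sundial": 126, "persist-24h": 114},
}


def monotone_broken(mape, dollars, better, worse):
    """True when `better` has the lower MAPE but still earns fewer dollars."""
    return mape[better] < mape[worse] and dollars[better] < dollars[worse]


breaks = {
    duration: monotone_broken(MAPE, dollars, "Sundial", "persist-24h")
    for duration, dollars in PEAK_SHAVING_K_USD.items()
}
gaps_k_usd = {
    duration: dollars["Sundial"] - dollars["persist-24h"]
    for duration, dollars in PEAK_SHAVING_K_USD.items()
}
print(breaks)       # which durations invert the MAPE ranking
print(gaps_k_usd)   # Sundial minus persist-24h, in $k/MW-yr
```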

$/MW-year is the academic-standard unit but it's the wrong unit for "is this a lot." Anchored to actual fleet sizes at the 4-hour reference duration:

scope	scale	vs persist-24h /yr	vs best foundation model /yr
per MW-year	1 MW (smallest C&I unit)	+$47k	+$25k
typical utility BESS	100 MW per system	+$4.7M	+$2.5M
big utility fleet	1 GW deployed (PG&E-class)	+$47M	+$25M
US grid-scale storage EOY 2025	~41 GW (EIA: 26 GW EOY 2024 + ~15 GW added in 2025)	~$1.9B	~$1.0B
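The table above is a straight linear scaling of the 4-hour per-MW deltas. The sketch below reproduces it; the assumption, implicit in the table, is that every installed MW earns the same uplift. Names are illustrative.

```python
DELTA_VS_PERSIST_USD = 47_000   # $/MW-yr, GridFloor vs persist-24h at 4h
DELTA_VS_BEST_FM_USD = 25_000   # $/MW-yr, GridFloor vs best foundation model at 4h

SCOPES_MW = {
    "per MW-year": 1,
    "typical utility BESS": 100,
    "big utility fleet": 1_000,
    "US grid-scale storage EOY 2025": 41_000,   # ~41 GW
}


def fleet_value_usd(delta_usd_per_mw_year, fleet_mw):
    """Annual dollar uplift, assuming every MW earns the same per-MW delta."""
    return delta_usd_per_mw_year * fleet_mw


table = {
    scope: (
        fleet_value_usd(DELTA_VS_PERSIST_USD, mw),
        fleet_value_usd(DELTA_VS_BEST_FM_USD, mw),
    )
    for scope, mw in SCOPES_MW.items()
}
for scope, (vs_persist, vs_fm) in table.items():
    print(f"{scope:<32} ${vs_persist:>14,}  ${vs_fm:>14,}")
```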

How to read these numbers, honestly. Real grid-scale BESS earns from up to 5 revenue streams; we modeled 2 of them. Here's where each stands and what's recently changed about how they stack:

revenue stream	modeled?	where it stands in 2025
energy arbitrage (price-curve trading)	✓ yes, ERCOT 2024 DAM hub prices	Now the majority of ERCOT BESS revenue (per Modo Energy Q3 2025)
demand-charge mgmt (peak shaving)	✓ yes, $15/kW-mo midpoint (PG&E E-19 / ConEd SC9-II)	steady; relevant for C&I behind-the-meter storage
capacity / Resource Adequacy payments	✗ no	~$50–85/kW-yr in CAISO; not in our sim
ancillary services (Reg, ECRS, spin)	✗ no	fell ~90% from 2023 peak as ERCOT batteries saturated (per pv-magazine Nov 2025); was 84% of revenue in 2023, now 48% and falling
wholesale energy in ancillary-cleared hours	✗ no	secondary stream, market-dependent

What this means for our number. In 2023 a "we modeled only the least forecast-sensitive streams" caveat would have been correct, because ancillaries dominated. In 2025 it's reversed: arbitrage and demand charges are the dominant revenue lines, so our dispatch sim covers the streams that actually pay grid-scale BESS today. The 3 streams we did not model still exist and forecast accuracy still helps in them, but they're a smaller fraction of the modern revenue stack than a 2-of-5 stream count suggests, so our $/MW-yr is closer to a calibrated point estimate than a floor (directionally honest, not deliberately conservative).

GridFloor instead of persistence is worth +$47k per MW of battery per year at the 4-hour reference; multiplied by the ~41 GW US grid-scale fleet (EOY 2025), that's ~$1.9B/year the country leaves on the table by dispatching on persistence, and ~$1.0B/year by dispatching on the best foundation model (Chronos-2 FT).

One subtlety worth noting separately. Two ways to read the same dispatch result tell opposite stories. By percent of perfect-foresight value, GridFloor captures 89% → 95% → 97% across 2h / 4h / 8h durations (gap to perfect shrinks). By absolute $/MW-yr above persistence, the same model adds +$25k / +$47k / +$49k (the gain grows from 2h to 4h, then holds; it never shrinks). The percent metric makes longer batteries look like they need less forecast accuracy; the dollar metric shows they don't, because the perfect-foresight ceiling grows with duration just as fast as the model gap. If you care about money rather than vanity capture-rate, absolute dollars is the right axis.

Glossary

term	what it is
BA (balancing authority)	Entity that physically balances supply & demand on a chunk of the US grid. ~66 total. Examples: `CISO` (California ISO), `PJM`, `ERCO` (Texas), `MISO`, `BPAT` (Pacific NW), `DUK` (Carolinas), `LDWP` (LA Dept. of Water & Power).
EIA-930	The federal hourly grid feed. For every BA, every hour: demand, the BA's own published day-ahead forecast, generation, fuel mix, interchange. The source dataset.
day-ahead load forecast	Prediction made today for tomorrow's hourly demand. The series operators actually schedule generation against.
MAPE	Mean Absolute Percentage Error. "1% MAPE" means the average hourly miss is 1% of the true value.
incumbent forecast	Each BA's own published day-ahead forecast (also in EIA-930). The baseline that matters, what utilities actually dispatch against today.
held-out BAs	8 BAs (`SWPP`, `DUK`, `BPAT`, `TVA`, `AZPS`, `NEVP`, `PSEI`, `LDWP`) the model never sees in training. The headline 1.02% is on 7, `LDWP` is excluded because of EIA-930 telemetry dropouts (~0.13% of hours show bogus near-zero readings; we document it + ship a cleaning rule, but keep it out of the headline so a dataset artifact doesn't contaminate the metric).
iTransformer	A normal time-series transformer (Informer, PatchTST, vanilla) treats each time step, or patch of time steps, as a token and uses attention to model hour-to-hour dependencies within one series. The iTransformer (Liu et al. 2024) inverts this: each variate (for us, each BA's whole 168h history embedded into one vector) is a token, and attention models cross-BA dependencies (CAISO ↔ ERCOT ↔ PJM …). For load forecasting that's the dominant signal: heatwaves move across regions, weekend dips hit everyone at once. Foundation models miss it because they're univariate.
perm-invariant iTransformer	Our recipe: order-of-BAs doesn't matter, so the trained model generalizes to BAs it never saw.
foundation model	Large pre-trained time-series model (Chronos-2, Sundial, TimesFM, …). Univariate by design, sees one series at a time. The structural reason they can't compete here.
floor E∞	The irreducible error this architecture asymptotes to as model size and data grow. Our 18-cell sweep fits ≈ 0.85% under a Chinchilla-style additive form, with the best observed cell at 1.061%. Treat as directional, the sweep doesn't span enough decades in N and D to calibrate a real scaling law.
dispatch value	How much money / CO₂ a forecast saves a battery operator who schedules charge/discharge against it. Measured as a fraction of perfect-foresight value above 24h-persistence.
FERC (Federal Energy Regulatory Commission)	US agency that regulates interstate wholesale electricity (and gas) markets, approves ISO rules, sets wholesale market structure, requires data filings. "FERC Order 841" is the 2018 ruling that forced ISOs to let battery storage participate in wholesale energy + ancillary markets, which is why 4-hour became the de-facto reference battery duration.
ERCOT (Electric Reliability Council of Texas)	One of the 7 US ISOs and the only one whose grid is electrically isolated from the rest of the country (the "Texas interconnect"). Record hourly peak load of ~85 GW. Runs an aggressive energy-only market with the deepest battery-storage deployment in the US (~15 GW EOY 2025). Used as our arbitrage price source because their day-ahead market data is the most publicly accessible.
LMP (Locational Marginal Price)	The wholesale electricity price at a specific node or zone of an ISO grid, published hour-by-hour by the day-ahead and real-time markets. Varies by location because of transmission congestion and line losses. `HB_BUSAVG` (Hub Bus Average) is ERCOT's average across its hub nodes, what we use as a representative system-wide arbitrage signal in the dispatch sim.
FERC Form 714	Annual FERC filing where every US balancing authority + planning area reports its hourly demand history. The only public source of utility-level hourly load data outside EIA-930 (which only starts in 2015). We extracted 2010–2024 hourly demand from this dataset to extend the training corpus by +29%.
XBRL (eXtensible Business Reporting Language)	XML-based standard the SEC and FERC use for structured business + utility filings. FERC requires Form 714 in XBRL since 2010; we wrote a parser to pull hourly demand out of the XML tree so the data could be added to training. The only data source that empirically moved MAPE on hold-ex-LDWP (−0.014pp from +29% data, matching the scaling sweep's β ≈ 0.5 direction).
multi-seed / paired bootstrap	Confirming a "win" by running multiple seeds and requiring the CI on the delta to exclude zero. Killed several apparent wins (e.g. BA-mixup).
SHIP / KEEP / PARTIAL / DROP	Verdict tags on every experiment. SHIP = adopted, KEEP = real but minor, PARTIAL = improved but below gate, DROP = negative or within noise.
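The "multi-seed / paired bootstrap" entry can be sketched as follows. This is a generic percentile bootstrap on paired deltas, assuming one error value per paired unit (for example one MAPE per seed × BA). The function names, the 2,000 resamples, and the toy numbers are assumptions for illustration, not the project's gate.

```python
import random


def paired_bootstrap_ci(errors_a, errors_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI on mean(errors_a - errors_b), resampling paired units."""
    if len(errors_a) != len(errors_b):
        raise ValueError("paired inputs must have equal length")
    deltas = [a - b for a, b in zip(errors_a, errors_b)]
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def excludes_zero(ci):
    """A delta counts as a win (or loss) only if the whole CI sits on one side of 0."""
    lo, hi = ci
    return lo > 0 or hi < 0


# Hypothetical per-unit errors; NOT project data.
baseline = [1.20 + 0.01 * i for i in range(40)]
candidate = [b - 0.05 for b in baseline]            # consistent 0.05pp gain
noisy = [b + (0.05 if i % 2 else -0.05)             # gains and losses cancel out
         for i, b in enumerate(baseline)]

clear_win_ci = paired_bootstrap_ci(candidate, baseline)
noise_ci = paired_bootstrap_ci(noisy, baseline)
print(clear_win_ci, excludes_zero(clear_win_ci))
print(noise_ci, excludes_zero(noise_ci))
```

Pairing matters: resampling the per-unit deltas, rather than each model's errors separately, removes the variance that both models share on hard days or hard BAs, so a small but consistent gain can still clear the gate.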

Why this matters in the real world

It's the metric utilities actually run against. Grid operators commit generation a day ahead against the load forecast. A 1.02 vs 9.04 gap on a 100 GW grid is ~8 GW less missed per hour on average (8.02% of 100 GW), about eight large power plants. Lower error means fewer reserve plants kept "just in case" → lower cost, lower emissions. The dispatch finding shows the gain isn't academic: 86–89% of the value perfect foresight would capture, above the 24h-persistence floor, under both peak-shaving and price arbitrage.

It bounds the field. The scaling-sweep floor at E∞ ≈ 0.85% suggests that, at the sizes we tested, extra parameters or epochs won't push the error meaningfully lower. The remaining 0.1pp is a data-acquisition question, not a modeling one, and same-distribution data is nearly exhausted. So the honest next direction is structural (sub-BA zonal expansion, renewable-gen as a parallel target) rather than bigger models.

What's still open

Close the last 0.1pp to the floor: would require more in-distribution data (sub-BA zonal series, careful handling of AEMO / ENTSO-E distribution shift).
Per-region renewable forecasting: same cross-region structure should help; weather is the upstream driver, a perfect candidate for the decomposed approach.
Net-load / duck-curve forecasting: the operationally-critical one for dispatch scheduling as solar penetration grows.
Calibrated probabilistic intervals: point MAPE is solved; sharpness + coverage on the desert-Southwest BAs is genuinely hard (interval width is bounded by point error).