
Cheatsheet — the whole project in one page

a short read for someone new to grid forecasting · what the problem is, what we found, and the vocabulary

The problem in one paragraph

The US grid is split into ~66 control areas called balancing authorities (CAISO, PJM, ERCOT, …). Each has to match electricity generation to electricity demand in real time. To schedule generators economically, they need to know tomorrow's hourly demand today — a day-ahead load forecast. Errors are expensive both ways: under-forecast and you scramble for peaking plants (or shed load); over-forecast and you hold costly spinning reserves that burn fuel for nothing. Even a fraction of a percent of error compounds into millions of dollars and megatons of CO₂ across a year.

The result, in three lines

1.02% MAPE on 7 of 8 held-out BAs (hold-ex-LDWP) forward-2025 — versus each BA's own published day-ahead forecast at 9.04% on the identical eval. ~8.9× better, ~0.1pp above the fitted floor.
| finding | what it says |
|---|---|
| GridFloor SOTA (multi-seed) | 1.03% MAPE, near the empirical floor from our scaling sweep (best observed cell 1.061%; fitted asymptote ≈ 0.85% under a Chinchilla-style additive form; directional, not a calibrated CI). |
| Foundation models plateau | Chronos-2, Sundial, TabPFN-TS all stuck at 3–4%. They're univariate; can't encode cross-BA structure. |
| Accuracy → operational value | The forecast captures 86–89% of perfect-foresight battery-dispatch value above 24h-persistence, under both peak-shaving and price arbitrage. |
| Only data moves it | Model size + epochs are dead levers; data is the only one with slope, and same-distribution data is nearly exhausted. |

What we actually built (not an off-the-shelf model)

The model architecture is in the iTransformer family, but the substantive contributions live around it. The result number depends on every one of these:

  1. A permutation-invariant adaptation of the iTransformer. Each BA becomes a cross-attention token; instance-norm + quantile loss + static BA features form a recipe that generalizes to BAs never seen in training. That's what makes one-shot evaluation on 7 held-out BAs honest — and what foundation models structurally can't do.
  2. A scaling sweep specific to this task. A 4×4 size×data grid plus 2 compute-corrected 40-epoch cells (18 cells total), fit with the Chinchilla-style additive form. Directional, not Kaplan-grade: the sweep spans ~2 orders of magnitude in N and ~1 in D, which isn't enough to calibrate α, β, or E∞ tightly. What's robust: at the sizes we tested, the parameter axis is dead and the data axis still has slope. The FERC-714 XBRL extension (+29% data → −0.014pp MAPE) moved in that direction, validating "data still pays" as a finding even though the sweep can't validate β as a number.
  3. FERC-714 XBRL data recovery. We wrote an XBRL parser to pull utility-level hourly demand from the messy FERC filings back to 2010 — recovering the only piece of in-distribution data that measurably moved the metric, exactly as the scaling law predicted.
  4. Anti-pollution discipline enforced in code. Assert gates in every loader, pretest-only normalization, one-shot temporal split, multi-seed paired-bootstrap on every headline gain. This protocol — not the model — is what caught BA-mixup, the L80 over-claim, and the "MAPE is operationally useless" mirage.
  5. An operational-value framework. A battery-dispatch simulator under two objectives (peak-shaving + price arbitrage on real ERCOT 2024 hub prices) that translates MAPE into the units operators actually care about. Refuted our own most exciting interim finding.
  6. Controlled head-to-head re-running the competitor. We re-trained PatchTST and a plain iTransformer ourselves on the identical Hong & Lee 2026 split, with a faithfulness check (our PatchTST reproduces their published 3.66 to 0.02pp). The comparison stops depending on which published table their competitor number lived in. Against plain iTransformer (no perm-invariance, no instance-norm, no quantile loss) on this controlled ISO eval our recipe wins 3.52 vs 3.69 macro, CI [−0.196, −0.142] excludes zero — that's the closest direct ablation of "the recipe beyond an off-the-shelf iTransformer." For the headline held-out-BA task, plain iTransformer can't even run (positional embeddings tie each variate to a position; no slot for a never-seen BA at test time) — perm-invariance there is a prerequisite, not an ablation knob.
  7. An EIA-930 telemetry-artifact audit. Caught that LDWP's "3% MAPE" was the meter, not the model — pretest-derived cleaning mask, documented as a dataset issue not a modeling one.
  8. A documented graveyard. 13 negatives — BA-mixup, MoE, foundation-model stacks, GridLAB-D synth v1/v2, pre-2018 EIA, interchange + fuel-mix channels, weather covariates — each with a why. Negative results are the rare-to-publish artifact and the most useful single output of this project for anyone working the same problem.
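
The multi-seed paired-bootstrap gate from item 4 can be sketched as below. This is a minimal illustration and not the project's code: the function names, the resample count, and the toy per-BA error values are all assumptions made for the example.

```python
import random

def paired_bootstrap_ci(errors_a, errors_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI on mean(errors_a - errors_b), resampling paired units together."""
    if len(errors_a) != len(errors_b):
        raise ValueError("paired samples must have equal length")
    deltas = [a - b for a, b in zip(errors_a, errors_b)]
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        # Resample the paired deltas with replacement, keeping each pair intact.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def is_real_win(errors_a, errors_b, **kwargs):
    """Model A 'wins' only if the whole CI on (A - B) sits below zero."""
    _, hi = paired_bootstrap_ci(errors_a, errors_b, **kwargs)
    return hi < 0.0

# TOY per-BA MAPE values (made up): candidate is consistently lower than baseline.
candidate = [1.00, 1.10, 0.95, 1.20, 1.05, 0.90, 1.15]
baseline = [1.16, 1.24, 1.11, 1.36, 1.19, 1.07, 1.29]
# TOY noise case (made up): the sign of the delta flips from unit to unit.
noisy_a = [1.0, 1.2, 1.0, 1.2]
noisy_b = [1.2, 1.0, 1.2, 1.0]

print("consistent gain passes gate:", is_real_win(candidate, baseline))
print("sign-flipping gain passes gate:", is_real_win(noisy_a, noisy_b))
```

The point of pairing is that both models are scored on the same units, so the resampling sees the per-unit delta rather than two independent error distributions.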

The model alone is ~200 lines of PyTorch. The other 90% of the work is what makes the 1.02% defensible.

How it's measured

The headline metric is MAPE on observed-positive hours, macro-averaged across the 7 held-out BAs, on a one-shot forward window (Nov 16 2024 + 2025-Feb–Dec). 8 BAs are entirely absent from training, so the metric measures genuine out-of-distribution generalization — not memorization on data the model has seen.
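
A minimal sketch of that metric, assuming "observed-positive hours" means hours whose observed demand is strictly greater than zero, and that the macro average is an unweighted mean of per-BA scores. The function names and the toy series are hypothetical.

```python
def mape_positive_hours(actual, forecast):
    """MAPE in percent over hours where the observed load is strictly positive."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a > 0]
    if not pairs:
        raise ValueError("no observed-positive hours to score")
    return 100.0 * sum(abs(f - a) / a for a, f in pairs) / len(pairs)

def macro_mape(per_ba):
    """Unweighted mean of per-BA MAPE, so large BAs do not dominate the score."""
    scores = {ba: mape_positive_hours(a, f) for ba, (a, f) in per_ba.items()}
    return sum(scores.values()) / len(scores), scores

# TOY data (made up): (observed, forecast) per BA; the zero-demand hour is skipped.
toy = {
    "BA_X": ([100.0, 200.0, 0.0], [101.0, 198.0, 5.0]),
    "BA_Y": ([50.0, 50.0], [51.0, 49.0]),
}
macro, per_ba_scores = macro_mape(toy)
print(per_ba_scores, macro)
```

In the toy data BA_X scores 1.0% and BA_Y scores 2.0%, so the macro average is 1.5% regardless of how many hours each BA contributes.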

How we estimated the floor — scaling sweep, not a Kaplan-grade law

We ran an 18-cell scaling sweep at 4 model sizes × 4 data scales (16 cells at 20 epochs) + 2 compute-corrected 40-epoch cells. To each set of 18 MAPE numbers we fit the Chinchilla-style additive form MAPE(N, D) = E∞ + A·N⁻ᵅ + B·D⁻ᵝ. The fit gives E∞ ≈ 0.85% (range 0.84–0.91 across slice fits) — a suggestive floor, not a calibrated one. The shape of the surface is what's load-bearing here, not the parameter values.
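
To make the shape of that additive form concrete, here is a small sketch that evaluates it. Only E∞ = 0.85 echoes the text; A, α, B, β and the data-scale units are hypothetical values chosen so the surface behaves qualitatively like the description (flat in N, sloped in D). It is not the fitted surface.

```python
def mape_surface(n_params, n_data, e_inf, a, alpha, b, beta):
    """Chinchilla-style additive form: E_inf + A * N**(-alpha) + B * D**(-beta)."""
    return e_inf + a * n_params ** (-alpha) + b * n_data ** (-beta)

# HYPOTHETICAL coefficients: only e_inf echoes the text; the rest are made up.
COEFFS = {"e_inf": 0.85, "a": 0.5, "alpha": 0.5, "b": 40.0, "beta": 0.5}

N_SMALL, N_LARGE = 83_000, 7_200_000   # parameter range quoted for the sweep
D_SMALL, D_LARGE = 10_000, 80_000      # arbitrary data units, 8x apart

base = mape_surface(N_SMALL, D_SMALL, **COEFFS)
size_gain = base - mape_surface(N_LARGE, D_SMALL, **COEFFS)
data_gain = base - mape_surface(N_SMALL, D_LARGE, **COEFFS)
far_out = mape_surface(10**12, 10**12, **COEFFS)

print(f"gain from ~86x more parameters: {size_gain:.4f} pp")
print(f"gain from 8x more data:         {data_gain:.4f} pp")
print(f"value far out on both axes:     {far_out:.4f} (approaches e_inf)")
```

With both power-law terms positive, the surface can only approach E∞ from above, which is why the fitted asymptote reads as a floor.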

[Figure: scaling sweep. Each line is one model size across data scales · ★ = 40-epoch (compute-corrected) cell · dashed red = fitted floor E∞ ≈ 0.85% (directional, not a calibrated CI).]

Two readings the picture supports (qualitatively, not quantitatively): the model-size axis is flat at the sizes tested, while the data axis still has slope.

What this sweep is and isn't. A Kaplan/Hoffmann-grade scaling law needs ≥4 orders of magnitude in N, ≥3 orders in D, isoFLOPs compute-optimal sweeps, ≥3 seeds per cell, per-N HP tuning, and fit on NLL not MAPE. Ours covers ~2 orders in N (83K → 7.2M, 86×) and ~1 order in D (8×), with single-seed cells at most points and one shared HP schedule. Treat the surface as directional evidence — "data axis still pays, size axis doesn't at these sizes" — rather than as a calibrated estimate of α, β, or E∞. The L × 40ep cell at 1.061% is the lowest we ever observed; the parametric extrapolation says there's a bit more room below it, but the headroom number is not tight.

Why we test on two dispatch modes

The "does accuracy matter" claim is much stronger if it holds under two different uses of the forecast — they stress it in different ways. We test both:

| mode | what the battery does | needs from forecast | persistence floor | forecast captures above floor |
|---|---|---|---|---|
| peak-shaving | discharge during the daily demand peak to chop the monthly demand-charge bill (real money for industrial customers + utilities deferring transmission upgrades) | which hour will be the peak: a single hour that moves day-to-day with weather | ~51%: persistence misses shifted peaks (heat dome shifts peak from 5pm to 7pm; persistence still discharges at 5pm) | 88.6% of perfect-foresight value |
| price arbitrage | buy electricity cheap, sell expensive; ride wholesale LMP swings (a multi-billion-dollar real industry; ERCOT alone has ~5 GW of batteries chasing it) | shape of tomorrow's whole-day hourly price curve (since price tracks load, good load forecast → good price forecast) | ~75%: daily price cycle is fairly regular, even repeat-yesterday captures the shape | 85.9% of perfect-foresight value |

Different sensitivities: peak-shaving lives or dies on one hour's timing; arbitrage lives on the whole-day shape. If the forecast captured value under only one we'd have to caveat ("accuracy pays for this use case"). It captures 86–89% under both, so the claim "MAPE translates to operational value" is robust to which dispatch problem you actually care about.
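
The timing sensitivity of peak-shaving can be illustrated with a toy day. This is a deliberately simplified sketch, not the project's dispatch simulator: it assumes a single one-hour discharge, scores the daily rather than monthly peak, and uses made-up load shapes that mimic the 5pm-to-7pm shift mentioned in the table.

```python
def peak_hour(series):
    """Index of the maximum value (first index on ties)."""
    return max(range(len(series)), key=series.__getitem__)

def peak_shave_value(actual_load, discharge_hour, battery_mw):
    """Reduction in the day's peak when the battery discharges for one hour."""
    net = list(actual_load)
    net[discharge_hour] -= battery_mw
    return max(actual_load) - max(net)

def flat_day(bumps, base=60.0):
    """24 hourly values at `base`, with selected hours overridden."""
    day = [base] * 24
    for hour, value in bumps.items():
        day[hour] = value
    return day

# TOY loads in MW (made up): yesterday peaked at 17:00, today peaks at 19:00.
yesterday = flat_day({17: 100.0, 19: 90.0})
today = flat_day({17: 90.0, 19: 100.0})
forecast = flat_day({17: 91.0, 19: 98.0})   # imperfect levels, correct peak hour
BATTERY_MW = 5.0

v_persist = peak_shave_value(today, peak_hour(yesterday), BATTERY_MW)
v_forecast = peak_shave_value(today, peak_hour(forecast), BATTERY_MW)
v_perfect = peak_shave_value(today, peak_hour(today), BATTERY_MW)
print(v_persist, v_forecast, v_perfect)
```

Persistence discharges at yesterday's peak hour and shaves nothing off today's peak, while a forecast that gets the hour right captures the full value even though its levels are off.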

Per-model dollar savings — three battery durations

We ran the dispatch sim across the foundation-model forecasts we'd already trained, at three battery durations, and converted to dollars per MW of installed battery per year. Tariff: $15/kW-month demand charge (midpoint of PG&E E-19 and ConEd SC9-II). Arbitrage: real ERCOT-2024 day-ahead-market hub prices. Why three durations: 2h dominates installed C&I storage today, 4h is the FERC Order 841 / CAISO RA reference, 8h is the long-duration / utility-scale regime.

| signal | 2h total $/MW-yr | 4h total $/MW-yr | 8h total $/MW-yr | Δ vs persist-24h (4h) |
|---|---|---|---|---|
| perfect foresight (ceiling) | $93k | $184k | $246k | |
| GridFloor (ours) | $84k | $177k | $242k | +$47k |
| Chronos-2 fine-tuned | $73k | $151k | $221k | +$21k |
| Chronos-2 zero-shot | $69k | $149k | $220k | +$19k |
| Sundial | $58k | $128k | $209k | −$2k |
| persist-24h | $58k | $130k | $193k | |
| persist-168h | $58k | $126k | $188k | −$4k |

Bench: 49 EIA-930 BAs ex-LDWP · 2025 Feb-Dec · 15,974 BA-days × 3 durations · 10% of mean BA demand · branch feature/dispatch-bench.
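
The "fraction of perfect-foresight value above persistence" can be recomputed from the 4h totals in the table above. A small sketch follows; the function name is hypothetical, and note that it works on the combined totals, whereas the 86–89% range quoted earlier is stated per dispatch mode.

```python
def capture_fraction(v_signal, v_persist, v_perfect):
    """Share of the perfect-foresight gain over persistence that a signal captures."""
    headroom = v_perfect - v_persist
    if headroom <= 0:
        raise ValueError("perfect foresight must beat persistence")
    return (v_signal - v_persist) / headroom

# 4h totals in $k per MW-year, copied from the table above.
TOTALS_4H = {
    "GridFloor": 177,
    "Chronos-2 fine-tuned": 151,
    "Chronos-2 zero-shot": 149,
    "Sundial": 128,
    "persist-168h": 126,
}
PERFECT_4H, PERSIST_24H_4H = 184, 130

fractions = {
    name: capture_fraction(total, PERSIST_24H_4H, PERFECT_4H)
    for name, total in TOTALS_4H.items()
}
for name, frac in fractions.items():
    delta = TOTALS_4H[name] - PERSIST_24H_4H
    print(f"{name:22s} delta {delta:+4d}k  capture {frac:+.1%}")
```

For GridFloor this is 47/54, about 87% of the available headroom at 4h; Sundial comes out slightly negative because its 4h total sits below persist-24h.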

The Sundial monotone break is duration-dependent, and that's the real finding. You'd expect lower MAPE → higher dollar savings. Sundial has lower MAPE than persist-24h (4.85% vs 6.22%), but its dollars don't follow: in the totals above it ties persist-24h at 2h ($58k vs $58k), trails it at 4h ($128k vs $130k), and pulls ahead only at 8h ($209k vs $193k). So it's not a 4-hour quirk: the break persists wherever peak-hour timing matters, and only vanishes once the battery is long enough that shoulder hours cover up Sundial's tail-day mis-calls. Tail-day accuracy beats mean MAPE at every duration short enough to actually depend on timing. The percent-capture aggregate hides this; only the dollar conversion surfaces it.

$/MW-year is the academic-standard unit but it's the wrong unit for "is this a lot." Anchored to actual fleet sizes at the 4-hour reference:

| scope | scale | vs persist-24h /yr | vs best foundation model /yr |
|---|---|---|---|
| per MW-year | 1 MW (smallest C&I unit) | +$47k | +$26k |
| typical utility BESS | 100 MW per system | +$4.7M | +$2.6M |
| big utility fleet | 1 GW deployed (PG&E-class) | +$47M | +$26M |
| US grid-scale storage EOY 2025 | ~50 GW (EIA) | +$2.4B | +$1.3B |
| 2030 projected (EIA mid-case) | ~200 GW | +$9.4B | +$5.2B |

These numbers are a floor, not a ceiling. Real grid-scale BESS revenue stacks 4–5 streams; we modeled only the 2 least forecast-sensitive ones:

| # | revenue stream | modeled? | forecast-sensitive? |
|---|---|---|---|
| 1 | energy arbitrage ($/kWh price-curve trading) | ✓ yes, ERCOT 2024 hub prices | medium |
| 2 | demand-charge mgmt ($/kW-month peak shaving) | ✓ yes, $15/kW-mo midpoint | medium |
| 3 | capacity / RA payments ($/kW-month for availability) | ✗ no | high (mis-bid = penalty) |
| 4 | frequency regulation / ancillary services | ✗ no | high (often dominates ERCOT BESS revenue) |
| 5 | wholesale energy in ancillary-cleared hours | ✗ no | high |

Items 3+4 typically exceed arbitrage in ERCOT today and are more forecast-sensitive (RegUp mis-bids penalize hard, where arbitrage just gets a worse fill). So GridFloor's true uplift across the full revenue stack is larger than the table — by how much is the next thing to model.

Operational takeaway in absolute dollars: GridFloor adds ~$25k (2h) / ~$47k (4h) / ~$49k (8h) per MW-year over persistence on the 2 revenue streams we modeled, holding roughly constant in absolute dollars as duration grows even as percent-of-perfect collapses. At US grid-scale fleet size that's ~$2.4B/year over persistence and ~$1.3B/year over the best foundation model — and that's before adding the 3 unmodeled revenue streams where forecast accuracy matters more.
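
The fleet-level figures are a straight multiplication of the 4h per-MW uplift by installed capacity. The sketch below reproduces that arithmetic; it assumes purely linear scaling (no price impact from many batteries chasing the same spread), and the function and constant names are hypothetical.

```python
def fleet_uplift_usd(uplift_per_mw_year, fleet_mw):
    """Linear scale-up: dollars per MW-year times installed MW (an assumption)."""
    return uplift_per_mw_year * fleet_mw

UPLIFT_VS_PERSIST = 47_000   # $/MW-yr at 4h, GridFloor minus persist-24h
UPLIFT_VS_BEST_FM = 26_000   # $/MW-yr at 4h, GridFloor minus Chronos-2 fine-tuned

# Fleet sizes in MW, taken from the scope table in this section.
FLEETS_MW = {
    "per MW-year": 1,
    "typical utility BESS": 100,
    "big utility fleet": 1_000,
    "US grid-scale storage EOY 2025": 50_000,
    "2030 projected": 200_000,
}

rows = {
    scope: (
        fleet_uplift_usd(UPLIFT_VS_PERSIST, mw),
        fleet_uplift_usd(UPLIFT_VS_BEST_FM, mw),
    )
    for scope, mw in FLEETS_MW.items()
}
for scope, (vs_persist, vs_fm) in rows.items():
    print(f"{scope:32s} ${vs_persist:>14,.0f}  ${vs_fm:>14,.0f}")
```

At ~50 GW this gives $2.35B and $1.3B per year, which round to the ~$2.4B and ~$1.3B quoted above.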

Glossary

| term | what it is |
|---|---|
| BA (balancing authority) | Entity that physically balances supply & demand on a chunk of the US grid. ~66 total. Examples: CISO (California ISO), PJM, ERCO (Texas), MISO, BPAT (Pacific NW), DUK (Carolinas), LDWP (LA Dept. of Water & Power). |
| EIA-930 | The federal hourly grid feed. For every BA, every hour: demand, the BA's own published day-ahead forecast, generation, fuel mix, interchange. The source dataset. |
| day-ahead load forecast | Prediction made today for tomorrow's hourly demand. The series operators actually schedule generation against. |
| MAPE | Mean Absolute Percentage Error. "1% MAPE" means the average hourly miss is 1% of the true value. |
| incumbent forecast | Each BA's own published day-ahead forecast (also in EIA-930). The baseline that matters: what utilities actually dispatch against today. |
| held-out BAs | 8 BAs (SWPP, DUK, BPAT, TVA, AZPS, NEVP, PSEI, LDWP) the model never sees in training. The headline 1.02% is on 7; LDWP is excluded because of EIA-930 telemetry dropouts (~0.13% of hours show bogus near-zero readings; we document it + ship a cleaning rule, but keep it out of the headline so a dataset artifact doesn't contaminate the metric). |
| iTransformer | A normal time-series transformer (Informer, PatchTST, vanilla) treats each time step, or patch of time steps, as a token and uses attention to model dependencies along time within one series. The iTransformer (Liu et al. 2024) inverts this: each variate (for us, each BA's whole 168h history embedded into one vector) is a token, and attention models cross-BA dependencies (CAISO ↔ ERCOT ↔ PJM …). For load forecasting that's the dominant signal: heatwaves move across regions, weekend dips hit everyone at once. Foundation models miss it because they're univariate. |
| perm-invariant iTransformer | Our recipe: order-of-BAs doesn't matter, so the trained model generalizes to BAs it never saw. |
| foundation model | Large pre-trained time-series model (Chronos-2, Sundial, TimesFM, …). Univariate by design: sees one series at a time. The structural reason they can't compete here. |
| floor E∞ | The irreducible error this architecture asymptotes to as model size and data grow. Our 18-cell sweep fits ≈ 0.85% under a Chinchilla-style additive form, with the best observed cell at 1.061%. Treat as directional: the sweep doesn't span enough decades in N and D to calibrate a real scaling law. |
| dispatch value | How much money / CO₂ a forecast saves a battery operator who schedules charge/discharge against it. Measured as a fraction of perfect-foresight value above 24h-persistence. |
| FERC (Federal Energy Regulatory Commission) | US agency that regulates interstate wholesale electricity (and gas) markets: approves ISO rules, sets wholesale market structure, requires data filings. "FERC Order 841" is the 2018 ruling that forced ISOs to let battery storage participate in wholesale energy + ancillary markets, which is why 4-hour became the de-facto reference battery duration. |
| ERCOT (Electric Reliability Council of Texas) | One of the 7 US ISOs and the only one whose grid is electrically isolated from the rest of the country (the "Texas Interconnection"). Hourly peak load of ~85 GW, which is large but below PJM and MISO. Runs an aggressive energy-only market with the deepest battery-storage deployment in the US (~15 GW EOY 2025). Used as our arbitrage price source because their day-ahead market data is the most publicly accessible. |
| LMP (Locational Marginal Price) | The wholesale electricity price at a specific node or zone of an ISO grid, published hour-by-hour by the day-ahead and real-time markets. Varies by location because of transmission congestion and line losses. HB_BUSAVG (Hub Bus Average) is ERCOT's average across its hub nodes, which we use as a representative system-wide arbitrage signal in the dispatch sim. |
| FERC Form 714 | Annual FERC filing where every US balancing authority + planning area reports its hourly demand history. The only public source of utility-level hourly load data outside EIA-930 (which only starts in 2015). We extracted 2010–2024 hourly demand from this dataset to extend the training corpus by +29%. |
| XBRL (eXtensible Business Reporting Language) | XML-based standard the SEC and FERC use for structured business + utility filings. FERC requires Form 714 in XBRL since 2010; we wrote a parser to pull hourly demand out of the XML tree so the data could be added to training. The only data source that empirically moved MAPE on hold-ex-LDWP (−0.014pp from +29% data, matching the scaling sweep's β ≈ 0.5 direction). |
| multi-seed / paired bootstrap | Confirming a "win" by running multiple seeds and requiring the CI on the delta to exclude zero. Killed several apparent wins (e.g. BA-mixup). |
| SHIP / KEEP / PARTIAL / DROP | Verdict tags on every experiment. SHIP = adopted, KEEP = real but minor, PARTIAL = improved but below gate, DROP = negative or within noise. |

Why this matters in the real world

It's the metric utilities actually run against. Grid operators commit generation a day ahead against the load forecast. On a 100 GW grid, 9.04% MAPE is an average hourly miss of ~9 GW and 1.02% is ~1 GW, so the gap is ~8 GW less missed per hour, the output of several large power plants. Lower error means fewer reserve plants kept "just in case" → lower cost, lower emissions. The dispatch finding shows the gain isn't academic: 86–89% of the value perfect foresight would capture, above the 24h-persistence floor, under both peak-shaving and price arbitrage.

It bounds the field. The fitted floor at E∞ ≈ 0.85% is directional rather than calibrated, but it suggests that extra parameters or epochs will not push error lower. The remaining gap (roughly 0.1–0.2pp between 1.02% and the fitted floor) is a data-acquisition question, not a modeling one, and same-distribution data is nearly exhausted. So the honest next direction is structural (sub-BA zonal expansion, renewable-gen as a parallel target) rather than bigger models.

What's still open