a short read for someone new to grid forecasting · what the problem is, what we found, and the vocabulary
The US grid is split into ~66 control areas called balancing authorities (CAISO, PJM, ERCOT, …). Each has to match electricity generation to electricity demand in real time. To schedule generators economically, they need to know tomorrow's hourly demand today — a day-ahead load forecast. Errors are expensive both ways: under-forecast and you scramble for peaking plants (or shed load); over-forecast and you hold costly spinning reserves that burn fuel for nothing. Even a fraction of a percent of error compounds into millions of dollars and megatons of CO₂ across a year.
| finding | what it says |
|---|---|
| GridFloor SOTA (multi-seed) | 1.02% MAPE, near the empirical floor from our scaling sweep (best observed cell 1.061%; fitted asymptote ≈ 0.85% under a Chinchilla-style additive form, which is directional, not a calibrated CI). |
| Foundation models plateau | Chronos-2, Sundial, TabPFN-TS all stuck at 3–4%. They're univariate; can't encode cross-BA structure. |
| Accuracy → operational value | The forecast captures 86–89% of perfect-foresight battery-dispatch value above 24h-persistence, under both peak-shaving and price arbitrage. |
| Only data moves it | Model size + epochs are dead levers; data is the only one with slope, and same-distribution data is nearly exhausted. |
The model architecture is in the iTransformer family, but the substantive contributions live around it. The result number depends on every one of these:
The model alone is ~200 lines of PyTorch. The other 90% of the work is what makes the 1.02% defensible.
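The variate-as-token idea behind the iTransformer family can be sketched in a few lines. This is a toy in plain Python, not the project's PyTorch model: the 3-dim tokens, the identity projections, and the `self_attention` helper are all illustrative assumptions. It only shows that attention over BA tokens with no positional encoding gives each BA the same output whatever order the BAs are listed in.

```python
import math

def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections and no positional encoding."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        outputs.append([sum(w * value[j] for w, value in zip(weights, tokens)) / norm
                        for j in range(dim)])
    return outputs

# One toy token per BA. A real model would embed each BA's 168h history instead.
ba_tokens = {
    "CISO": [0.9, 0.1, 0.3],
    "ERCO": [0.2, 0.8, 0.5],
    "PJM": [0.4, 0.4, 0.9],
}

order_a = ["CISO", "ERCO", "PJM"]
order_b = ["PJM", "CISO", "ERCO"]

# Run the same attention over two different BA orderings, then key results by BA.
out_a = dict(zip(order_a, self_attention([ba_tokens[b] for b in order_a])))
out_b = dict(zip(order_b, self_attention([ba_tokens[b] for b in order_b])))

# Each BA's output matches across orderings, up to floating-point summation order.
same = all(
    math.isclose(x, y, abs_tol=1e-9)
    for ba in ba_tokens
    for x, y in zip(out_a[ba], out_b[ba])
)
```

Adding a positional encoding over the BA axis would break this property, which is why dropping it is one plausible route to generalizing across BAs the model never saw.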
The headline metric is MAPE on observed-positive hours, macro-averaged across the 7 held-out BAs, on a one-shot forward window (Nov 16 2024 + 2025-Feb–Dec). 8 BAs are entirely absent from training, so the metric measures genuine out-of-distribution generalization — not memorization on data the model has seen.
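A minimal sketch of that metric, assuming per-BA lists of hourly actuals and forecasts. The function names and the toy numbers are made up for illustration and are not the project's evaluation code or real EIA-930 data.

```python
from statistics import mean

def mape_positive_hours(actual, forecast):
    """MAPE in percent, scored only on hours where observed load is positive."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a > 0]
    if not pairs:
        raise ValueError("no observed-positive hours to score")
    return 100.0 * mean(abs(f - a) / a for a, f in pairs)

def macro_mape(per_ba):
    """Average the per-BA MAPEs so each BA counts equally, whatever its size."""
    return mean(mape_positive_hours(a, f) for a, f in per_ba.values())

# Toy data: (actual, forecast) per BA. The zero reading in BA_LARGE stands in
# for a bogus near-zero telemetry hour and is skipped by the positive-hours filter.
toy = {
    "BA_LARGE": ([1000.0, 1100.0, 0.0, 1200.0], [1010.0, 1089.0, 5.0, 1188.0]),
    "BA_SMALL": ([50.0, 60.0, 40.0, 50.0], [52.0, 57.0, 40.0, 51.0]),
}

headline = macro_mape(toy)  # mean of 1.0% (large BA) and 2.75% (small BA)
```

Macro-averaging means a small BA with a worse forecast pulls the headline up as much as a large one would, so the number cannot be carried by the biggest grids alone.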
We ran an 18-cell scaling sweep at 4 model sizes × 4 data scales (16 cells at 20 epochs) + 2 compute-corrected 40-epoch cells. To each set of 18 MAPE numbers we fit the Chinchilla-style additive form MAPE(N, D) = E∞ + A·N⁻ᵅ + B·D⁻ᵝ. The fit gives E∞ ≈ 0.85% (range 0.84–0.91 across slice fits) — a suggestive floor, not a calibrated one. The shape of the surface is what's load-bearing here, not the parameter values.
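For readers who want to see what fitting that form involves, here is one way to do it with the standard library only. It is a sketch under stated assumptions, not the fitting procedure used for the sweep: for each candidate (α, β) on a grid the model is linear in (E∞, A, B), so it solves ordinary least squares and keeps the best pair. The cells are synthetic, generated from made-up parameters in arbitrary units, purely to check that the fit recovers them.

```python
import itertools

def solve_linear(a, b):
    """Solve a small dense system a·x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_additive_form(cells, exponent_grid):
    """cells: list of (N, D, mape). Returns (sse, e_inf, a, alpha, b, beta)."""
    best = None
    ys = [y for _, _, y in cells]
    for alpha, beta in itertools.product(exponent_grid, repeat=2):
        rows = [(1.0, n ** -alpha, d ** -beta) for n, d, _ in cells]
        # Normal equations (X^T X) theta = X^T y for theta = (e_inf, a, b).
        xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
        xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
        e_inf, a, b = solve_linear(xtx, xty)
        sse = sum((e_inf + a * r[1] + b * r[2] - y) ** 2 for r, y in zip(rows, ys))
        if best is None or sse < best[0]:
            best = (sse, e_inf, a, alpha, b, beta)
    return best

# Synthetic 4 x 4 grid from made-up parameters (NOT the sweep's numbers).
TRUE = {"e_inf": 0.85, "a": 0.10, "alpha": 0.3, "b": 0.15, "beta": 0.5}
sizes = [0.5, 1.0, 2.0, 4.0]    # model size N, arbitrary units
scales = [0.25, 0.5, 1.0, 2.0]  # data scale D, arbitrary units
cells = [
    (n, d, TRUE["e_inf"] + TRUE["a"] * n ** -TRUE["alpha"] + TRUE["b"] * d ** -TRUE["beta"])
    for n in sizes
    for d in scales
]

grid = [i / 10 for i in range(1, 11)]  # candidate exponents 0.1 .. 1.0
sse, e_inf, a, alpha, b, beta = fit_additive_form(cells, grid)
```

On noise-free synthetic cells the fit lands on the generating parameters. With real, noisy cells spanning less than a decade in N and D, many (α, β) pairs fit almost equally well, which is one reason to read a fitted E∞ as directional.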
Each line is one model size across data scales · ★ = 40-epoch (compute-corrected) cell · dashed red = fitted floor E∞ ≈ 0.85% — directional, not a calibrated CI.
Two readings the picture supports (qualitatively, not quantitatively):
The "does accuracy matter" claim is much stronger if it holds under two different uses of the forecast — they stress it in different ways. We test both:
| mode | what the battery does | needs from forecast | persistence floor | forecast captures above floor |
|---|---|---|---|---|
| peak-shaving | discharge during the daily demand peak to chop the monthly demand-charge bill (real money for industrial customers + utilities deferring transmission upgrades) | which hour will be the peak — a single hour that moves day-to-day with weather | ~51% — persistence misses shifted peaks (heat dome shifts peak from 5pm to 7pm; persistence still discharges at 5pm) | 88.6% of perfect-foresight value |
| price arbitrage | buy electricity cheap, sell expensive: ride wholesale LMP swings (a multi-billion-dollar real industry; ERCOT alone has ~15 GW of batteries chasing it) | shape of tomorrow's whole-day hourly price curve (since price tracks load, good load forecast → good price forecast) | ~75%: daily price cycle is fairly regular, even repeat-yesterday captures the shape | 85.9% of perfect-foresight value |
Different sensitivities: peak-shaving lives or dies on one hour's timing; arbitrage lives on the whole-day shape. If the forecast captured value under only one we'd have to caveat ("accuracy pays for this use case"). It captures 86–89% under both, so the claim "MAPE translates to operational value" is robust to which dispatch problem you actually care about.
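The peak-shaving sensitivity is easy to see on a single toy day. The sketch below is a deliberately simplified stand-in for the dispatch sim: it assumes a battery that discharges a fixed power in the top-k hours of whatever signal it is given, and it scores peak reduction in MW rather than demand-charge dollars. All numbers and function names are made up.

```python
def peak_shave_value(actual, signal, power_mw, hours):
    """Peak reduction (MW) from discharging in the `hours` highest hours of `signal`."""
    ranked = sorted(range(len(signal)), key=lambda h: signal[h], reverse=True)
    chosen = set(ranked[:hours])
    net = [load - power_mw if h in chosen else load for h, load in enumerate(actual)]
    return max(actual) - max(net)

def capture(actual, signal, power_mw=5.0, hours=2):
    """Share of the perfect-foresight peak reduction that `signal` achieves."""
    perfect = peak_shave_value(actual, actual, power_mw, hours)
    return peak_shave_value(actual, signal, power_mw, hours) / perfect

def day(bumps, base=60.0):
    """Build a 24-hour load profile: flat base with a few overridden hours."""
    return [bumps.get(h, base) for h in range(24)]

yesterday = day({16: 96.0, 17: 100.0, 18: 94.0})  # peak at 5pm
today = day({17: 92.0, 18: 97.0, 19: 100.0})      # heat dome: peak moves to 7pm
forecast = day({17: 93.0, 18: 96.0, 19: 99.0})    # right timing, small level error

persistence_capture = capture(today, yesterday)  # discharges at 4pm and 5pm
forecast_capture = capture(today, forecast)      # discharges at 6pm and 7pm
```

On this one day persistence captures nothing, because it discharges before the shifted peak, while a forecast with the right timing captures everything despite being off in level. The ~51% floor in the table is what that looks like averaged over many days, most of which have no shift.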
We ran the dispatch sim across the foundation-model forecasts we'd already trained, at three battery durations, and converted to dollars per MW of installed battery per year. Tariff: $15/kW-month demand charge (midpoint of PG&E E-19 and ConEd SC9-II). Arbitrage: real ERCOT-2024 day-ahead-market hub prices. Why three durations: 2h dominates installed C&I storage today, 4h is the FERC Order 841 / CAISO RA reference, 8h is the long-duration / utility-scale regime.
| signal | 2h total $/MW-yr | 4h total $/MW-yr | 8h total $/MW-yr | Δ vs persist-24h (4h) |
|---|---|---|---|---|
| perfect foresight (ceiling) | $93k | $184k | $246k | — |
| GridFloor (ours) | $84k | $177k | $242k | +$47k |
| Chronos-2 fine-tuned | $73k | $151k | $221k | +$21k |
| Chronos-2 zero-shot | $69k | $149k | $220k | +$19k |
| Sundial | $58k | $128k | $209k | −$2k |
| persist-24h | $58k | $130k | $193k | — |
| persist-168h | $58k | $126k | $188k | −$4k |
Bench: 49 EIA-930 BAs ex-LDWP · 2025 Feb-Dec · 15,974 BA-days × 3 durations · 10% of mean BA demand · branch feature/dispatch-bench.
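The Δ column is plain subtraction, and the same arithmetic extends to the other durations. A small sketch with the table's rounded totals transcribed by hand (the dictionary and function names are just for illustration):

```python
# Total $k per MW-year, transcribed from the table: (2h, 4h, 8h).
TOTALS_K = {
    "perfect foresight": (93, 184, 246),
    "GridFloor": (84, 177, 242),
    "Chronos-2 fine-tuned": (73, 151, 221),
    "Chronos-2 zero-shot": (69, 149, 220),
    "Sundial": (58, 128, 209),
    "persist-24h": (58, 130, 193),
    "persist-168h": (58, 126, 188),
}
DURATIONS = ("2h", "4h", "8h")

def uplift_vs(baseline, signal):
    """Per-duration difference in $k per MW-year between a signal and a baseline."""
    return tuple(s - b for s, b in zip(TOTALS_K[signal], TOTALS_K[baseline]))

# Uplift of every signal over 24h persistence, keyed by duration.
vs_persistence = {
    name: dict(zip(DURATIONS, uplift_vs("persist-24h", name)))
    for name in TOTALS_K
    if name != "persist-24h"
}

# GridFloor against the strongest foundation-model row in the table.
vs_best_fm = dict(zip(DURATIONS, uplift_vs("Chronos-2 fine-tuned", "GridFloor")))
```

Because the table is rounded to the nearest $1k, differences computed from it can be off by about $1k from figures computed on unrounded totals.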
$/MW-year is the academic-standard unit but it's the wrong unit for "is this a lot." Anchored to actual fleet sizes at the 4-hour reference:
| scope | scale | vs persist-24h /yr | vs best foundation model /yr |
|---|---|---|---|
| per MW-year | 1 MW (smallest C&I unit) | +$47k | +$26k |
| typical utility BESS | 100 MW per system | +$4.7M | +$2.6M |
| big utility fleet | 1 GW deployed (PG&E-class) | +$47M | +$26M |
| US grid-scale storage EOY 2025 | ~50 GW (EIA) | +$2.4B | +$1.3B |
| 2030 projected (EIA mid-case) | ~200 GW | +$9.4B | +$5.2B |
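The fleet rows are a linear scale-up of the per-MW figure. That carries an assumption worth making explicit: every installed MW earns the same uplift as the benchmark's 4-hour battery. A sketch of the arithmetic, with names chosen for illustration:

```python
# 4-hour uplift in dollars per MW-year, from the first row of the table.
PER_MW_YEAR_USD = {
    "vs persist-24h": 47_000,
    "vs best foundation model": 26_000,
}

# Fleet sizes in MW, from the "scale" column.
FLEETS_MW = {
    "per MW-year": 1,
    "typical utility BESS": 100,
    "big utility fleet": 1_000,
    "US grid-scale storage EOY 2025": 50_000,
    "2030 projected": 200_000,
}

def fleet_uplift_usd(per_mw_year_usd, fleet_mw):
    """Annual uplift in dollars, assuming uplift scales linearly with installed MW."""
    return per_mw_year_usd * fleet_mw

fleet_table = {
    scope: {name: fleet_uplift_usd(usd, mw) for name, usd in PER_MW_YEAR_USD.items()}
    for scope, mw in FLEETS_MW.items()
}
```

The ~50 GW row works out to $2.35B per year over persistence, which the table rounds to $2.4B.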
| revenue stream | modeled? | forecast-sensitive? |
|---|---|---|
| energy arbitrage ($/kWh price-curve trading) | ✓ yes — ERCOT 2024 hub prices | medium |
| demand-charge mgmt ($/kW-month peak shaving) | ✓ yes — $15/kW-mo midpoint | medium |
| capacity / RA payments ($/kW-month for availability) | ✗ no | high (mis-bid = penalty) |
| frequency regulation / ancillary services | ✗ no | high (often dominates ERCOT BESS revenue) |
| wholesale energy in ancillary-cleared hours | ✗ no | high |
Items 3+4 typically exceed arbitrage in ERCOT today and are more forecast-sensitive (RegUp mis-bids penalize hard, where arbitrage just gets a worse fill). So GridFloor's true uplift across the full revenue stack is larger than the table — by how much is the next thing to model.
Operational takeaway in absolute dollars: GridFloor adds ~$26k (2h) / ~$47k (4h) / ~$49k (8h) per MW-year over persistence on the 2 revenue streams we modeled, holding roughly constant in absolute dollars as duration grows even as percent-of-perfect collapses. At US grid-scale fleet size that's ~$2.4B/year over persistence and ~$1.3B/year over the best foundation model, and that's before adding the 3 unmodeled revenue streams where forecast accuracy matters more.
| term | what it is |
|---|---|
| BA (balancing authority) | Entity that physically balances supply & demand on a chunk of the US grid. ~66 total. Examples: CISO (California ISO), PJM, ERCO (Texas), MISO, BPAT (Pacific NW), DUK (Carolinas), LDWP (LA Dept. of Water & Power). |
| EIA-930 | The federal hourly grid feed. For every BA, every hour: demand, the BA's own published day-ahead forecast, generation, fuel mix, interchange. The source dataset. |
| day-ahead load forecast | Prediction made today for tomorrow's hourly demand. The series operators actually schedule generation against. |
| MAPE | Mean Absolute Percentage Error. "1% MAPE" means the average hourly miss is 1% of the true value. |
| incumbent forecast | Each BA's own published day-ahead forecast (also in EIA-930). The baseline that matters — what utilities actually dispatch against today. |
| held-out BAs | 8 BAs (SWPP, DUK, BPAT, TVA, AZPS, NEVP, PSEI, LDWP) the model never sees in training. The headline 1.02% is on 7 — LDWP is excluded because of EIA-930 telemetry dropouts (~0.13% of hours show bogus near-zero readings; we document it + ship a cleaning rule, but keep it out of the headline so a dataset artifact doesn't contaminate the metric). |
| iTransformer | A normal time-series transformer (Informer, PatchTST, vanilla) treats each time step as a token and uses attention to model hour-to-hour dependencies within one series. The iTransformer (Liu et al. 2024) inverts this: each variate — for us, each BA's whole 168h history embedded into one vector — is a token, and attention models cross-BA dependencies (CAISO ↔ ERCOT ↔ PJM …). For load forecasting that's the dominant signal: heatwaves move across regions, weekend dips hit everyone at once. Foundation models miss it because they're univariate. |
| perm-invariant iTransformer | Our recipe: order-of-BAs doesn't matter, so the trained model generalizes to BAs it never saw. |
| foundation model | Large pre-trained time-series model (Chronos-2, Sundial, TimesFM, …). Univariate by design — sees one series at a time. The structural reason they can't compete here. |
| floor E∞ | The irreducible error this architecture asymptotes to as model size and data grow. Our 18-cell sweep fits ≈ 0.85% under a Chinchilla-style additive form, with the best observed cell at 1.061%. Treat as directional — the sweep doesn't span enough decades in N and D to calibrate a real scaling law. |
| dispatch value | How much money / CO₂ a forecast saves a battery operator who schedules charge/discharge against it. Measured as a fraction of perfect-foresight value above 24h-persistence. |
| FERC (Federal Energy Regulatory Commission) | US agency that regulates interstate wholesale electricity (and gas) markets — approves ISO rules, sets wholesale market structure, requires data filings. "FERC Order 841" is the 2018 ruling that forced ISOs to let battery storage participate in wholesale energy + ancillary markets, which is why 4-hour became the de-facto reference battery duration. |
| ERCOT (Electric Reliability Council of Texas) | One of the 7 US ISOs and the only one whose grid is electrically isolated from the rest of the country (the "Texas interconnect"). Hourly peak load of ~85 GW (large for a single-state grid, though PJM and MISO peak higher). Runs an aggressive energy-only market with the deepest battery-storage deployment in the US (~15 GW EOY 2025). Used as our arbitrage price source because their day-ahead market data is the most publicly accessible. |
| LMP (Locational Marginal Price) | The wholesale electricity price at a specific node or zone of an ISO grid, published hour-by-hour by the day-ahead and real-time markets. Varies by location because of transmission congestion and line losses. HB_BUSAVG (Hub Bus Average) is ERCOT's average across its hub nodes — what we use as a representative system-wide arbitrage signal in the dispatch sim. |
| FERC Form 714 | Annual FERC filing where every US balancing authority + planning area reports its hourly demand history. The only public source of utility-level hourly load data outside EIA-930 (which only starts in 2015). We extracted 2010–2024 hourly demand from this dataset to extend the training corpus by +29%. |
| XBRL (eXtensible Business Reporting Language) | XML-based standard the SEC and FERC use for structured business + utility filings. FERC has required Form 714 in XBRL since 2021; we wrote a parser to pull hourly demand out of the XML tree so the data could be added to training. Form 714 is the only data source that empirically moved MAPE on hold-ex-LDWP (−0.014pp from +29% data, matching the scaling sweep's β ≈ 0.5 direction). |
| multi-seed / paired bootstrap | Confirming a "win" by running multiple seeds and requiring the CI on the delta to exclude zero. Killed several apparent wins (e.g. BA-mixup). |
| SHIP / KEEP / PARTIAL / DROP | Verdict tags on every experiment. SHIP = adopted, KEEP = real but minor, PARTIAL = improved but below gate, DROP = negative or within noise. |
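The paired-bootstrap gate from the glossary can be sketched as follows. This is one common way to do it, not necessarily the project's exact procedure: it assumes the paired units are matched MAPE scores (for example, the same seed and BA under baseline and candidate), uses a percentile interval, and treats "CI excludes zero" as "the whole interval sits below zero" because lower MAPE is better. All names and numbers are illustrative.

```python
import random

def paired_bootstrap_ci(baseline, candidate, n_resamples=10_000, level=0.95, seed=0):
    """Percentile CI on mean(candidate - baseline), resampling the paired deltas."""
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must be paired one-to-one")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    low_index = round((1 - level) / 2 * n_resamples)
    high_index = round((1 + level) / 2 * n_resamples) - 1
    return means[low_index], means[high_index]

def is_real_win(baseline, candidate):
    """True only if the entire CI on the MAPE delta is below zero."""
    _, upper = paired_bootstrap_ci(baseline, candidate)
    return upper < 0.0

# Toy MAPE scores (percent) for five paired runs. Not real results.
baseline_runs = [1.10, 1.08, 1.12, 1.09, 1.11]
clear_win = [1.02, 1.03, 1.01, 1.04, 1.02]       # better on every pair
within_noise = [1.11, 1.07, 1.11, 1.10, 1.10]    # better on three, worse on two

clear_verdict = is_real_win(baseline_runs, clear_win)
noise_verdict = is_real_win(baseline_runs, within_noise)
```

The second candidate has a slightly lower mean MAPE than the baseline, yet its interval straddles zero, which is the kind of apparent win this gate is meant to kill.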
It's the metric utilities actually run against. Grid operators commit generation a day ahead against the load forecast. On a 100 GW grid, each percentage point of MAPE is ~1 GW of average hourly miss, about one large power plant, so a 1.02 vs 9.04 gap is ~8 GW less missed per hour. Lower error means fewer reserve plants kept "just in case" → lower cost, lower emissions. The dispatch finding shows the gain isn't academic: 86–89% of the value perfect foresight would capture, above the 24h-persistence floor, under both peak-shaving and price arbitrage.
It bounds the field. The scaling-law floor at E∞ ≈ 0.85% suggests that extra parameters or epochs alone won't push below it. The remaining ~0.17pp (1.02% vs 0.85%) is a data-acquisition question, not a modeling one, and same-distribution data is nearly exhausted. So the honest next direction is structural (sub-BA zonal expansion, renewable-gen as a parallel target) rather than bigger models.