a comparative survey · ~25 external / operational results, web-verified · honest about what's actually comparable
"State of the art" is meaningless without a shared evaluation. The load-forecasting literature reports MAPE across wildly different test sets, regions, horizons, years, and even metric definitions — so a raw number-vs-number ranking would be apples-to-oranges. Below, every entry is tagged by how directly comparable it is to our exact protocol (EIA-930, 7 of 8 held-out US BAs — "hold-ex-LDWP" — 2025-Feb–Dec, day-ahead, one-shot, MAPE on observed>0).
Before staring at the table: no single number is "the SOTA" for load forecasting because the literature lives in four overlapping communities, and each picks its metric to answer a different question. Knowing which family a paper belongs to tells you most of what you need to read its numbers correctly.
| family | typical metric | what it answers | why this metric |
|---|---|---|---|
| Operations / utilities (EIA-930, ISO reports, FERC, NREL) | MAPE, peak-period MAPE, $/MW-yr | "How much money or dispatch error will this cost us?" | MAPE is scale-free and intuitive ("we miss by 1%"). Operators care about percent error on actual operating quantities. Breaks at y≈0: fine for total BA load, unreliable for net load and solar. |
| Forecasting / stats (Hyndman lineage, M-competition, IJF) | MASE, sMAPE | "Is this model generally better than a sensible naive baseline, averaged across many series?" | MASE = your error ÷ seasonal-naive error on the same series. Designed for averaging across series of wildly different scale, which is the only thing that makes sense when benchmarking on 42 datasets. |
| Deep-learning forecasting (ETT / ECL / Traffic benchmarks) | MSE / MAE | "Did my architecture move the leaderboard on the established benchmark?" | Inherited from vision/NeurIPS-style benchmarks. Nobody thinks MSE is "right" for load; it persists through leaderboard inertia, so that numbers stay comparable across architecture papers (Informer → Autoformer → PatchTST → iTransformer). |
| Probabilistic forecasting (DeepAR, Chronos, Lag-Llama, GEFCom2014) | Pinball loss / WQL / CRPS | "How well-calibrated is the whole predictive distribution, not just the point?" | If your model outputs a distribution, point MAPE scores only its center and ignores the spread. These metrics score the full quantile/density. Different question entirely. |
The tier system below ("self / yes / partial / no") isn't ducking comparison — it's reflecting that a foundation-model paper reporting MASE on 42 datasets and a utility paper reporting MAPE on one ISO are not asking the same question. Forcing a single ranking number across all four would mean conflating four optimization targets.
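The y≈0 fragility noted in the operations row is easy to see on toy numbers. The sketch below is illustrative only: the data are synthetic (not EIA-930), the function names are ours, and nMAE is assumed to mean MAE divided by mean observed load.

```python
import numpy as np

def mape(y, yhat):
    """Mean absolute percentage error (%), scored only where observed > 0."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    mask = y > 0
    return 100.0 * np.mean(np.abs(yhat[mask] - y[mask]) / y[mask])

def nmae(y, yhat):
    """MAE divided by mean observed value (%). Assumed definition of nMAE."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return 100.0 * np.mean(np.abs(yhat - y)) / np.mean(y)

# Synthetic series: the forecast misses by the same 10 MW on every hour.
total_load = np.array([1000.0, 1200.0, 900.0, 1100.0])
net_load = np.array([1000.0, 1200.0, 5.0, 1100.0])  # one hour close to zero

mape_total = mape(total_load, total_load + 10.0)  # about 1%
mape_net = mape(net_load, net_load + 10.0)        # dominated by the near-zero hour
nmae_net = nmae(net_load, net_load + 10.0)        # stays near 1%

print(f"MAPE total load: {mape_total:.2f}%")
print(f"MAPE net load:   {mape_net:.2f}%")
print(f"nMAE net load:   {nmae_net:.2f}%")
```

The absolute miss is identical in all three cases; only the denominator changes. A single near-zero hour is enough to make MAPE say very little about forecast quality, which is why the net-load and solar literature tends to avoid it.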
What we did about it. Since the metric is a post-hoc choice on a fixed prediction file, we rescored GridFloor's own predictions in every family's units below — that converts "no comparison possible" into "comparison possible but with caveats."
Same predictions, same hold-ex-LDWP slice, recomputed under the metrics other papers use. This at least lets us place a GridFloor number alongside Chronos / TimesFM / Moirai numbers, with explicit honesty about what's incomparable.
| metric | GridFloor | family | what it tells us |
|---|---|---|---|
| MAPE (%) | 1.02 | operations | headline number; what utilities care about |
| median APE (%) | 0.66 | operations | typical-day error; lower than MAPE means tail days drive the mean |
| sMAPE (%) | 1.04 | operations / stats | symmetric variant; near-identical to MAPE here (no extreme bias) |
| MASE (m=24) | 0.19 | forecasting / stats | error is roughly 5× lower than seasonal-naive (1 / 0.19 ≈ 5.3) |
| nMAE (%) | 0.98 | operations | MAPE-like without the y→0 fragility |
| WRMSE (%) | 2.00 | operations | RMSE/mean; sensitive to tail hours (NEVP alone reaches 5.71%) |
| WQL (q=.1/.5/.9) | 0.0067 | probabilistic | placeholder — GridFloor is point-only; this is the post-hoc symmetric envelope |
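Rescoring a fixed prediction file under several metric families can be sketched as follows. This is a minimal illustration, not GridFloor's scoring code. It makes three assumptions: the MASE scale is the seasonal-naive MAE (lag m) computed on the same evaluation window, nMAE and WRMSE are normalized by mean observed load, and WQL follows the common convention of 2 × summed pinball loss ÷ summed |y|, averaged over quantile levels. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def rescore_point(y, yhat, m=24):
    """Score one point forecast under several metric families."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    abs_err = np.abs(yhat - y)
    pos = y > 0  # percent errors are only defined where observed > 0
    ape = abs_err[pos] / y[pos]
    # Assumption: scale MASE by seasonal-naive MAE (lag m) on this same window.
    naive_mae = np.mean(np.abs(y[m:] - y[:-m]))
    return {
        "mape_pct": 100.0 * ape.mean(),
        "median_ape_pct": 100.0 * np.median(ape),
        "smape_pct": 100.0 * np.mean(2.0 * abs_err / (np.abs(y) + np.abs(yhat))),
        "mase": abs_err.mean() / naive_mae,
        "nmae_pct": 100.0 * abs_err.mean() / y.mean(),
        "wrmse_pct": 100.0 * np.sqrt(np.mean((yhat - y) ** 2)) / y.mean(),
    }

def wql(y, quantile_forecasts):
    """Weighted quantile loss; quantile_forecasts maps level q -> forecast array."""
    y = np.asarray(y, dtype=float)
    losses = []
    for q, forecast in quantile_forecasts.items():
        diff = y - np.asarray(forecast, dtype=float)
        pinball = np.maximum(q * diff, (q - 1.0) * diff)
        losses.append(2.0 * pinball.sum() / np.abs(y).sum())
    return float(np.mean(losses))

# Synthetic hourly load (two weeks) with a daily cycle; not real BA data.
rng = np.random.default_rng(0)
t = np.arange(24 * 14)
y = 1000.0 + 200.0 * np.sin(2.0 * np.pi * t / 24.0) + rng.normal(0.0, 30.0, t.size)
yhat = y + rng.normal(0.0, 10.0, t.size)

scores = rescore_point(y, yhat)
# Hypothetical quantile forecasts: a fixed band around the point forecast.
scores["wql"] = wql(y, {0.1: yhat - 20.0, 0.5: yhat, 0.9: yhat + 20.0})
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Every metric is a pure function of the same `(y, yhat)` pair, so the prediction file stays fixed and only the question being asked changes. Note that the classical MASE definition scales by the naive error on the training period; a number computed on the evaluation window, as here, is not strictly interchangeable with one computed that way.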
The canonical architectures and foundation models below report MASE / WQL / pinball / nMAE / MSE on ETT / Monash / system-load datasets — strong work, but not MAPE on held-out US BAs, so they can't be ranked against us directly. Listed for completeness.