
How GridFloor compares to the literature

a comparative survey · ~25 external / operational results, web-verified · honest about what's actually comparable

"State of the art" is meaningless without a shared evaluation. The load-forecasting literature reports MAPE across wildly different test sets, regions, horizons, years, and even metric definitions, so a raw number-vs-number ranking would be apples-to-oranges. Below, every entry is tagged by how directly comparable it is to our exact protocol: EIA-930 data, 7 of 8 held-out US balancing authorities (BAs; "hold-ex-LDWP"), Feb–Dec 2025, day-ahead, one-shot, MAPE computed only where observed > 0.

Verdict. On the directly-comparable subset, GridFloor's 1.02% comes in below every external day-ahead result we could verify — under a harder setting (small, noisy held-out BAs, scored one-shot, vs the literature's larger, smoother ISO-aggregates trained per-region). The best comparable neural forecasters on the same EIA-930 data sit at ~2.4–6.4% per grid; operational ISO day-ahead forecasts cluster ~1.5–3%. We do not claim universal SOTA — the metric/aggregation incomparability makes that unfalsifiable — but the result is strong and honestly situated.
The most telling gap: we found no peer-reviewed paper that benchmarks BA-level (not ISO-aggregate) day-ahead EIA-930 load with MAPE the way we do. The closest, Hong & Lee (2026), works on ISO-aggregates. That absence is exactly why our primary baseline is each BA's own operational forecast on the identical window — not a paper.
Controlled head-to-head. Rather than just note the gap, we re-trained PatchTST and a plain iTransformer ourselves and scored all three models on one identical window (5,237 origins, L=240, their in-distribution 70/15/15 split, day-ahead W=24), which removes the cross-window confound. Faithfulness check: our PatchTST gets 3.64% against their published 3.66%, so the harness isn't a strawman. Our fairly tuned iTransformer gets 3.69%, ~1pp better than their published 4.66% (theirs was undertuned), so we compare against our stronger 3.69% and discard the inflated "win" the published figure would have given us. Result: GridFloor is lowest at 3.52% and beats both with paired bootstrap CIs excluding zero (−0.12pp vs PatchTST, −0.17pp vs iTransformer). The margins are statistically resolved but small: GridFloor is consistently the best of the three against well-tuned baselines, not a blowout.
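A paired bootstrap of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the harness used for the numbers in this survey: the function name `paired_bootstrap_ci`, the resample count, the percentile method, and the synthetic inputs are all assumptions. It resamples forecast origins with replacement, keeps each origin's two errors paired, and reports a percentile interval for the difference in mean APE.

```python
import random


def paired_bootstrap_ci(ape_a, ape_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(ape_a) - mean(ape_b), resampling origins jointly."""
    if len(ape_a) != len(ape_b):
        raise ValueError("both models must be scored on the same origins")
    # Per-origin paired difference; resampling these keeps the pairing intact.
    diffs = [a - b for a, b in zip(ape_a, ape_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return means[lo_idx], means[hi_idx]


# Synthetic per-origin APEs in percent (made-up values, not the survey's data).
gen = random.Random(1)
model_b = [3.0 + gen.random() for _ in range(200)]
model_a = [b - 0.15 + 0.1 * (gen.random() - 0.5) for b in model_b]
lo, hi = paired_bootstrap_ci(model_a, model_b)
# An interval entirely below zero means model_a's mean APE is resolvably lower.
```

Because forecast origins are autocorrelated in time, a block bootstrap over contiguous runs of origins may be the more conservative choice; the text does not say which variant was used.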

Why are there four different metrics at all?

Before staring at the table: the reason no single number is "the SOTA" for load forecasting is that the literature lives in four overlapping communities, and each picks its metric to answer a different question. Knowing which family a paper belongs to is most of how you read it.

| family | typical metric | what it answers | why this metric |
| --- | --- | --- | --- |
| Operations / utilities (EIA-930, ISO reports, FERC, NREL) | MAPE, peak-period MAPE, $/MW-yr | "How much money or dispatch error will this cost us?" | MAPE is scale-free and intuitive ("we miss by 1%"). Operators care about percent error on actual operating quantities. Breaks at y≈0: fine for total BA load, breaks for net load and solar. |
| Forecasting / stats (Hyndman lineage, M-competition, IJF) | MASE, sMAPE | "Is this model generally better than a sensible naive baseline, averaged across many series?" | MASE = your error ÷ seasonal-naive error on the same series. Designed for averaging across series of wildly different scale, the only thing that makes sense when benchmarking on 42 datasets. |
| Deep-learning forecasting (ETT / ECL / Traffic benchmarks) | MSE / MAE | "Did my architecture move the leaderboard on the established benchmark?" | Inherited from vision/NeurIPS-style benchmarks. Nobody thinks MSE is "right" for load; it's leaderboard inertia so numbers stay comparable across architecture papers (Informer → Autoformer → PatchTST → iTransformer). |
| Probabilistic forecasting (DeepAR, Chronos, Lag-Llama, GEFCom2014) | Pinball loss / WQL / CRPS | "How well-calibrated is the whole predictive distribution, not just the point?" | If your model outputs a distribution, point MAPE makes no sense because it ignores the spread. These score the full quantile/density. Different question entirely. |
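To make two of the less familiar definitions in the table concrete, here is a minimal sketch of MASE and pinball loss in plain Python. The function names, the use of a separate `history` series for the seasonal-naive scale (the common in-sample convention), and the toy numbers are assumptions for illustration; individual papers vary in how they compute the scaling term.

```python
def mase(y_true, y_pred, history, m=24):
    """Forecast MAE divided by the seasonal-naive MAE (lag m) on `history`."""
    if len(history) <= m:
        raise ValueError("history must be longer than the seasonal period m")
    naive_errors = [abs(history[t] - history[t - m]) for t in range(m, len(history))]
    naive_mae = sum(naive_errors) / len(naive_errors)
    model_mae = sum(abs(y - f) for y, f in zip(y_true, y_pred)) / len(y_true)
    return model_mae / naive_mae


def pinball_loss(y_true, q_pred, q):
    """Mean pinball (quantile) loss for a forecast of quantile level q in (0, 1)."""
    total = 0.0
    for y, f in zip(y_true, q_pred):
        diff = y - f
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


# Toy values (made up): a lag-1 naive forecast on this history has MAE 1.
history = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
score = mase([10.0, 12.0], [9.0, 13.0], history, m=1)
under = pinball_loss([10.0], [8.0], q=0.9)   # forecast too low for the 0.9 quantile
over = pinball_loss([10.0], [12.0], q=0.9)   # forecast too high for the 0.9 quantile
```

A MASE below 1 means the model beats seasonal-naive on that series. The asymmetry in pinball loss is what rewards a 0.9-quantile forecast for sitting above most observations.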

The tier system below ("self / yes / partial / no") isn't ducking comparison — it's reflecting that a foundation-model paper reporting MASE on 42 datasets and a utility paper reporting MAPE on one ISO are not asking the same question. Forcing a single ranking number across all four would mean conflating four optimization targets.

What we did about it. Since the metric is a post-hoc choice on a fixed prediction file, we rescored GridFloor's own predictions in every family's units below — that converts "no comparison possible" into "comparison possible but with caveats."

GridFloor in everyone else's units

Same predictions, same hold-ex-LDWP slice, recomputed under the metrics other papers use. This lets us at least place a GridFloor number alongside Chronos / TimesFM / Moirai numbers, while staying explicit about what remains incomparable.

| metric | GridFloor | family | what it tells us |
| --- | --- | --- | --- |
| MAPE (%) | 1.02 | operations | headline number; what utilities care about |
| median APE (%) | 0.66 | operations | typical-day error; lower than MAPE means tail days drive the mean |
| sMAPE (%) | 1.04 | operations / stats | symmetric variant; near-identical to MAPE here (no extreme bias) |
| MASE (m=24) | 0.19 | forecasting / stats | error is about one-fifth of seasonal-naive (roughly 5× better) |
| nMAE (%) | 0.98 | operations | MAPE-like without the y→0 fragility |
| WRMSE (%) | 2.00 | operations | RMSE / mean; sensitive to tail hours (NEVP drives it up to 5.71%) |
| WQL (q=.1/.5/.9) | 0.0067 | probabilistic | placeholder: GridFloor is point-only, so this scores the post-hoc symmetric envelope |
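Rescoring one fixed set of point forecasts under several metrics can be sketched like this. The function name `rescore` and the exact formulas are assumptions: sMAPE is taken as 200·|e| / (|y| + |ŷ|), nMAE as MAE / mean load, and WRMSE as RMSE / mean load (the table's "RMSE / mean"). The survey's own scoring code may differ in detail. Hours with observed load of zero or below are dropped, mirroring the "observed > 0" rule in the protocol.

```python
import math
import statistics


def rescore(y_true, y_pred):
    """Score one set of point forecasts under several metric definitions (all in %)."""
    # Keep only hours with strictly positive observed load.
    pairs = [(y, f) for y, f in zip(y_true, y_pred) if y > 0]
    if not pairs:
        raise ValueError("no hours with observed load > 0")
    errors = [abs(y - f) for y, f in pairs]
    ape = [100.0 * e / y for e, (y, _) in zip(errors, pairs)]
    smape_terms = [200.0 * e / (abs(y) + abs(f)) for e, (y, f) in zip(errors, pairs)]
    mean_load = statistics.fmean(y for y, _ in pairs)
    return {
        "mape": statistics.fmean(ape),
        "median_ape": statistics.median(ape),
        "smape": statistics.fmean(smape_terms),
        "nmae": 100.0 * statistics.fmean(errors) / mean_load,
        "wrmse": 100.0 * math.sqrt(statistics.fmean(e * e for e in errors)) / mean_load,
    }


# Toy hourly loads in MW (made-up values): one bad hour and one zero reading.
observed = [100.0, 200.0, 400.0, 100.0, 0.0]
forecast = [101.0, 198.0, 404.0, 120.0, 3.0]
scores = rescore(observed, forecast)
```

In this toy case the single bad hour pulls MAPE well above the median APE and pushes WRMSE above nMAE, the same pattern the table attributes to tail days and tail hours.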

Directly comparable — same metric, horizon, & data family

Context only — different metric / dataset (not a fair head-to-head)

The canonical architectures and foundation models below report MASE / WQL / pinball / nMAE / MSE on ETT / Monash / system-load datasets — strong work, but not MAPE on held-out US BAs, so they can't be ranked against us directly. Listed for completeness.