How GridFloor compares to the literature

a comparative survey · 32 external / operational results, web-verified · honest about what's actually comparable

"State of the art" is meaningless without a shared evaluation. The load-forecasting literature reports MAPE across wildly different test sets, regions, horizons, years, and even metric definitions, so a raw number-vs-number ranking would compare apples to oranges. Below, every entry is tagged by how directly comparable it is to our exact protocol: EIA-930 data, 7 of 8 held-out US BAs (the "hold-ex-LDWP" slice), 2025-Feb–Dec, day-ahead, one-shot, MAPE computed on hours with observed load > 0.

Verdict. On the directly comparable subset, GridFloor's 1.02% comes in below every external day-ahead result we could verify, and it does so under a harder setting: small, noisy held-out BAs scored one-shot, versus the literature's larger, smoother ISO aggregates trained per region. The best comparable neural forecasters on the same EIA-930 data sit at ~2.4–6.4% per grid; operational ISO day-ahead forecasts cluster around 1.5–3%. We do not claim universal SOTA, because the metric and aggregation incomparability makes that claim unfalsifiable, but the result is strong and honestly situated.
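
The scoring rule named in the protocol (MAPE restricted to hours with observed load > 0) can be sketched as follows. This is a minimal illustration, not GridFloor's actual harness: the function name and the toy arrays are hypothetical, and how per-BA scores are aggregated into the headline number is not specified here.

```python
import numpy as np

def mape_observed_positive(y_true, y_pred):
    """MAPE in percent, scored only on hours where observed load is > 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true > 0  # excludes zeros, negatives and NaNs (NaN > 0 is False)
    if not mask.any():
        raise ValueError("no hours with observed load > 0")
    ape = np.abs(y_pred[mask] - y_true[mask]) / y_true[mask]
    return 100.0 * ape.mean()

# Toy data (not real EIA-930 values): the second hour has observed load 0
# and is dropped instead of producing a division by zero.
observed = np.array([100.0, 0.0, 200.0, 400.0])
forecast = np.array([101.0, 5.0, 198.0, 404.0])
score = mape_observed_positive(observed, forecast)
print(f"MAPE on observed>0: {score:.2f}%")
```

The mask is what keeps a reporting gap or a zero reading from dominating the mean, which is the fragility the operations row of the metric table refers to.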

The most telling gap: we found no peer-reviewed paper that benchmarks BA-level (not ISO-aggregate) day-ahead EIA-930 load with MAPE the way we do. The closest, Hong & Lee (2026), works on ISO-aggregates. That absence is exactly why our primary baseline is each BA's own operational forecast on the identical window, not a paper.

Controlled head-to-head. Rather than just note the gap, we retrained PatchTST and a plain iTransformer ourselves and scored all three models on one identical window (5,237 origins, L=240, their in-distribution 70/15/15 split, day-ahead W=24), removing the cross-window confound. Faithfulness check: our PatchTST reproduces their published 3.66% (we get 3.64%), so the harness isn't a strawman. Our fairly tuned iTransformer gets 3.69%, ~1pp better than their published 4.66% (theirs was undertuned), so we discard the inflated "win" over the published number. Result: GridFloor is lowest at 3.52% and beats both, with paired bootstrap CIs on the difference excluding zero (−0.12pp vs PatchTST, −0.17pp vs iTransformer). The margins are statistically resolved but small: GridFloor is consistently the best of three by a slim margin against well-tuned baselines, not a blowout.
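
A paired bootstrap on the per-origin error difference could look like the sketch below. It is an assumption-laden illustration: the function name, the resample count, the percentile interval, and the synthetic scores are all hypothetical, and the text does not say whether a plain or block bootstrap was used. Forecast origins are autocorrelated, so a plain resample like this one may understate the interval width.

```python
import numpy as np

def paired_bootstrap_ci(ape_a, ape_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(ape_a) - mean(ape_b), resampling origins jointly."""
    ape_a = np.asarray(ape_a, dtype=float)
    ape_b = np.asarray(ape_b, dtype=float)
    if ape_a.shape != ape_b.shape:
        raise ValueError("need one score per shared origin for both models")
    diff = ape_a - ape_b  # pairing: the same origin is scored by both models
    rng = np.random.default_rng(seed)
    n = diff.size
    # Each row is one bootstrap replicate: n origins drawn with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diff.mean(), lo, hi

# Synthetic per-origin APE values in percent (not real results):
# model A is shifted 0.15pp below model B on a shared difficulty profile.
rng = np.random.default_rng(1)
base = rng.gamma(2.0, 1.75, size=500)
model_a = base + rng.normal(0.0, 0.2, size=500) - 0.15
model_b = base + rng.normal(0.0, 0.2, size=500)
mean_diff, lo, hi = paired_bootstrap_ci(model_a, model_b)
print(f"diff = {mean_diff:.3f}pp, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Resampling the difference rather than each model separately is what lets a small gap resolve: the shared per-origin difficulty cancels out, leaving only the between-model noise.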

Why are there four different metrics at all?

Before staring at the table: the reason no single number is "the SOTA" for load forecasting is that the literature lives in four overlapping communities, and each picks its metric to answer a different question. Knowing which family a paper belongs to is most of how you read it.

family	typical metric	what it answers	why this metric
Operations / utilities (EIA-930, ISO reports, FERC, NREL)	MAPE, peak-period MAPE, $/MW-yr	"How much money or dispatch error will this cost us?"	MAPE is scale-free and intuitive ("we miss by 1%"). Operators care about percent error on actual operating quantities. It breaks down near y≈0, which is fine for total BA load but a problem for net load and solar.
Forecasting / stats (Hyndman lineage, M-competition, IJF)	MASE, sMAPE	"Is this model generally better than a sensible naive baseline, averaged across many series?"	MASE = your error ÷ seasonal-naive error on the same series. Designed for averaging across series of wildly different scale, which is the only thing that makes sense when benchmarking on 42 datasets.
Deep-learning forecasting (ETT / ECL / Traffic benchmarks)	MSE / MAE	"Did my architecture move the leaderboard on the established benchmark?"	Inherited from vision/NeurIPS-style benchmarks. Nobody thinks MSE is "right" for load; it's leaderboard inertia that keeps numbers comparable across architecture papers (Informer → Autoformer → PatchTST → iTransformer).
Probabilistic forecasting (DeepAR, Chronos, Lag-Llama, GEFCom2014)	Pinball loss / WQL / CRPS	"How well-calibrated is the whole predictive distribution, not just the point?"	If your model outputs a distribution, point MAPE makes little sense because it ignores the spread. These metrics score the full quantile/density forecast. Different question entirely.

The tier system below ("self / yes / partial / no") reflects that a foundation-model paper reporting MASE on 42 datasets and a utility paper reporting MAPE on one ISO are not asking the same question. Forcing a single ranking number across all four would mean conflating four optimization targets.

What we did about it. Since the metric is a post-hoc choice on a fixed prediction file, we rescored GridFloor's own predictions in every family's units. That converts "no comparison possible" into "comparison possible, but with caveats."
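
Because the predictions are fixed, rescoring is just a set of different reductions over the same two arrays. The sketch below illustrates that idea under stated assumptions: the function and key names are hypothetical, nMAE is taken as sum|e| / sum|y|, WRMSE follows the "RMSE/mean" description, and MASE is scaled by the seasonal-naive error on the same series (the classical definition scales by in-sample naive error instead). The load series is synthetic.

```python
import numpy as np

def rescore(y_true, y_pred, m=24):
    """Score one fixed set of point predictions under several metric families."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    abs_err = np.abs(err)
    pos = y_true > 0                       # MAPE-style metrics need y > 0
    ape = abs_err[pos] / y_true[pos]
    # Assumes no hour has both y_true and y_pred equal to zero.
    smape_terms = 2.0 * abs_err / (np.abs(y_true) + np.abs(y_pred))
    # Seasonal-naive forecast: repeat the value from m hours earlier.
    naive_mae = np.mean(np.abs(y_true[m:] - y_true[:-m]))
    return {
        "MAPE_pct": 100.0 * ape.mean(),
        "median_APE_pct": 100.0 * np.median(ape),
        "sMAPE_pct": 100.0 * smape_terms.mean(),
        "MASE": abs_err.mean() / naive_mae,
        "nMAE_pct": 100.0 * abs_err.sum() / np.abs(y_true).sum(),
        "WRMSE_pct": 100.0 * np.sqrt(np.mean(err ** 2)) / y_true.mean(),
    }

# Synthetic two-week hourly load with a daily cycle and slow drift,
# and a forecast that is 1% high everywhere (illustrative only).
hours = np.arange(24 * 14)
load = 1000.0 + 200.0 * np.sin(2 * np.pi * hours / 24) + 5.0 * (hours // 24)
pred = load * 1.01
scores = rescore(load, pred)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

On this toy series the same 1% bias reads as roughly 1% in MAPE and nMAE but as a MASE above 1, because the seasonal-naive baseline is very hard to beat on a near-perfectly periodic signal. That is the sense in which each family answers a different question.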

GridFloor in everyone else's units

Same predictions, same hold-ex-LDWP slice, recomputed under the metrics other papers use. This lets us at least place a GridFloor number alongside Chronos / TimesFM / Moirai numbers, with explicit honesty about what's incomparable.

metric	GridFloor	family	what it tells us
MAPE (%)	1.02	operations	headline number; what utilities care about
median APE (%)	0.66	operations	typical-day error; being lower than MAPE, it shows that tail days drive the mean
sMAPE (%)	1.04	operations / stats	symmetric variant; near-identical to MAPE here (no extreme bias)
MASE (m=24)	0.19	forecasting / stats	error is roughly 5× lower than seasonal-naive (1/0.19 ≈ 5.3)
nMAE (%)	0.98	operations	MAPE-like without the y→0 fragility
WRMSE (%)	2.00	operations	RMSE/mean; sensitive to tail hours (NEVP drives it up to 5.71%)
WQL (q=.1/.5/.9)	0.0067	probabilistic	placeholder: GridFloor is point-only, so this scores a post-hoc symmetric envelope around the point forecast
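
For the WQL row, one way to score a point-only model is to wrap a symmetric band around the point forecast and treat its edges as the 0.1 and 0.9 quantiles. The sketch below is a guess at that kind of construction, not the procedure actually used: the function names, the fixed `half_width` parameter, and the toy numbers are all hypothetical. WQL here follows the common definition of twice the summed pinball loss divided by the summed absolute target, averaged over quantiles.

```python
import numpy as np

def pinball(y, q_pred, q):
    """Pinball (quantile) loss per time step for quantile level q."""
    diff = y - q_pred
    return np.maximum(q * diff, (q - 1.0) * diff)

def wql_symmetric_envelope(y_true, y_pred, half_width):
    """WQL at q = 0.1 / 0.5 / 0.9 for a point forecast with a symmetric band.

    Assumption: the 0.1 and 0.9 quantiles sit half_width below and above
    the point forecast, which itself serves as the median.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    offsets = {0.1: -half_width, 0.5: 0.0, 0.9: half_width}
    scale = np.abs(y_true).sum()
    per_q = [
        2.0 * pinball(y_true, y_pred + off, q).sum() / scale
        for q, off in offsets.items()
    ]
    return float(np.mean(per_q))

# Toy example: a perfect point forecast with a band of +/- 10 units.
y = np.array([100.0, 200.0, 300.0])
wql = wql_symmetric_envelope(y, y, half_width=10.0)
print(f"WQL = {wql:.6f}")
```

At q = 0.5 the term reduces to sum|e| / sum|y|, so the median component is just nMAE. The other two terms depend entirely on how the band width is chosen, which is why the row is best read as a placeholder.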

Directly comparable, same metric, horizon, & data family

Context only, different metric / dataset (not a fair head-to-head)

The canonical architectures and foundation models below report MASE / WQL / pinball / nMAE / MSE on ETT / Monash / system-load datasets. This is strong work, but it is not MAPE on held-out US BAs, so these models can't be ranked against us directly. They are listed for completeness.