gridfloor

the specific calls, ideas, and references that steered the autonomous agents · what each one changed

The agents did the mechanical work; the direction was human. The ⭑ entries changed the protocol or reframed what counted as a result; the rest are knob-level steers.

The calls that shaped the result

⭑ verification

multi-seed everything before we call it a win, single-seed deltas here are inside noise. i want real signal, not incremental noise

Pushed back on a single-seed "win." Turned the protocol from "run it once" into multi-seed paired-bootstrap confirmation for every headline claim.

→ Caught the BA-mixup mirage: a single-seed +0.05pp "improvement" was a 20-vs-40-epoch compute confound. Refuted across 5 seeds. Without this push it would have shipped as a real gain.
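A minimal sketch of what a multi-seed paired-bootstrap check can look like, assuming one MAPE value per seed for the baseline and for the candidate, with both trained on the same seeds. The function names, the 10,000 resamples, and the 95% percentile interval are illustrative assumptions, not the project's committed script.

```python
import random
from statistics import mean


def paired_bootstrap_ci(baseline, candidate, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile CI for the mean paired difference (candidate - baseline).

    baseline[i] and candidate[i] must come from the same seed, so resampling
    the per-seed differences keeps the pairing intact.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need equal-length, non-empty paired samples")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    boot_means = sorted(
        mean([rng.choice(diffs) for _ in range(n)]) for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return mean(diffs), lo, hi


def is_real_gain(baseline, candidate, **kwargs):
    """Lower MAPE is better: count a gain only if the whole CI is below zero."""
    _, _, hi = paired_bootstrap_ci(baseline, candidate, **kwargs)
    return hi < 0
```

With made-up per-seed MAPEs such as baseline = [1.02, 1.05, 0.99, 1.03, 1.01] and candidate = [0.97, 1.08, 1.01, 1.02, 1.04], the first seed alone shows a 0.05pp improvement, but the interval over all five seeds includes zero, so is_real_gain returns False.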

⭑ reproducibility

recompute that dispatch finding from a committed script before we believe it, nothing counts til it reproduces

Insisted no result counts until reproduced from a committed script, not an agent's transcript.

→ Refuted our own most exciting claim. An agent reported "MAPE is operationally worthless (0.8% of dispatch value)." The committed recompute reversed it: forecast captures 86–89% of dispatch value. The flashy version survived only while uncommitted.

⭑ fair comparison

head-to-head vs prev sota on the EXACT same eval slice, same BAs, same window, same metric. no methodology mismatch

Demanded apples-to-apples vs prior SOTA. This reframed the comparison away from cross-paper MAPE (different BAs/horizon/year/metric) toward the real incumbent.

→ The strongest result in the project. The incumbent is each BA's own operational day-ahead forecast (EIA-930). Same 7 of 8 holdout BAs (LDWP held out from training too, but excluded from the headline due to EIA-930 telemetry artifacts), same window, same metric: 1.02% vs 9.04%, winning on every BA.
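A sketch of the "same slice, same metric" rule, assuming hourly rows that carry the actual load plus both forecasts. The row keys (ba, actual, model, incumbent) are hypothetical names chosen for illustration, not the EIA-930 column names.

```python
def mape(actual, forecast):
    """Mean absolute percentage error in percent; zero-load hours are skipped."""
    terms = [abs(f - a) / abs(a) for a, f in zip(actual, forecast) if a != 0]
    if not terms:
        raise ValueError("no non-zero actuals to score")
    return 100.0 * sum(terms) / len(terms)


def head_to_head(rows):
    """Score both forecasts per BA on exactly the same (BA, hour) rows.

    A row missing either forecast is dropped for both sides, so neither
    forecast is scored on hours the other one was not.
    Returns {ba: (model_mape, incumbent_mape)}.
    """
    by_ba = {}
    for row in rows:
        if row["model"] is None or row["incumbent"] is None:
            continue
        by_ba.setdefault(row["ba"], []).append(row)

    results = {}
    for ba, ba_rows in by_ba.items():
        actual = [r["actual"] for r in ba_rows]
        results[ba] = (
            mape(actual, [r["model"] for r in ba_rows]),
            mape(actual, [r["incumbent"] for r in ba_rows]),
        )
    return results
```

Dropping a row for both forecasts whenever one is missing is the design choice that matters here: it keeps the comparison on an identical eval slice rather than on whatever hours each forecast happens to cover.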

data

find more in-distribution data, where can we actually pull useful BA-hours from. scaling law says data's the only lever left

Refocused effort from architecture knobs onto the only lever the scaling law said still pays: in-distribution data.

→ Drove sub-BA zonal, cross-BA interchange, fuel-mix, ENTSO-E, FERC-714 XBRL. Most failed (distribution shift / redundancy), but XBRL delivered the predicted +0.014pp, confirming β ≈ 0.5.

synthetic data

push synthetic data thru gridlab-d properly (it's the namesake), then make it better: real per-year weather + btm-pv, see if fidelity crosses the bar to actually help

Pushed the project's namesake direction: physics-simulated demand as training data. Then pushed a v2 with real weather + BTM-PV inventory.

→ Honest negatives: synth-vs-real hourly error 26% → 19%, still above the ~15% needed to add signal. The shape/phase ceiling is structural, not an inputs problem.

budget

don't budget-gate experiments, spend more on modal if it makes the result clean. that should be known, it's key

Removed the budget gate, so experiments ran at proper scale (multi-seed, A100 fine-tunes, full-slice foundation inference) instead of being descoped.

→ Enabled the matched-compute multi-seed runs that refuted mixup and the L80 over-claim. Total spend stayed modest (~$60–70 Modal).

conditioning

could per-ba ACI just be a prompt-token / conditioning so everything stays one model instead of a separate post-hoc state machine?

Architectural reframe: fold the per-BA calibration into the model as conditioning rather than a separate post-hoc state machine.

→ Mapped the design space (trained per-BA bias adapter / in-context residual prompting / static gating) and why online ACI works: it uses test-time feedback a purely-trained conditioner can't see.
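To make the test-time feedback point concrete, here is a sketch that assumes "ACI" refers to adaptive conformal inference in the style of Gibbs and Candès: after each realized hour, the working miscoverage level is nudged up or down depending on whether the last interval covered the truth. The class name, the 90% target, the step size, and the residual window are all illustrative assumptions, not the project's implementation.

```python
import random


class OnlineACI:
    """Per-BA adaptive conformal interval around a point forecast (sketch)."""

    def __init__(self, target_alpha=0.1, gamma=0.01, window=500):
        self.target_alpha = target_alpha  # desired long-run miscoverage
        self.alpha_t = target_alpha       # working level, adapted online
        self.gamma = gamma                # step size of the update
        self.window = window              # how many residuals to keep
        self.scores = []                  # recent absolute residuals

    def radius(self):
        """Half-width: empirical (1 - alpha_t) quantile of recent residuals."""
        if not self.scores or self.alpha_t <= 0:
            return float("inf")
        if self.alpha_t >= 1:
            return 0.0
        ordered = sorted(self.scores)
        k = min(len(ordered) - 1, int((1 - self.alpha_t) * len(ordered)))
        return ordered[k]

    def update(self, forecast, actual):
        """Feed back one realized value; returns True if it was covered."""
        residual = abs(actual - forecast)
        covered = residual <= self.radius()
        err = 0.0 if covered else 1.0
        # Missed intervals lower alpha_t (wider next time); hits raise it.
        self.alpha_t += self.gamma * (self.target_alpha - err)
        self.scores = (self.scores + [residual])[-self.window:]
        return covered


def simulated_coverage(n_steps=5000, seed=0):
    """Toy check with Gaussian residuals around a zero forecast."""
    rng = random.Random(seed)
    aci = OnlineACI(target_alpha=0.1, gamma=0.01)
    hits = [aci.update(0.0, rng.gauss(0.0, 1.0)) for _ in range(n_steps)]
    return sum(hits) / n_steps
```

The update consumes the realized value of each hour after the fact. A conditioner fixed at training time has no equivalent input, which is the gap the outcome note describes.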

weather

are we using weather models rn? if not, does weather sharpen peak-timing even if it's flat on mape?

Surfaced that exogenous weather was tested and dropped (≤0.01pp on MAPE), and connected it to the dispatch finding.

→ Tested directly: weather is flat on peak-hour hit-rate and on dispatch value (paired CIs include zero). For load, this panel's cross-BA history already encodes the weather signal. Clean negative.
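The source does not define peak-hour hit-rate, so the sketch below assumes one plausible reading: the share of days on which the forecast's daily peak falls within a tolerance (in hours) of the actual daily peak. The function name and the tolerance parameter are hypothetical.

```python
def peak_hour_hit_rate(actual_days, forecast_days, tolerance=0):
    """Share of days where the forecast peak hour is within `tolerance`
    hours of the actual peak hour. Each day is a list of hourly loads."""
    if len(actual_days) != len(forecast_days) or not actual_days:
        raise ValueError("need equal-length, non-empty lists of days")
    hits = 0
    for actual, forecast in zip(actual_days, forecast_days):
        # Index of the maximum load; ties resolve to the earliest hour.
        actual_peak = max(range(len(actual)), key=actual.__getitem__)
        forecast_peak = max(range(len(forecast)), key=forecast.__getitem__)
        hits += abs(actual_peak - forecast_peak) <= tolerance
    return hits / len(actual_days)
```

A metric like this can move independently of MAPE, since a forecast can be close in magnitude at every hour and still place the peak an hour late.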

References & repos you pointed to

reference	how it was used
iTransformer (Liu et al. 2024, arXiv:2310.06625)	The architecture. Its variate-as-token + multivariate attention is the structural reason foundation models can't compete here.
Hong & Lee 2026 (arXiv:2602.21415)	The closest published benchmark. We re-ran the recipe (and their baselines) on their exact ISO-aggregate protocol for the controlled head-to-head.
EIA-930 · FERC-714 · ENTSO-E · NREL ResStock · ISO LMP (ERCOT/CAISO/NYISO)	Data sources you directed the agents to scrape and parse. EIA-930's native forecast column became the incumbent-SOTA comparison; ERCOT LMP calibrated the dispatch price model.
ENTSO-E API token	You provided the credential directly, unblocking the European-demand expansion harness that had been waiting on it.

The pattern

The throughline across the ⭑ calls is the same instinct: distrust the convenient result. "Is this above noise?", "recompute it", and "match the eval to prior SOTA" each demanded that a claim survive a harder test, which is what an autonomous-agent loop lacks on its own. The agents supplied breadth and throughput; the direction supplied the skepticism.

Why a 1.02% load forecast isn't already in production everywhere

A natural follow-on question: if GridFloor beats the operational incumbent forecast by ~9× on MAPE, why hasn't that delta already propagated to utilities? The reasons below are mostly structural rather than technical (incentive misalignment, policy lock-in, the research-to-production gap).

incentive misalignment

the entity that publishes the forecast and the entity that benefits from it being accurate are not the same

The ~$1.9B/yr dispatch value accrues to battery operators dispatching against the forecast, not to the utility that publishes it to EIA-930. Utilities are regulated entities; better load forecasting doesn't raise their rate of return. There's no market where "BA day-ahead forecast quality" is a traded service, so the value sits stranded between parties who don't transact.

policy lock-in

NERC operating-reserve standards are set as fixed percentages above forecast demand, not dynamically scaled to forecast error

The whole-grid "less reserves → fewer peakers → lower wholesale prices and emissions" prize is real but locked behind a policy lever that doesn't update with forecast quality. A BA improving from 9% to 1% MAPE doesn't immediately get to hold less reserve. Until reserve standards become dynamic, that ~$3–8B/yr in unrealized whole-grid value is invisible to the entities that could capture it.

already happening (just not here)

battery operators with the most direct exposure to forecast quality build it in-house and don't share

Tesla Autobidder, Habitat Energy, Fluence, Form Energy, Modo Energy clients, and a stack of IPP-side trading desks all run in-house load + price forecasting on the same EIA-930 + private weather data. Modo's market reports show operators with above-median forecasting capture ~$10–30k/MW-yr more than peers. They build it themselves because (a) it's their direct revenue lever, (b) leaking it to competitors is a moat problem, and (c) "buy a forecast" doesn't exist at the quality tier that matters. So the value is being partially captured, just not via published academic models.

where big tech actually is in this space

it's not OpenAI / Anthropic, those companies are chat-and-code shops. it's Google / DeepMind

Google Research shipped TimesFM (the foundation model we benchmark against on the literature page); DeepMind published the data-center cooling RL work and the wind-farm value forecasting result. Salesforce shipped Moirai. Microsoft has Azure energy-vertical product partnerships (Schneider Electric, etc.). National labs (NREL, LBNL, PNNL) have decades of forecasting work; the NREL CNN load forecaster sits in our literature table. OpenAI and Anthropic don't have grid-energy products or research initiatives; generic API customers in utilities aren't the same as targeted work.

research-to-production engineering gap

a model that hits 1.02% MAPE in a /tmp parquet is not a deployable forecasting service

Productionization needs the unglamorous stack: a daily inference pipeline with SLAs (forecast available by 5am every day or a trading desk misses bid windows), graceful handling of EIA-930 data dropouts (LDWP-style telemetry artifacts happen on other BAs too), model-drift detection plus automated retraining, fallback to a simpler model when the perm-iTransformer output looks anomalous, and integration with whatever bidding stack the operator uses (ISO-specific API quirks). That's a 6–12 month engineering project that's not interesting as research, isn't capitalized by research-funding bodies, and is invisible to academic incentive structures.
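One item from that stack, the fallback when the primary output looks anomalous, can be sketched in a few lines. This assumes hourly load history as a plain list of floats, uses a seasonal-naive forecast (same hour one week earlier) as a stand-in for "a simpler model", and flags anomalies with a hypothetical range check. The 1.5× ratio and the function names are illustrative choices, not anything specified by the project.

```python
import math

HOURS_PER_WEEK = 168


def seasonal_naive(history, horizon=24, season=HOURS_PER_WEEK):
    """Repeat the load observed one season (default: one week) earlier."""
    if len(history) < season:
        raise ValueError("need at least one full season of history")
    start = len(history) - season
    return [history[start + (h % season)] for h in range(horizon)]


def looks_anomalous(forecast, history, max_ratio=1.5):
    """Flag non-finite values or values far outside last week's load range."""
    recent = history[-HOURS_PER_WEEK:]
    lo = min(recent) / max_ratio
    hi = max(recent) * max_ratio
    return any((not math.isfinite(x)) or x < lo or x > hi for x in forecast)


def guarded_forecast(primary, history, horizon=24):
    """Return (forecast, source), falling back when the primary is suspect."""
    if len(primary) != horizon or looks_anomalous(primary, history):
        return seasonal_naive(history, horizon), "fallback"
    return list(primary), "primary"
```

Returning the source label alongside the forecast lets a downstream pipeline log how often the fallback fires, which would also feed the drift-detection item in the same list.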

the published 9.04% isn't always what's used

some BAs publish a stale auto-generated forecast for FERC compliance; their internal operational forecast can be better

Honesty check on our own headline. The 9.04% is each BA's day-ahead forecast as filed to EIA-930. Inspection suggests some BAs publish a stale auto-generated forecast purely for compliance; their internal operational system can be much better (3–5%). Our 9× win is real against the published baseline but is closer to 2–3× against what some sophisticated operators internally use. The clean BAs (BPAT, DUK, TVA at ~2–2.5% incumbent) set the operator-comparison floor; we still win 2–3× there, but it's not 9×.

GridFloor's contribution as a research artifact is to show what's possible at the floor, under a controlled eval, with one architecture choice that solves a structural foundation-model gap. Whether that floor gets adopted in production depends on (a) policy lever updates (NERC reserve standards), (b) the buy-vs-build calculus at battery operators, and (c) someone writing the productionization layer. None of those are research questions.

Quotes are paraphrased from the working session. Every outcome links to a committed branch; see the flowchart node drawers for the exact scripts.
