Research directions — the human in the loop
the specific calls, ideas, and references that steered the autonomous agents · what each one changed
The agents did the mechanical work; the direction was human. A few of these calls didn't just tweak a knob — they changed what counted as a result, and one of them produced the strongest finding in the project. The ⭑ entries are the ones that mattered most.
The calls that shaped the result
⭑ verification
multi-seed everything before we call it a win — single-seed deltas here are inside noise. i want real signal, not incremental noise
Pushed back on a single-seed "win." Turned the protocol from "run it once" into multi-seed paired-bootstrap confirmation for every headline claim.
→ Caught the BA-mixup mirage: a single-seed +0.05pp "improvement" was a 20-vs-40-epoch compute confound. Refuted across 5 seeds. Without this push it would have shipped as a real gain.
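A minimal sketch of what a multi-seed paired-bootstrap check can look like. It assumes one score per seed for each arm, paired by seed and run at matched compute. The function names, the choice of seeds as the resampling unit, and the example numbers are illustrative, not taken from the committed scripts.

```python
import random


def paired_bootstrap_ci(baseline, candidate, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean paired difference (candidate - baseline).

    Both inputs are per-seed scores in the same seed order, so entry i of each
    list comes from the same seed and the same compute budget.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample the paired differences with replacement and record each mean.
    boot_means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = boot_means[int((alpha / 2) * (n_boot - 1))]
    hi = boot_means[int((1 - alpha / 2) * (n_boot - 1))]
    return sum(diffs) / n, lo, hi


def is_real_gain(baseline, candidate, **kwargs):
    """Lower error is better, so a gain needs the whole CI below zero."""
    _, _, hi = paired_bootstrap_ci(baseline, candidate, **kwargs)
    return hi < 0.0


# Hypothetical per-seed MAPE values (percent), for illustration only.
baseline_mape = [1.10, 1.12, 1.08, 1.11, 1.09]
candidate_mape = [1.07, 1.15, 1.10, 1.05, 1.12]
print(paired_bootstrap_ci(baseline_mape, candidate_mape))
print(is_real_gain(baseline_mape, candidate_mape))
```

With only five seeds the bootstrap distribution is coarse, so a check like this is better at rejecting a single-seed mirage than at certifying a small gain.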
⭑ reproducibility
recompute that dispatch finding from a committed script before we believe it — nothing counts til it reproduces
Insisted no result counts until reproduced from a committed script — not an agent's transcript.
→ Refuted our own most exciting claim. An agent reported "MAPE is operationally worthless (0.8% of dispatch value)." The committed recompute reversed it: the forecast captures 86–89% of dispatch value. The flashy version survived only while uncommitted.
⭑ fair comparison
head-to-head vs prev sota on the EXACT same eval slice — same BAs, same window, same metric. no methodology mismatch
Demanded apples-to-apples vs prior SOTA. This reframed the comparison away from cross-paper MAPE (different BAs/horizon/year/metric) toward the real incumbent.
→ The strongest result in the project. The incumbent is each BA's own operational day-ahead forecast (EIA-930). The comparison covers 7 of the 8 holdout BAs, the same window, and the same metric: 1.02% vs 9.04%, winning on every BA. LDWP was also held out from training but is excluded from the headline because of EIA-930 telemetry artifacts.
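One way to enforce "the exact same eval slice" is to score both forecasts only on rows where both exist, per BA. The sketch below assumes the metric is MAPE and that each row is a dict with hypothetical keys `ba`, `actual`, `model`, and `incumbent`. These are not EIA-930 column names and this is not the project's evaluation script.

```python
from collections import defaultdict


def mape(actuals, forecasts):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(f - a) / abs(a) for a, f in zip(actuals, forecasts)) / len(actuals)


def head_to_head(rows, bas):
    """Score model and incumbent on an identical set of rows for each BA.

    rows: iterable of dicts with hypothetical keys "ba", "actual", "model",
    "incumbent"; a missing value is None.
    bas: the set of BAs in the comparison.
    """
    per_ba = defaultdict(lambda: ([], [], []))
    for r in rows:
        if r["ba"] not in bas:
            continue
        # Drop a row if either forecast or the actual is missing, so neither
        # side is scored on hours the other side never saw.
        if r["actual"] is None or r["model"] is None or r["incumbent"] is None:
            continue
        if r["actual"] == 0:
            continue  # MAPE is undefined at zero load
        actuals, model, incumbent = per_ba[r["ba"]]
        actuals.append(r["actual"])
        model.append(r["model"])
        incumbent.append(r["incumbent"])
    return {
        ba: {"model": mape(a, m), "incumbent": mape(a, i), "n": len(a)}
        for ba, (a, m, i) in per_ba.items()
    }
```

Reporting `n` alongside each score makes it easy to confirm that both columns were computed on the same number of BA-hours.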
data
find more in-distribution data — where can we actually pull useful BA-hours from. scaling law says data's the only lever left
Refocused effort from architecture knobs onto the only lever the scaling law said still pays — in-distribution data.
→ Drove sub-BA zonal, cross-BA interchange, fuel-mix, ENTSO-E, FERC-714 XBRL. Most failed (distribution shift / redundancy), but XBRL delivered the predicted +0.014pp — confirming β ≈ 0.5.
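For readers who want the arithmetic behind a "predicted" gain, here is a sketch of a power-law data-scaling prediction. The functional form (error = floor + a · N^(−β)), the parameter names, and the example numbers are assumptions for illustration. The project's fitted constants are not reproduced here.

```python
def power_law_error(n, a, beta, floor=0.0):
    """Error predicted by an assumed power law in training-set size n."""
    return floor + a * n ** (-beta)


def predicted_gain(n_before, n_after, a, beta, floor=0.0):
    """Predicted error reduction from growing the data from n_before to n_after."""
    return power_law_error(n_before, a, beta, floor) - power_law_error(n_after, a, beta, floor)


# Illustrative only: with beta = 0.5, quadrupling the data halves the reducible error.
print(predicted_gain(n_before=100, n_after=400, a=1.0, beta=0.5))
```

The useful property is that the prediction is made before the run, so a measured gain that lands on it is evidence for the fitted exponent rather than a post-hoc story.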
synthetic data
push synthetic data thru gridlab-d properly (it's the namesake) — then make it better: real per-year weather + btm-pv, see if fidelity crosses the bar to actually help
Pushed the project's namesake direction: physics-simulated demand as training data. Then pushed a v2 with real weather + BTM-PV inventory.
→ Honest negatives: synth-vs-real hourly error 26% → 19%, still above the ~15% needed to add signal. The shape/phase ceiling is structural, not an inputs problem.
budget
don't budget-gate experiments — spend more on modal if it makes the result clean. that should be known, it's key
Removed the budget gate, so experiments ran at proper scale (multi-seed, A100 fine-tunes, full-slice foundation inference) instead of being descoped.
→ Enabled the matched-compute multi-seed runs that refuted mixup and the L80 over-claim. Total spend stayed modest (~$60–70 Modal).
conditioning
could per-ba ACI just be a prompt-token / conditioning so everything stays one model instead of a separate post-hoc state machine?
A sharp architectural idea — fold the per-BA calibration into the model as conditioning rather than a separate post-hoc state machine.
→ Mapped the design space (trained per-BA bias adapter / in-context residual prompting / static gating) and explained why online ACI works: it uses test-time feedback that a purely trained conditioner can't see.
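To make the "test-time feedback" point concrete, here is a minimal sketch of a per-BA online state machine. It assumes ACI here means adaptive conformal inference in the sense of Gibbs and Candès (2021), with absolute residuals as conformity scores. The class name, window size, and step size are hypothetical, not the project's implementation.

```python
from collections import defaultdict, deque


class OnlineACI:
    """Adaptive conformal inference state for one BA (illustrative sketch)."""

    def __init__(self, target_alpha=0.1, gamma=0.01, window=500):
        self.target_alpha = target_alpha  # desired long-run miss rate
        self.gamma = gamma                # step size of the online update
        self.alpha_t = target_alpha       # working miss rate, adapted over time
        self.scores = deque(maxlen=window)  # recent absolute residuals

    def interval(self, point_forecast):
        # No history yet, or alpha_t driven to zero: cover everything.
        if not self.scores or self.alpha_t <= 0.0:
            return float("-inf"), float("inf")
        # alpha_t at or above one: degenerate (empty) interval.
        if self.alpha_t >= 1.0:
            return point_forecast, point_forecast
        ranked = sorted(self.scores)
        k = min(len(ranked) - 1, int((1.0 - self.alpha_t) * len(ranked)))
        q = ranked[k]
        return point_forecast - q, point_forecast + q

    def update(self, point_forecast, actual):
        """Observe the realized value and adapt; returns 1.0 on a miss, else 0.0."""
        lo, hi = self.interval(point_forecast)
        err = 0.0 if lo <= actual <= hi else 1.0
        # Core ACI step: a miss lowers alpha_t (wider intervals next time),
        # a cover raises it slightly (narrower intervals).
        self.alpha_t += self.gamma * (self.target_alpha - err)
        self.scores.append(abs(actual - point_forecast))
        return err


# One independent state per BA, created on first use.
per_ba_state = defaultdict(OnlineACI)
```

The `update` step consumes the realized value, which only exists at test time. A conditioner trained offline has no equivalent input, which is the gap described above.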
weather
are we using weather models rn? if not — does weather sharpen peak-timing even if it's flat on mape?
Surfaced that exogenous weather was tested and dropped (≤0.01pp on MAPE) — and connected it to the dispatch finding.
→ Tested directly: weather is flat on peak-hour hit-rate and on dispatch value (paired CIs include zero). For load, this panel's cross-BA history already encodes the weather signal. Clean negative.
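A sketch of one plausible definition of peak-hour hit-rate: the fraction of days on which the forecast's peak hour falls within a tolerance of the actual peak hour. The definition, the function name, and the `tolerance_hours` parameter are assumptions. The committed metric may differ.

```python
def peak_hour_hit_rate(actual_days, forecast_days, tolerance_hours=0):
    """Fraction of days where the forecast peak hour is within tolerance of the actual.

    actual_days, forecast_days: equal-length lists of per-day hourly load lists.
    Ties within a day resolve to the earliest hour.
    """
    if len(actual_days) != len(forecast_days) or not actual_days:
        raise ValueError("need the same non-zero number of days in both inputs")
    hits = 0
    for actual, forecast in zip(actual_days, forecast_days):
        actual_peak = max(range(len(actual)), key=actual.__getitem__)
        forecast_peak = max(range(len(forecast)), key=forecast.__getitem__)
        if abs(actual_peak - forecast_peak) <= tolerance_hours:
            hits += 1
    return hits / len(actual_days)
```

Computing this per seed for the with-weather and without-weather models gives paired scores that can go through the same paired-CI check used for the headline claims.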
References & repos you pointed to
| reference | how it was used |
| --- | --- |
| iTransformer Liu et al. 2024, arXiv:2310.06625 | The architecture. Its variate-as-token + multivariate attention is the structural reason foundation models can't compete here. |
| Hong & Lee 2026 arXiv:2602.21415 | The closest published benchmark. We re-ran the recipe (and their baselines) on their exact ISO-aggregate protocol for the controlled head-to-head. |
| EIA-930 · FERC-714 · ENTSO-E · NREL ResStock · ISO LMP (ERCOT/CAISO/NYISO) | Data sources you directed the agents to scrape and parse. EIA-930's native forecast column became the incumbent-SOTA comparison; ERCOT LMP calibrated the dispatch price model. |
| ENTSO-E API token | You provided the credential directly, unblocking the European-demand expansion harness that had been waiting on it. |
The pattern
The throughline across the ⭑ calls is the same instinct: distrust the convenient result. "Is this above noise?", "recompute it," "match the eval to prior SOTA" — each one demanded that a claim survive a harder test. That's exactly what an autonomous-agent loop lacks on its own, and it's what turned a pile of experiments into a defensible result. The agents supplied breadth and throughput; the direction supplied the skepticism.
Quotes are paraphrased from the working session. Every outcome links to a committed branch — see the flowchart node drawers for the exact scripts.