Autoresearch Surrogate Loop¶
Status: Active Audience: Team, AI agents Purpose: Current surrogate-loop strategy and acceptance logic for autoresearch
Purpose¶
This document formalises the following pivot:
stop running
autoresearchdirectly against an end-to-end metric that is too costly;refocus the autonomous loop on a fusion-only surrogate loop;
reserve full-stack metrics for deferred validation of the best candidates.
Observed Problem¶
The autoresearch pattern is only worthwhile if it can chain many iterations
autonomously.
On this branch, even recent variants of fast metrics remain too expensive:
product-score: too slow for an autonomous loop;product-score-mid: still too costly for broad exploration;product-score-fast-fast: shorter, but still too long to target 50+ runs per session.
The real problem is therefore not just the metric. It is the granularity of the search:
we retrain the entire stack for a change that is often local;
we pay a full cost for a modification that only concerns the fusion;
the autonomous loop spends its budget on certification instead of exploring.
Decision¶
The next autoresearch mode must no longer be:
full stack -> full metric -> full guardrails
at every iteration.
It must become:
very cheap local surrogate -> many iterationsthen
deferred validationon the best candidates only.
Principle of the Fusion-Only Loop¶
Fusion is currently the best entry point for a surrogate loop, because:
its scope is bounded;
its changes are often local;
it consumes signals already produced by
headerandvalues;it can be retrained much faster if the right caches are put in place.
The idea is to freeze, for a given benchmark:
the splits;
the outputs of the
headerbranch;the outputs of the
valuesbranch;the target labels;
the base fusion features.
Then the autoresearch loop only retrains the fusion model, or a small
variant of its features, without redoing the full stack on every run.
What the Surrogate Loop Must Optimise¶
The local metric must not pretend to replace product truth. It must be:
fast;
stable;
sensitive to genuine fusion gains;
sufficiently correlated with stack validation to serve as a filter.
In this mode, we optimise a FusionSurrogateScore computed from pre-prepared folds.
This score should mainly reward:
the right
header/valuescombination;robustness on anonymous columns;
quality on
en_field;quality on
tropical_fieldandresearch_traitsif these examples are present in the cache;reduction of
statistic.countfalse positives on coded columns.
Caches to Produce¶
Speed depends on this.
For each frozen fold, we want to store:
train_recordstest_recordsheaderprobabilities aligned on all conceptsvaluesprobabilities aligned on all conceptsbase fusion meta-features
target labels
bucket metadata:
tropical_fieldresearch_traitsen_fieldgbif_core_standardanonymous
Recommended format:
data/cache/ml/fusion_surrogate/<cache_version>/fold_*.npzplus a
manifest.jsondescribing:gold set hash
concept list
protocol
feature version
Two Surrogate Levels¶
1. Ultra-Fast Surrogate¶
Goal:
explore broadly;
reject quickly;
accept that it is slightly approximate.
Recommended content:
a few fixed folds;
critical buckets only;
fusion-only retraining;
no header/values rerun.
2. Promotion Surrogate¶
Goal:
confirm the best candidates;
verify that a local fusion gain is not heading in a misleading direction.
Recommended content:
same cache, but more folds or more buckets;
still without retraining the entire stack;
higher cost than ultra-fast, but still well below the full
product-score.
Recommended Validation Chain¶
The new chain should become:
fusion-surrogate-fastfusion-surrogate-midproduct-score-fast-fastproduct-score-midproduct-scoreniamoto-score
Interpretation:
the first two levels serve autonomous exploration;
levels 3 to 6 serve promotion of genuine winners.
Recommended Acceptance Rule¶
A candidate can be:
candidatebeats
fusion-surrogate-fast
provisionalbeats
fusion-surrogate-mid
promotablebeats
product-score-fast-fast
validatedbeats
product-score-mid
certifiedbeats
product-scoredoes not break
niamoto-score
The important point:
the autonomous loop must not wait for
certifiedbefore continuing to run;it should primarily produce a queue of promising candidates.
Recommended Initial Scope¶
First recommended work:
fusion-onlyloop
Relevant files:
ml/scripts/train/train_fusion.pysrc/niamoto/core/imports/ml/classifier.pyfuture surrogate cache script
Not to include in the first loop:
alias registry
profiler rules
gold set
dashboard
product docs
Non-Goals¶
This pivot does not seek to:
replace real product validation;
hide global regressions;
remove
product-scoreorniamoto-score.
It seeks to:
make
autoresearchfinally practical;restore cadence to exploration;
clearly separate exploration from certification.
Current State¶
The minimal building blocks are now in place:
build the
fusion_surrogatecache via:
uv run python -m ml.scripts.research.build_fusion_surrogate_cache --gold-set ml/data/gold_set.json --splits 3
expose a command such as:
uv run python -m ml.scripts.eval.evaluate --model fusion --metric surrogate-fast
also expose:
uv run python -m ml.scripts.eval.evaluate --model fusion --metric surrogate-mid
run
autoresearchonly on these metricspromote only the best candidates to full stack validation
Local runner added:
uv run python -m ml.scripts.research.run_fusion_surrogate_autoresearch --iterations 50
Behaviour:
computes
surrogate-fast,surrogate-midbaselinesdefers
product-score-fast-fastuntil the first candidate that passessurrogate-midlaunches one
codexiteration per candidateevaluates gates itself
reverts losers
automatically commits winners to keep a clean worktree
writes a JSONL log under
.autoresearch/
First Measurements¶
On the current gold set:
fusion_surrogatecache build (splits=3):approximately
490s(~8m10s)
surrogate-faston warm cache:approximately
1.66s
surrogate-midon warm cache:approximately
1.31s
Reading:
the initial build remains a notable one-shot cost;
however, fusion-only runs are now fast enough to support a real
autoresearchloop.
Reference Decision¶
As long as this surrogate loop does not exist, full-stack autoresearch
remains structurally too slow to deliver its real value.