Autoresearch Surrogate Loop

Status: Active
Audience: Team, AI agents
Purpose: Current surrogate-loop strategy and acceptance logic for autoresearch

Purpose

This document formalises the following pivot:

  • stop running autoresearch directly against an end-to-end metric that is too costly;

  • refocus the autonomous loop on a fusion-only surrogate loop;

  • reserve full-stack metrics for deferred validation of the best candidates.

Observed Problem

The autoresearch pattern is only worthwhile if it can chain many iterations autonomously.

On this branch, even recent variants of fast metrics remain too expensive:

  • product-score: too slow for an autonomous loop;

  • product-score-mid: still too costly for broad exploration;

  • product-score-fast-fast: shorter, but still too long to target 50+ runs per session.

The real problem is therefore not just the metric. It is the granularity of the search:

  • we retrain the entire stack for a change that is often local;

  • we pay a full cost for a modification that only concerns the fusion;

  • the autonomous loop spends its budget on certification instead of exploring.

Decision

The next autoresearch mode must no longer be:

  • full stack -> full metric -> full guardrails

at every iteration.

It must become:

  • very cheap local surrogate -> many iterations

  • then deferred validation on the best candidates only.

Principle of the Fusion-Only Loop

Fusion is currently the best entry point for a surrogate loop, because:

  • its scope is bounded;

  • its changes are often local;

  • it consumes signals already produced by header and values;

  • it can be retrained much faster if the right caches are put in place.

The idea is to freeze, for a given benchmark:

  • the splits;

  • the outputs of the header branch;

  • the outputs of the values branch;

  • the target labels;

  • the base fusion features.

Then the autoresearch loop only retrains the fusion model, or a small variant of its features, without redoing the full stack on every run.
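The loop above can be sketched as follows. Everything except the fusion model comes from frozen arrays; each iteration only rebuilds fusion features and refits the fusion model. All names, and the toy logistic-regression stand-in for the fusion model, are illustrative assumptions, not the real pipeline:

```python
# Sketch of a fusion-only iteration over a frozen fold. The frozen artifacts
# (header/values probabilities, meta-features, labels) are simulated here;
# in practice they would be loaded from the cache instead of generated.
import numpy as np

rng = np.random.default_rng(0)

def make_fold(n=200, n_concepts=5):
    # Stand-ins for the frozen per-fold artifacts.
    header_p = rng.random((n, n_concepts))
    values_p = rng.random((n, n_concepts))
    meta = rng.random((n, 3))
    y = (header_p.argmax(1) == values_p.argmax(1)).astype(int)
    return header_p, values_p, meta, y

def fusion_features(header_p, values_p, meta):
    # The only part a candidate change is allowed to touch.
    agree = (header_p.argmax(1) == values_p.argmax(1)).astype(float)
    return np.column_stack([header_p.max(1), values_p.max(1), agree, meta])

def fit_logreg(X, y, lr=0.5, steps=300):
    # Minimal gradient-descent logistic regression; the real fusion
    # model may be something else entirely.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

header_p, values_p, meta, y = make_fold()
X = fusion_features(header_p, values_p, meta)   # cheap: no header/values rerun
w, b = fit_logreg(X, y)                         # cheap: fusion-only refit
p = 1 / (1 + np.exp(-(X @ w + b)))
acc = ((p > 0.5).astype(int) == y).mean()
```

The point of the sketch is the cost profile: `make_fold` happens once when the cache is built, while `fusion_features` and `fit_logreg` are the only work repeated per candidate.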

What the Surrogate Loop Must Optimise

The local metric must not pretend to replace product truth. It must be:

  • fast;

  • stable;

  • sensitive to genuine fusion gains;

  • sufficiently correlated with stack validation to serve as a filter.

In this mode, we optimise a FusionSurrogateScore computed from pre-prepared folds.

This score should mainly reward:

  • the right header/values combination;

  • robustness on anonymous columns;

  • quality on en_field;

  • quality on tropical_field and research_traits if these examples are present in the cache;

  • reduction of statistic.count false positives on coded columns.
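As a sketch, such a score could be a bucket-weighted accuracy minus a false-positive penalty. Only the bucket names come from this document; the weights and the penalty factor below are assumptions:

```python
# Hypothetical FusionSurrogateScore: weighted mean accuracy over the buckets
# present in the cached folds, minus a penalty for statistic.count false
# positives on coded columns. Weights are illustrative, not the real ones.
BUCKET_WEIGHTS = {
    "anonymous": 2.0,         # robustness on anonymous columns
    "en_field": 1.5,
    "tropical_field": 1.5,    # counted only if present in the cache
    "research_traits": 1.5,   # counted only if present in the cache
    "gbif_core_standard": 1.0,
}

def fusion_surrogate_score(per_bucket_accuracy, fp_rate_statistic_count):
    # Normalise over the buckets actually present, so missing buckets
    # (e.g. tropical_field absent from the cache) do not distort the score.
    num = sum(BUCKET_WEIGHTS[b] * a for b, a in per_bucket_accuracy.items())
    den = sum(BUCKET_WEIGHTS[b] for b in per_bucket_accuracy)
    return num / den - 0.5 * fp_rate_statistic_count

score = fusion_surrogate_score(
    {"anonymous": 0.8, "en_field": 0.9, "gbif_core_standard": 0.95},
    fp_rate_statistic_count=0.05,
)
```

Weighting `anonymous` highest reflects the emphasis on robustness; the exact numbers would be tuned against correlation with full-stack validation.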

Caches to Produce

Speed depends on this.

For each frozen fold, we want to store:

  • train_records

  • test_records

  • header probabilities aligned on all concepts

  • values probabilities aligned on all concepts

  • base fusion meta-features

  • target labels

  • bucket metadata:

    • tropical_field

    • research_traits

    • en_field

    • gbif_core_standard

    • anonymous

Recommended format:

  • data/cache/ml/fusion_surrogate/<cache_version>/fold_*.npz

  • plus a manifest.json describing:

    • gold set hash

    • concept list

    • protocol

    • feature version
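A minimal sketch of producing this layout, assuming NumPy `.npz` archives and a JSON manifest; the array key names and manifest values are illustrative, and the gold set hash is a placeholder:

```python
# Write fold_*.npz files plus a manifest.json in the recommended layout.
# Key names and manifest fields are assumptions about the schema.
import json
import tempfile
from pathlib import Path
import numpy as np

root = Path(tempfile.mkdtemp()) / "data" / "cache" / "ml" / "fusion_surrogate" / "v1"
root.mkdir(parents=True)

for i in range(3):
    np.savez(
        root / f"fold_{i}.npz",
        header_probs=np.zeros((10, 4)),            # aligned on all concepts
        values_probs=np.zeros((10, 4)),            # aligned on all concepts
        meta_features=np.zeros((10, 3)),           # base fusion meta-features
        labels=np.zeros(10, dtype=int),            # target labels
        bucket=np.array(["anonymous"] * 10),       # bucket metadata
    )

manifest = {
    "gold_set_hash": "sha256:<hash>",              # placeholder, not computed here
    "concepts": ["c0", "c1", "c2", "c3"],
    "protocol": "frozen-splits-v1",
    "feature_version": 1,
}
(root / "manifest.json").write_text(json.dumps(manifest, indent=2))

folds = sorted(root.glob("fold_*.npz"))
```

The manifest is what lets a surrogate run refuse a stale cache: if the gold set hash, concept list, or feature version does not match, the run should rebuild rather than silently score against frozen but outdated artifacts.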

Two Surrogate Levels

1. Ultra-Fast Surrogate

Goal:

  • explore broadly;

  • reject quickly;

  • accept that it is slightly approximate.

Recommended content:

  • a few fixed folds;

  • critical buckets only;

  • fusion-only retraining;

  • no header/values rerun.

2. Promotion Surrogate

Goal:

  • confirm the best candidates;

  • verify that a local fusion gain does not point in a direction that would mislead the full stack.

Recommended content:

  • same cache, but more folds or more buckets;

  • still without retraining the entire stack;

  • higher cost than ultra-fast, but still well below the full product-score.
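The two levels could be captured in a single configuration, as in this hypothetical sketch; fold and bucket counts are illustrative, only their relative ordering matters:

```python
# Hypothetical configuration for the two surrogate levels. Neither level
# reruns header/values: both read the same frozen cache.
SURROGATE_LEVELS = {
    "surrogate-fast": {
        "folds": 2,                               # a few fixed folds
        "buckets": ["anonymous", "en_field"],     # critical buckets only
        "rerun_header_values": False,
        "purpose": "explore broadly, reject quickly",
    },
    "surrogate-mid": {
        "folds": 5,                               # more folds, same cache
        "buckets": "all-cached",                  # every bucket in the cache
        "rerun_header_values": False,
        "purpose": "confirm best candidates before full-stack validation",
    },
}
```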

Non-Goals

This pivot does not seek to:

  • replace real product validation;

  • hide global regressions;

  • remove product-score or niamoto-score.

It seeks to:

  • make autoresearch finally practical;

  • restore cadence to exploration;

  • clearly separate exploration from certification.

Current State

The minimal building blocks are now in place:

  1. build the fusion_surrogate cache via:

uv run python -m ml.scripts.research.build_fusion_surrogate_cache --gold-set ml/data/gold_set.json --splits 3

  2. expose a command such as:

uv run python -m ml.scripts.eval.evaluate --model fusion --metric surrogate-fast

  3. also expose:

uv run python -m ml.scripts.eval.evaluate --model fusion --metric surrogate-mid

  4. run autoresearch only on these metrics

  5. promote only the best candidates to full-stack validation

Local runner added:

uv run python -m ml.scripts.research.run_fusion_surrogate_autoresearch --iterations 50

Behaviour:

  • computes surrogate-fast and surrogate-mid baselines

  • defers product-score-fast-fast until the first candidate that passes surrogate-mid

  • launches one codex iteration per candidate

  • evaluates gates itself

  • reverts losers

  • automatically commits winners to keep a clean worktree

  • writes a JSONL log under .autoresearch/
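The gate-and-log behaviour can be sketched as follows; the JSONL field names and the threshold values are assumptions about the runner's schema, not its actual output:

```python
# Sketch of the per-candidate gate the runner applies, reading records in
# the JSONL shape assumed here (one JSON object per iteration).
import io
import json

def gate(candidate, baseline):
    # A candidate survives only if it beats both surrogate baselines;
    # losers are reverted, winners are committed.
    return (candidate["surrogate_fast"] > baseline["surrogate_fast"]
            and candidate["surrogate_mid"] > baseline["surrogate_mid"])

# Simulated .autoresearch/ log with two iterations.
log = io.StringIO(
    '{"iter": 1, "surrogate_fast": 0.81, "surrogate_mid": 0.79}\n'
    '{"iter": 2, "surrogate_fast": 0.86, "surrogate_mid": 0.84}\n'
)
baseline = {"surrogate_fast": 0.83, "surrogate_mid": 0.80}

records = [json.loads(line) for line in log]
winners = [r["iter"] for r in records if gate(r, baseline)]
```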

First Measurements

On the current gold set:

  • fusion_surrogate cache build (splits=3):

    • approximately 490s (~8m10s)

  • surrogate-fast on warm cache:

    • approximately 1.66s

  • surrogate-mid on warm cache:

    • approximately 1.31s

Reading:

  • the initial build remains a notable one-shot cost;

  • however, fusion-only runs are now fast enough to support a real autoresearch loop.

Reference Decision

As long as this surrogate loop does not exist, full-stack autoresearch remains structurally too slow to deliver its real value.