# Autoresearch Surrogate Loop > Status: Active > Audience: Team, AI agents > Purpose: Current surrogate-loop strategy and acceptance logic for > autoresearch ## Purpose This document formalises the following pivot: - stop running `autoresearch` directly against an end-to-end metric that is too costly; - refocus the autonomous loop on a **fusion-only surrogate loop**; - reserve full-stack metrics for deferred validation of the best candidates. ## Observed Problem The `autoresearch` pattern is only worthwhile if it can chain many iterations autonomously. On this branch, even recent variants of fast metrics remain too expensive: - `product-score`: too slow for an autonomous loop; - `product-score-mid`: still too costly for broad exploration; - `product-score-fast-fast`: shorter, but still too long to target 50+ runs per session. The real problem is therefore not just the metric. It is the **granularity of the search**: - we retrain the entire stack for a change that is often local; - we pay a full cost for a modification that only concerns the fusion; - the autonomous loop spends its budget on certification instead of exploring. ## Decision The next `autoresearch` mode must no longer be: - `full stack -> full metric -> full guardrails` at every iteration. It must become: - `very cheap local surrogate -> many iterations` - then `deferred validation` on the best candidates only. ## Principle of the Fusion-Only Loop Fusion is currently the best entry point for a surrogate loop, because: - its scope is bounded; - its changes are often local; - it consumes signals already produced by `header` and `values`; - it can be retrained much faster if the right caches are put in place. The idea is to freeze, for a given benchmark: - the splits; - the outputs of the `header` branch; - the outputs of the `values` branch; - the target labels; - the base fusion features. Then the `autoresearch` loop only retrains the fusion model, or a small variant of its features, without redoing the full stack on every run. ## What the Surrogate Loop Must Optimise The local metric must not pretend to replace product truth. It must be: - fast; - stable; - sensitive to genuine fusion gains; - sufficiently correlated with stack validation to serve as a filter. In this mode, we optimise a **FusionSurrogateScore** computed from pre-prepared folds. This score should mainly reward: - the right `header/values` combination; - robustness on anonymous columns; - quality on `en_field`; - quality on `tropical_field` and `research_traits` if these examples are present in the cache; - reduction of `statistic.count` false positives on coded columns. ## Caches to Produce Speed depends on this. For each frozen fold, we want to store: - `train_records` - `test_records` - `header` probabilities aligned on all concepts - `values` probabilities aligned on all concepts - base fusion meta-features - target labels - bucket metadata: - `tropical_field` - `research_traits` - `en_field` - `gbif_core_standard` - `anonymous` Recommended format: - `data/cache/ml/fusion_surrogate//fold_*.npz` - plus a `manifest.json` describing: - gold set hash - concept list - protocol - feature version ## Two Surrogate Levels ### 1. Ultra-Fast Surrogate Goal: - explore broadly; - reject quickly; - accept that it is slightly approximate. Recommended content: - a few fixed folds; - critical buckets only; - fusion-only retraining; - no header/values rerun. ### 2. Promotion Surrogate Goal: - confirm the best candidates; - verify that a local fusion gain is not heading in a misleading direction. Recommended content: - same cache, but more folds or more buckets; - still without retraining the entire stack; - higher cost than ultra-fast, but still well below the full `product-score`. ## Recommended Validation Chain The new chain should become: 1. `fusion-surrogate-fast` 2. `fusion-surrogate-mid` 3. `product-score-fast-fast` 4. `product-score-mid` 5. `product-score` 6. `niamoto-score` Interpretation: - the first two levels serve autonomous exploration; - levels 3 to 6 serve promotion of genuine winners. ## Recommended Acceptance Rule A candidate can be: - `candidate` - beats `fusion-surrogate-fast` - `provisional` - beats `fusion-surrogate-mid` - `promotable` - beats `product-score-fast-fast` - `validated` - beats `product-score-mid` - `certified` - beats `product-score` - does not break `niamoto-score` The important point: - the autonomous loop must not wait for `certified` before continuing to run; - it should primarily produce a queue of promising candidates. ## Recommended Initial Scope First recommended work: - `fusion-only` loop Relevant files: - `ml/scripts/train/train_fusion.py` - `src/niamoto/core/imports/ml/classifier.py` - future surrogate cache script Not to include in the first loop: - alias registry - profiler rules - gold set - dashboard - product docs ## Non-Goals This pivot does not seek to: - replace real product validation; - hide global regressions; - remove `product-score` or `niamoto-score`. It seeks to: - make `autoresearch` finally practical; - restore cadence to exploration; - clearly separate exploration from certification. ## Current State The minimal building blocks are now in place: 1. build the `fusion_surrogate` cache via: ```bash uv run python -m ml.scripts.research.build_fusion_surrogate_cache --gold-set ml/data/gold_set.json --splits 3 ``` 2. expose a command such as: ```bash uv run python -m ml.scripts.eval.evaluate --model fusion --metric surrogate-fast ``` 3. also expose: ```bash uv run python -m ml.scripts.eval.evaluate --model fusion --metric surrogate-mid ``` 4. run `autoresearch` only on these metrics 5. promote only the best candidates to full stack validation Local runner added: ```bash uv run python -m ml.scripts.research.run_fusion_surrogate_autoresearch --iterations 50 ``` Behaviour: - computes `surrogate-fast`, `surrogate-mid` baselines - defers `product-score-fast-fast` until the first candidate that passes `surrogate-mid` - launches one `codex` iteration per candidate - evaluates gates itself - reverts losers - automatically commits winners to keep a clean worktree - writes a JSONL log under `.autoresearch/` ## First Measurements On the current gold set: - `fusion_surrogate` cache build (`splits=3`): - approximately `490s` (`~8m10s`) - `surrogate-fast` on warm cache: - approximately `1.66s` - `surrogate-mid` on warm cache: - approximately `1.31s` Reading: - the initial build remains a notable one-shot cost; - however, fusion-only runs are now fast enough to support a real `autoresearch` loop. ## Reference Decision As long as this surrogate loop does not exist, full-stack `autoresearch` remains structurally too slow to deliver its real value.