# Acquisition Wave Retrain and Evaluation (2026-03-21)

> Status: Experiment
> Audience: Team, AI agents
> Purpose: Measure the impact of the `SINP 1A + ETS + sPlotOpen` acquisition
> wave after a full rebuild and retrain

## Context

This run measures the effect of the acquisition wave on the ML detection stack
after:

1. rebuilding the gold set
2. retraining the `header`, `values`, and `fusion` branches
3. rerunning both the internal benchmark and the real-dataset suite

## Data integrated before retrain

### Added to the gold set

- `TAXREF v18` (`ml/data/silver/taxref/TAXREFv18.txt`)
- `ETS Occurrence_ext.csv`
- `ETS Taxon_ext.csv`
- `ETS Measurement_or_Fact_ext.csv`
- `sPlotOpen_header(3).txt`
- `sPlotOpen_DT(2).txt`
- `sPlotOpen_CWM_CWV(2).txt`
- `sPlotOpen_metadata(2).txt`

### Added at runtime only

- `sinp:` alias block in `column_aliases.yaml`
- `ets:` alias block in `column_aliases.yaml`
- `splot:` alias block in `column_aliases.yaml`

### Explicitly left out

- `OpenObs / SINP`: source unavailable
- `species_trait_data.csv`: too specialized / semantically fragile
- `PREDICTS`: kept outside the critical path

## Resulting gold set

| Metric | Value |
|--------|:-----:|
| Labelled columns | **2540** |
| Coarse concepts | **61** |
| Added sources | `taxref_v18`, `ets_*`, `splot_*` |

Visible contribution by source:

| Source | Columns |
|--------|:-------:|
| `taxref_v18` | 17 |
| `ets_occurrence_ext` | 9 |
| `ets_taxon_ext` | 17 |
| `ets_measurement_ext` | 4 |
| `splot_header` | 33 |
| `splot_dt` | 6 |
| `splot_cwm` | 40 |
| `splot_metadata` | 15 |

## Commands run

```bash
uv run python -m ml.scripts.data.build_gold_set
uv run python -m ml.scripts.train.train_header_model
uv run python -m ml.scripts.train.train_value_model
uv run python -m ml.scripts.train.train_fusion
uv run python -m ml.scripts.eval.run_eval_suite
```

## Training results

| Model | Cross-val macro-F1 | Note |
|-------|:------------------:|------|
| Header | **0.753** | strongest branch |
| Values | **0.378** | useful signal, still weak on generalization |
| Fusion | **0.639** | stronger than `values`, still below `header` alone |

### Warnings

`header` and `fusion` triggered `ConvergenceWarning` messages during retraining.
The models still trained and were saved successfully, but `max_iter` and/or the
solver deserved a later adjustment.

## Evaluation results

Output file:

- `ml/data/eval/results/20260321_194036.json`

### Internal benchmarks after retrain

| Benchmark | Value |
|-----------|:-----:|
| **ProductScore** | **80.8392** |
| **GlobalScore / NiamotoOfflineScore** | **82.764** |

Detailed `ProductScore`:

| Bucket | Score |
|--------|:-----:|
| `gbif_core_standard` | 98.511 |
| `gbif_extended` | 91.018 |
| `en_field` | 82.672 |
| `tropical_field` | 75.093 |
| `research_traits` | 71.621 |
| `anonymous` | 63.634 |

Interpretation:

- the historical holdout metrics remain broadly solid
- they are consistent with the annotated dataset suite
- `anonymous` is now the clearest penalizing bucket
- coded headers and inventory-style exports remain the main generalization ceiling

### Full eval suite (9 datasets, 478 columns)

| Dataset | Cols | Role % | Concept % |
|---------|:----:|:------:|:---------:|
| `niamoto-nc` | 57 | 96.5 | **91.2** |
| `niamoto-gb` | 27 | 100.0 | **100.0** |
| `guyadiv` | 61 | 83.6 | **83.6** |
| `gbif-nc` | 51 | 94.1 | **90.2** |
| `gbif-gabon` | 45 | 91.1 | **88.9** |
| `gbif-inst-gabon` | 41 | 90.2 | **87.8** |
| `silver` | 136 | 89.0 | **77.2** |
| `acceptance-niamoto-gb` | 27 | 100.0 | **100.0** |
| `acceptance-fia-or` | 33 | 75.8 | **63.6** |
| **TOTAL** | **478** | **90.4** | **84.7** |

### Product-oriented aggregate view (7 datasets, 418 columns)

| Aggregate | Role % | Concept % |
|-----------|:------:|:---------:|
| Tier 1 + Tier 1b + Silver | **90.9** | **85.4** |

## Interpretation

### What worked well

- `niamoto-gb` stayed at **100%**
- `niamoto-nc` rose to **91.2%**
- `guyadiv` rose to **83.6%**
- the three GBIF datasets closest to the product all landed between **87.8%** and **90.2%**

The acquisition wave clearly improved the core product datasets and nearby
standardized cases.

### What remains weak

The frozen out-of-train benchmark remains dominated by `acceptance-fia-or`:

| Dataset | Concept % |
|---------|:---------:|
| `acceptance-niamoto-gb` | 100.0 |
| `acceptance-fia-or` | **63.6** |

Main remaining errors:

- `measurement.biomass -> measurement.volume`
- `identifier.taxon -> taxonomy.species`
- `category.habitat -> (not found)`
- FIA-coded headers such as `SPCD`, `CR`, `VOLCFNET`, `VOLBFNET`
- GBIF taxonomic keys such as `acceptedTaxonKey`, `speciesKey`,
  `genericName`, `infraspecificEpithet`, `scientificNameAuthorship`

## Conclusion

This acquisition wave was worth it.

- Yes, it improved the Niamoto core datasets
- Yes, it stabilized real GBIF-like cases
- No, it did not solve coded inventory generalization yet

The next logical step was not another acquisition wave, but a targeted
correction pass on:

1. the remaining GBIF taxonomic key columns
2. the FIA-coded columns
3. `measurement.biomass`, `identifier.taxon`, `category.habitat`,
   `text.metadata`