Acquisition Wave Retrain and Evaluation (2026-03-21)

Status: Experiment
Audience: Team, AI agents
Purpose: Measure the impact of the SINP 1A + ETS + sPlotOpen acquisition wave after a full rebuild and retrain

Context

This run measures the effect of the acquisition wave on the ML detection stack after:

  1. rebuilding the gold set

  2. retraining the header, values, and fusion branches

  3. rerunning both the internal benchmark and the real-dataset suite

Data integrated before retrain

Added to the gold set

  • TAXREF v18 (ml/data/silver/taxref/TAXREFv18.txt)

  • ETS Occurrence_ext.csv

  • ETS Taxon_ext.csv

  • ETS Measurement_or_Fact_ext.csv

  • sPlotOpen_header(3).txt

  • sPlotOpen_DT(2).txt

  • sPlotOpen_CWM_CWV(2).txt

  • sPlotOpen_metadata(2).txt

Added at runtime only

  • sinp: alias block in column_aliases.yaml

  • ets: alias block in column_aliases.yaml

  • splot: alias block in column_aliases.yaml
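
The three alias blocks live in column_aliases.yaml and are applied at runtime only. The report does not show the file's actual schema, so the sketch below is hypothetical: the block names match the bullets above, but every column/concept pair is an illustrative assumption.

```yaml
# Hypothetical alias blocks -- the real column_aliases.yaml schema may differ
sinp:
  cdNom: taxonomy.taxon_id          # assumed SINP column -> concept mapping
  nomCite: taxonomy.scientific_name
ets:
  TAXON_ID: taxonomy.taxon_id
splot:
  PlotObservationID: identifier.plot
```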

Explicitly left out

  • OpenObs / SINP: source unavailable

  • species_trait_data.csv: too specialized / semantically fragile

  • PREDICTS: kept outside the critical path

Resulting gold set

Metric             Value
------             -----
Labelled columns   2540
Coarse concepts    61
Added sources      taxref_v18, ets_*, splot_*

Visible contribution by source:

Source                Columns
------                -------
taxref_v18            17
ets_occurrence_ext    9
ets_taxon_ext         17
ets_measurement_ext   4
splot_header          33
splot_dt              6
splot_cwm             40
splot_metadata        15

Commands run

uv run python -m ml.scripts.data.build_gold_set
uv run python -m ml.scripts.train.train_header_model
uv run python -m ml.scripts.train.train_value_model
uv run python -m ml.scripts.train.train_fusion
uv run python -m ml.scripts.eval.run_eval_suite

Training results

Model    Cross-val macro-F1   Note
-----    ------------------   ----
Header   0.753                strongest branch
Values   0.378                useful signal, still weak on generalization
Fusion   0.639                stronger than values, still below header alone
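
Macro-F1, the metric reported here, is the unweighted mean of per-class F1 scores, so rare concepts weigh as much as frequent ones. A self-contained sketch (the labels are made up for illustration):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Toy gold labels vs predictions over three concept classes
y_true = ["taxon", "taxon", "plot", "plot", "trait", "trait"]
y_pred = ["taxon", "plot", "plot", "plot", "trait", "taxon"]
print(round(macro_f1(y_true, y_pred), 3))  # → 0.656
```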

Warnings

The header and fusion branches emitted ConvergenceWarning messages during retraining. Both models still trained and were saved successfully, but max_iter and/or the solver should be revisited later.
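
Assuming the branches wrap scikit-learn linear classifiers (an assumption; the report does not show the implementation), the standard remedies are raising max_iter and/or switching solvers:

```python
from sklearn.linear_model import LogisticRegression

# Assumption: the header/fusion branches use a logistic-regression-style model.
# Raising max_iter (sklearn's default is 100) and/or choosing the "saga" solver
# typically removes ConvergenceWarning without changing the model family.
clf = LogisticRegression(max_iter=2000, solver="saga")
```

Standardizing the input features before fitting also tends to help these solvers converge.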

Evaluation results

Output file:

  • ml/data/eval/results/20260321_194036.json

Internal benchmarks after retrain

Benchmark                           Value
---------                           -----
ProductScore                        80.8392
GlobalScore / NiamotoOfflineScore   82.764

Detailed ProductScore:

Bucket               Score
------               -----
gbif_core_standard   98.511
gbif_extended        91.018
en_field             82.672
tropical_field       75.093
research_traits      71.621
anonymous            63.634

Interpretation:

  • the historical holdout metrics remain broadly solid

  • they are consistent with the annotated dataset suite

  • the anonymous bucket now drags the score down the most

  • coded headers and inventory-style exports remain the main generalization ceiling

Full eval suite (9 datasets, 478 columns)

Dataset                 Cols   Role %   Concept %
-------                 ----   ------   ---------
niamoto-nc              57     96.5     91.2
niamoto-gb              27     100.0    100.0
guyadiv                 61     83.6     83.6
gbif-nc                 51     94.1     90.2
gbif-gabon              45     91.1     88.9
gbif-inst-gabon         41     90.2     87.8
silver                  136    89.0     77.2
acceptance-niamoto-gb   27     100.0    100.0
acceptance-fia-or       33     75.8     63.6
TOTAL                   478    90.4     84.7
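
The TOTAL row is the column-weighted average of the per-dataset scores, which can be re-derived from the table:

```python
# (cols, role %, concept %) per dataset, copied from the table above
rows = [
    (57, 96.5, 91.2),    # niamoto-nc
    (27, 100.0, 100.0),  # niamoto-gb
    (61, 83.6, 83.6),    # guyadiv
    (51, 94.1, 90.2),    # gbif-nc
    (45, 91.1, 88.9),    # gbif-gabon
    (41, 90.2, 87.8),    # gbif-inst-gabon
    (136, 89.0, 77.2),   # silver
    (27, 100.0, 100.0),  # acceptance-niamoto-gb
    (33, 75.8, 63.6),    # acceptance-fia-or
]
total = sum(c for c, _, _ in rows)
role = sum(c * r for c, r, _ in rows) / total
concept = sum(c * k for c, _, k in rows) / total
print(total, round(role, 1), round(concept, 1))  # → 478 90.4 84.7
```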

Product-oriented aggregate view (7 datasets, 418 columns)

Aggregate                   Role %   Concept %
---------                   ------   ---------
Tier 1 + Tier 1b + Silver   90.9     85.4
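
This aggregate covers the seven non-acceptance datasets (478 − 27 − 33 = 418 columns); the same column-weighted average reproduces both percentages:

```python
# The 7 non-acceptance datasets from the suite table: (cols, role %, concept %)
rows = [
    (57, 96.5, 91.2), (27, 100.0, 100.0), (61, 83.6, 83.6),
    (51, 94.1, 90.2), (45, 91.1, 88.9), (41, 90.2, 87.8),
    (136, 89.0, 77.2),
]
cols = sum(c for c, _, _ in rows)
role = sum(c * r for c, r, _ in rows) / cols
concept = sum(c * k for c, _, k in rows) / cols
print(cols, round(role, 1), round(concept, 1))  # → 418 90.9 85.4
```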

Interpretation

What worked well

  • niamoto-gb stayed at 100%

  • niamoto-nc rose to 91.2%

  • guyadiv rose to 83.6%

  • the three GBIF datasets closest to the product all landed between 87.8% and 90.2%

The acquisition wave clearly improved the core product datasets and nearby standardized cases.

What remains weak

The frozen out-of-train benchmark remains dominated by acceptance-fia-or:

Dataset                 Concept %
-------                 ---------
acceptance-niamoto-gb   100.0
acceptance-fia-or       63.6

Main remaining errors:

  • measurement.biomass -> measurement.volume

  • identifier.taxon -> taxonomy.species

  • category.habitat -> (not found)

  • FIA-coded headers such as SPCD, CR, VOLCFNET, VOLBFNET

  • GBIF taxonomic keys such as acceptedTaxonKey, speciesKey, genericName, infraspecificEpithet, scientificNameAuthorship
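
Error patterns like these are easiest to track as a tally of (gold, predicted) concept pairs. A minimal sketch over hypothetical records (the real pairs would come from the eval output, whose format is not documented here):

```python
from collections import Counter

# Hypothetical (gold, predicted) pairs; None marks "concept not found"
pairs = [
    ("measurement.biomass", "measurement.volume"),
    ("measurement.biomass", "measurement.volume"),
    ("identifier.taxon", "taxonomy.species"),
    ("category.habitat", None),
    ("taxonomy.species", "taxonomy.species"),  # correct pairs are filtered out
]
confusions = Counter((g, p) for g, p in pairs if p != g)
for (gold, pred), n in confusions.most_common():
    print(f"{gold} -> {pred}: {n}")
```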

Conclusion

This acquisition wave was worth it.

  • Yes, it improved the Niamoto core datasets

  • Yes, it stabilized real GBIF-like cases

  • No, it did not solve coded inventory generalization yet

The next logical step is not another acquisition wave, but a targeted correction pass on:

  1. the remaining GBIF taxonomic key columns

  2. the FIA-coded columns

  3. measurement.biomass, identifier.taxon, category.habitat, text.metadata