# Acquisition Wave Retrain and Evaluation (2026-03-21)

Status: Experiment. Audience: Team, AI agents. Purpose: Measure the impact of the SINP 1A + ETS + sPlotOpen acquisition wave after a full rebuild and retrain.
## Context

This run measures the effect of the acquisition wave on the ML detection stack after:

- rebuilding the gold set
- retraining the `header`, `values`, and `fusion` branches
- rerunning both the internal benchmark and the real-dataset suite
## Data integrated before retrain

### Added to the gold set

- TAXREF v18 (`ml/data/silver/taxref/TAXREFv18.txt`)
- ETS `Occurrence_ext.csv`
- ETS `Taxon_ext.csv`
- ETS `Measurement_or_Fact_ext.csv`
- `sPlotOpen_header(3).txt`
- `sPlotOpen_DT(2).txt`
- `sPlotOpen_CWM_CWV(2).txt`
- `sPlotOpen_metadata(2).txt`
### Added at runtime only

- `sinp:` alias block in `column_aliases.yaml`
- `ets:` alias block in `column_aliases.yaml`
- `splot:` alias block in `column_aliases.yaml`
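To make the runtime-only mechanism concrete, here is a minimal sketch of how per-source alias blocks could be merged into a single lookup at load time. The block names (`sinp`, `ets`, `splot`) come from this report; the YAML schema, the canonical concept names, and the `merge_alias_blocks` helper are assumptions for illustration, not the project's actual code.

```python
# Hypothetical sketch: merge per-source alias blocks (as parsed from a file
# like column_aliases.yaml) into one raw-header -> canonical-concept map.

def merge_alias_blocks(*blocks: dict) -> dict:
    """Flatten per-source alias blocks; lookup is case-insensitive on raw names."""
    merged = {}
    for block in blocks:
        for canonical, raw_names in block.items():
            for raw in raw_names:
                merged[raw.lower()] = canonical
    return merged

# Assumed shapes for the three runtime-only blocks (contents are invented).
sinp = {"taxonomy.species": ["nomScientifique"]}
ets = {"measurement.dbh": ["DBH_cm"]}
splot = {"plot.id": ["PlotObservationID"]}

aliases = merge_alias_blocks(sinp, ets, splot)
print(aliases["plotobservationid"])  # -> plot.id
```

Keeping these aliases out of the gold set (runtime only) means they steer inference without biasing the trained branches.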
### Explicitly left out

- OpenObs / SINP: source unavailable
- `species_trait_data.csv`: too specialized / semantically fragile
- PREDICTS: kept outside the critical path
### Resulting gold set

| Metric | Value |
|---|---|
| Labelled columns | 2540 |
| Coarse concepts | 61 |
| Added sources | |
Visible contribution by source (source names lost in extraction):

| Source | Columns |
|---|---|
| | 17 |
| | 9 |
| | 17 |
| | 4 |
| | 33 |
| | 6 |
| | 40 |
| | 15 |
## Commands run

    uv run python -m ml.scripts.data.build_gold_set
    uv run python -m ml.scripts.train.train_header_model
    uv run python -m ml.scripts.train.train_value_model
    uv run python -m ml.scripts.train.train_fusion
    uv run python -m ml.scripts.eval.run_eval_suite
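The five steps above must run in order, and a failure early on invalidates everything downstream. A small wrapper sketch, assuming only the module paths quoted above (the wrapper itself and its `dry_run` flag are hypothetical):

```python
# Hypothetical runner for the retrain pipeline; module paths are taken
# verbatim from the report, everything else is an assumption.
import subprocess

PIPELINE = [
    "ml.scripts.data.build_gold_set",
    "ml.scripts.train.train_header_model",
    "ml.scripts.train.train_value_model",
    "ml.scripts.train.train_fusion",
    "ml.scripts.eval.run_eval_suite",
]

def run_pipeline(dry_run: bool = True) -> list:
    """Build (and optionally execute) the uv commands in order."""
    commands = [["uv", "run", "python", "-m", module] for module in PIPELINE]
    if not dry_run:
        for cmd in commands:
            # check=True aborts the chain on the first failing step
            subprocess.run(cmd, check=True)
    return commands

print(len(run_pipeline()))  # 5 steps
```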
## Training results

| Model | Cross-val macro-F1 | Note |
|---|---|---|
| Header | 0.753 | strongest branch |
| Values | 0.378 | useful signal, still weak on generalization |
| Fusion | 0.639 | stronger than |
Warnings¶
header and fusion triggered ConvergenceWarning messages during retraining.
The models still trained and were saved successfully, but max_iter and/or the
solver deserved a later adjustment.
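One way to keep this from silently recurring is to escalate the warning to an error and retry with a larger iteration budget. A minimal sketch: the `ConvergenceWarning` class below is a stand-in for `sklearn.exceptions.ConvergenceWarning`, and `fit_with_budget` / `toy_fit` are hypothetical helpers, not the project's training code.

```python
# Sketch: stop a retrain from saving an under-converged model by turning
# ConvergenceWarning into an error and retrying with a bigger max_iter.
import warnings

class ConvergenceWarning(UserWarning):
    """Stand-in for sklearn.exceptions.ConvergenceWarning."""

def fit_with_budget(fit, max_iters=(100, 1000, 5000)):
    """Try increasing iteration budgets until fit() completes without warning."""
    for max_iter in max_iters:
        with warnings.catch_warnings():
            warnings.simplefilter("error", ConvergenceWarning)
            try:
                return fit(max_iter), max_iter
            except ConvergenceWarning:
                continue  # not converged: retry with a bigger budget
    raise RuntimeError("model did not converge within the iteration budgets")

# Toy fit: warns unless it gets at least 1000 iterations.
def toy_fit(max_iter):
    if max_iter < 1000:
        warnings.warn("did not converge", ConvergenceWarning)
    return "model"

model, used = fit_with_budget(toy_fit)
print(used)  # 1000
```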
## Evaluation results

Output file: `ml/data/eval/results/20260321_194036.json`
### Internal benchmarks after retrain

| Benchmark | Value |
|---|---|
| ProductScore | 80.8392 |
| GlobalScore / NiamotoOfflineScore | 82.764 |
Detailed ProductScore (bucket names lost in extraction):

| Bucket | Score |
|---|---|
| | 98.511 |
| | 91.018 |
| | 82.672 |
| | 75.093 |
| | 71.621 |
| | 63.634 |
Interpretation:

- the historical holdout metrics remain broadly solid
- they are consistent with the annotated dataset suite
- `anonymous` is now the clearest penalizing bucket
- coded headers and inventory-style exports remain the main generalization ceiling
### Full eval suite (9 datasets, 478 columns)

Dataset names were lost in extraction; per-dataset figures are preserved:

| Dataset | Cols | Role % | Concept % |
|---|---|---|---|
| | 57 | 96.5 | 91.2 |
| | 27 | 100.0 | 100.0 |
| | 61 | 83.6 | 83.6 |
| | 51 | 94.1 | 90.2 |
| | 45 | 91.1 | 88.9 |
| | 41 | 90.2 | 87.8 |
| | 136 | 89.0 | 77.2 |
| | 27 | 100.0 | 100.0 |
| | 33 | 75.8 | 63.6 |
| TOTAL | 478 | 90.4 | 84.7 |
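The TOTAL row is a column-weighted mean of the per-dataset scores. A minimal sketch reproducing it from the `(cols, role %, concept %)` triples in the table (dataset names are omitted because they do not affect the aggregate):

```python
# Reproduce the suite's TOTAL row as a column-weighted average.
ROWS = [
    (57, 96.5, 91.2),
    (27, 100.0, 100.0),
    (61, 83.6, 83.6),
    (51, 94.1, 90.2),
    (45, 91.1, 88.9),
    (41, 90.2, 87.8),
    (136, 89.0, 77.2),
    (27, 100.0, 100.0),
    (33, 75.8, 63.6),
]

def weighted_totals(rows):
    """Weight each dataset's role/concept accuracy by its column count."""
    total_cols = sum(cols for cols, _, _ in rows)
    role = sum(cols * r for cols, r, _ in rows) / total_cols
    concept = sum(cols * c for cols, _, c in rows) / total_cols
    return total_cols, round(role, 1), round(concept, 1)

print(weighted_totals(ROWS))  # (478, 90.4, 84.7)
```

The same computation over the 7 non-acceptance datasets yields the product-oriented aggregate reported below.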
Product-oriented aggregate view (7 datasets, 418 columns)¶
Aggregate |
Role % |
Concept % |
|---|---|---|
Tier 1 + Tier 1b + Silver |
90.9 |
85.4 |
## Interpretation

### What worked well

- `niamoto-gb` stayed at 100%
- `niamoto-nc` rose to 91.2%
- `guyadiv` rose to 83.6%
- the three GBIF datasets closest to the product all landed between 87.8% and 90.2%
The acquisition wave clearly improved the core product datasets and nearby standardized cases.
### What remains weak

The frozen out-of-train benchmark remains dominated by `acceptance-fia-or`:

| Dataset | Concept % |
|---|---|
| | 100.0 |
| `acceptance-fia-or` | 63.6 |
Main remaining errors:

- `measurement.biomass` -> `measurement.volume`
- `identifier.taxon` -> `taxonomy.species`
- `category.habitat` -> (not found)
- FIA-coded headers such as `SPCD`, `CR`, `VOLCFNET`, `VOLBFNET`
- GBIF taxonomic keys such as `acceptedTaxonKey`, `speciesKey`, `genericName`, `infraspecificEpithet`, `scientificNameAuthorship`
## Conclusion

This acquisition wave was worth it.

- Yes, it improved the Niamoto core datasets
- Yes, it stabilized real GBIF-like cases
- No, it did not solve coded inventory generalization yet

The next logical step was not another acquisition wave, but a targeted correction pass on:

- the remaining GBIF taxonomic key columns
- the FIA-coded columns
- `measurement.biomass`, `identifier.taxon`, `category.habitat`, `text.metadata`
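One possible shape for that correction pass is a post-prediction override table keyed on exact raw header names. The header names below come from the error list above; the override mechanism, the chosen target concepts, and `apply_corrections` are assumptions for illustration, not the project's actual design.

```python
# Hypothetical post-prediction correction layer: exact-header overrides
# applied on top of the model's column -> concept predictions.

CORRECTIONS = {
    # GBIF taxonomic keys the model currently mislabels (targets assumed)
    "acceptedTaxonKey": "identifier.taxon",
    "speciesKey": "identifier.taxon",
    # FIA-coded inventory columns (targets assumed)
    "VOLCFNET": "measurement.volume",
    "VOLBFNET": "measurement.volume",
}

def apply_corrections(predictions: dict) -> dict:
    """Overlay hand-written corrections; untouched columns keep the model's label."""
    return {col: CORRECTIONS.get(col, concept) for col, concept in predictions.items()}

preds = {"speciesKey": "taxonomy.species", "plot_id": "plot.id"}
print(apply_corrections(preds))  # speciesKey overridden, plot_id untouched
```

The appeal of this design is that it targets exactly the coded-header failure mode without another costly retrain.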