# Multi-Instance ML Evaluation and Improvement (2026-03-20)

- **Status:** Experiment
- **Audience:** Team, AI agents
- **Purpose:** Summarize the first full evaluation round on multiple real datasets and the corrections applied during that session.
## Context
This session evaluated the ML stack on several real datasets, corrected annotation issues by inspecting actual values, enriched the alias registry, and retrained the models on an enriched gold set.
## Evaluation setup

- Ground truth in `ml/data/eval/annotations/`
- Main scripts: `ml/scripts/eval/evaluate_instance.py`, `ml/scripts/eval/run_eval_suite.py`
- Timestamped JSON results in `ml/data/eval/results/`
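As a rough sketch of what the per-dataset scores mean: role % and concept % are the fraction of annotated columns whose predicted role and concept match the ground truth. The dictionary shapes and role labels below are illustrative assumptions, not the real `evaluate_instance.py` data model.

```python
# Hypothetical scoring sketch: compare predictions to annotations per
# column. The annotation format and role names are assumptions.
def score_columns(annotations: dict, predictions: dict) -> tuple[float, float]:
    """Return (role %, concept %) over the annotated columns."""
    role_hits = concept_hits = 0
    for column, truth in annotations.items():
        pred = predictions.get(column, {})
        role_hits += pred.get("role") == truth["role"]
        concept_hits += pred.get("concept") == truth["concept"]
    n = len(annotations)
    return 100 * role_hits / n, 100 * concept_hits / n

annotations = {
    "dbh": {"role": "measure", "concept": "measurement.diameter"},
    "species": {"role": "dimension", "concept": "taxonomy.species"},
}
predictions = {
    "dbh": {"role": "measure", "concept": "measurement.diameter"},
    "species": {"role": "dimension", "concept": "taxonomy.name"},
}
print(score_columns(annotations, predictions))  # (100.0, 50.0)
```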
## Score progression

### V1 — Initial annotations
| Dataset | Columns | Role % | Concept % |
|---|---|---|---|
| niamoto-nc | 57 | 61.4 | 45.6 |
| niamoto-gb | 27 | 88.9 | 66.7 |
| guyadiv | 61 | 85.2 | 63.9 |
| GBIF NC | 51 | 84.3 | 76.5 |
| GBIF Gabon | 45 | 86.7 | 77.8 |
| GBIF inst. Gabon | 41 | 82.9 | 75.6 |
| silver | 136 | 86.0 | 66.2 |
| TOTAL | 418 | 82.3 | 66.5 |
### V2 — After taxonomy + plot_name correction

Main corrections:

- `taxonomy.name` -> `taxonomy.species` for true binomials
- `plot_name` adjusted based on real values
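The binomial rule above can be sketched as a simple value check: a column annotated `taxonomy.name` whose values are overwhelmingly Latin binomials ("Genus epithet") gets re-annotated as `taxonomy.species`. The regex and the 80% threshold are illustrative assumptions, not the real correction logic.

```python
import re

# Hedged sketch of the V2 rule. BINOMIAL matches "Genus epithet" pairs;
# the threshold below is an assumption for illustration.
BINOMIAL = re.compile(r"^[A-Z][a-z-]+ [a-z][a-z-]+$")

def looks_like_species_column(values: list[str], threshold: float = 0.8) -> bool:
    non_empty = [v for v in values if v]
    if not non_empty:
        return False
    hits = sum(bool(BINOMIAL.match(v)) for v in non_empty)
    return hits / len(non_empty) >= threshold

print(looks_like_species_column(["Araucaria columnaris", "Agathis ovata"]))       # True
print(looks_like_species_column(["Araucariaceae", "Podocarpaceae", "Fabaceae"]))  # False
```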
| Aggregate | Before | After | Delta |
|---|---|---|---|
| TOTAL concept % | 66.5 | 71.1 | +4.6 |
### V3 — After value-level annotation verification

Examples:

- `canopy` / `undercanopy` / `understorey` -> `statistic.count`
- `SPCD` -> `identifier.taxon`
- `Mnemonic` -> `identifier.taxon`
- `Author` / `auth_sp` -> `text.metadata`
- `Vernacular_name` -> `taxonomy.vernacular_name`
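The value-level check behind these corrections can be sketched very simply (this is not the real tooling): dumping a column's most frequent non-empty values makes mis-annotations obvious, e.g. a `canopy` column that actually holds stem counts rather than categories.

```python
from collections import Counter

# Illustrative value inspection helper: top distinct values of a column,
# empty cells excluded.
def summarize_column(values: list[str], top: int = 5) -> list[tuple[str, int]]:
    return Counter(v for v in values if v not in ("", None)).most_common(top)

print(summarize_column(["3", "0", "12", "3", "", "7"]))
```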
| Aggregate | Before | After | Delta |
|---|---|---|---|
| TOTAL concept % | 71.1 | 71.3 | +0.2 |
### V4 — After alias registry enrichment

New alias coverage added for 13 concepts, including:

- `measurement.trait`
- `category.ecology`
- `category.status`
- `category.vegetation`
- `category.method`
- `environment.topography`
- `measurement.canopy`
- `identifier.collection`
- `identifier.institution`
- `location.admin_area`
- `text.observer`
| Dataset | V3 | V4 | Delta |
|---|---|---|---|
| niamoto-nc | 54.4 | 68.4 | +14.0 |
| niamoto-gb | 74.1 | 74.1 | 0 |
| guyadiv | 65.6 | 65.6 | 0 |
| GBIF NC | 82.4 | 90.2 | +7.8 |
| GBIF Gabon | 82.2 | 88.9 | +6.7 |
| GBIF inst. | 80.5 | 85.4 | +4.9 |
| silver | 69.9 | 73.5 | +3.6 |
| TOTAL | 71.3 | 76.6 | +5.3 |
## Gold-set diagnosis

The main diagnosis at that stage was:

- some weak concepts were absent or nearly absent from the gold set
- `measurement.diameter` was overrepresented
- `measurement.basal_area -> measurement.diameter` was a bad taxonomy merge and had to be fixed

Examples of missing or weakly covered concepts then:

- `measurement.trait`
- `category.ecology`
- `environment.topography`
- `text.metadata`
- `measurement.area`
- `category.status`
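The imbalance diagnosis itself is a simple frequency count over the gold-set labels: concepts below some minimum support are flagged as weak. The threshold and labels below are illustrative.

```python
from collections import Counter

# Sketch of the gold-set imbalance check: flag concepts with too few
# labeled examples. min_count is an illustrative assumption.
def weak_concepts(labels: list[str], min_count: int = 2) -> list[str]:
    counts = Counter(labels)
    return sorted(c for c, n in counts.items() if n < min_count)

gold = ["measurement.diameter"] * 5 + ["taxonomy.species"] * 3 + ["measurement.trait"]
print(weak_concepts(gold))  # ['measurement.trait']
```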
## Actions taken

### Gold set enrichment

Added `NC_FULL_OCC_LABELS` and `NC_FULL_PLOTS_LABELS` from the Niamoto New Caledonia instance into `build_gold_set.py`.

Gold set size moved from 2492 to 2525.
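The shape of that enrichment step might look like the following, where each labeled source contributes (header, concept) pairs to the gold set. The constant names mirror the ones added to `build_gold_set.py`, but their contents and the merge function here are invented for illustration.

```python
# Hypothetical labeled sources (contents are illustrative, not the real
# NC annotations).
NC_FULL_OCC_LABELS = {"dbh": "measurement.diameter", "wood_density": "measurement.trait"}
NC_FULL_PLOTS_LABELS = {"slope": "environment.topography"}

def build_gold_set(*sources: dict[str, str]) -> list[tuple[str, str]]:
    # Flatten every source into (header, concept) rows.
    rows: list[tuple[str, str]] = []
    for source in sources:
        rows.extend(sorted(source.items()))
    return rows

gold = build_gold_set(NC_FULL_OCC_LABELS, NC_FULL_PLOTS_LABELS)
print(len(gold))  # 3
```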
### V5 — Retrain on the enriched gold set

| Model | Before | After |
|---|---|---|
| header | 0.7614 | 0.7467 |
| values | 0.3783 | 0.3935 |
| fusion | 0.6899 | 0.6876 |
### V5 — Results after retrain

| Dataset | V4 | V5 | Delta |
|---|---|---|---|
| niamoto-nc | 68.4 | 87.7 | +19.3 |
| niamoto-gb | 74.1 | 66.7 | -7.4 |
| guyadiv | 65.6 | 65.6 | 0 |
| GBIF NC | 90.2 | 90.2 | 0 |
| GBIF Gabon | 88.9 | 88.9 | 0 |
| GBIF inst. | 85.4 | 87.8 | +2.4 |
| silver | 73.5 | 69.1 | -4.4 |
| TOTAL | 76.6 | 77.5 | +0.9 |
Key observations:

- `niamoto-nc` jumped strongly because the model now knew those columns
- some broader datasets dipped slightly because the gold set had shifted and had not yet been rebalanced further
## Session summary

| Step | TOTAL concept % | Delta |
|---|---|---|
| V1 — Initial annotations | 66.5 | — |
| V2 — Taxonomy + plot_name correction | 71.1 | +4.6 |
| V3 — Value verification | 71.3 | +0.2 |
| V4 — Alias enrichment | 76.6 | +5.3 |
| V5 — Gold set + retrain | 77.5 | +0.9 |
| Total gain | — | +11.0 pts |
## ProductScore and GlobalScore after V5

| Metric | Before | After | Delta |
|---|---|---|---|
| ProductScore | 80.04 | 81.82 | +1.78 |
| GlobalScore | 78.6 | 80.79 | +2.19 |
The strongest gains landed in:

- `tropical_field`
- `research_traits`

which matched the families newly enriched in the gold set.
## Other notable outcome

`evaluate.py` was switched from record-by-record fusion feature extraction to the existing batched path, which reduced ProductScore runtime from roughly 14 hours to roughly 42 minutes.
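The nature of that speedup can be illustrated in miniature (this is not `evaluate.py` itself): the per-record path repeats the extractor's fixed setup cost for every record, while the batched path pays it once and processes all records in one pass.

```python
# Toy contrast between batched and record-by-record feature extraction.
def extract_features_batch(records: list[str]) -> list[int]:
    # One setup + one pass over all records.
    vocabulary = {"a": 1, "b": 2}  # stand-in for expensive model setup
    return [vocabulary.get(r, 0) for r in records]

def extract_record_by_record(records: list[str]) -> list[int]:
    # Same output, but the setup cost is repeated for every record.
    return [extract_features_batch([r])[0] for r in records]

records = ["a", "b", "c"]
assert extract_features_batch(records) == extract_record_by_record(records) == [1, 2, 0]
```

The outputs are identical; only the amortization of the fixed cost differs, which is why the switch changed the runtime without changing the scores.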
## Remaining weak areas after V5

- `taxonomy.name`
- `measurement.area`
- `environment.temperature`
- `measurement.biomass`
- `text.metadata`
- `identifier.taxon`
- `category.status`

Systematically wrong columns at that stage included:

- `acceptedTaxonKey`
- `speciesKey`
- `genericName`
- `infraspecificEpithet`
- `scientificNameAuthorship`
## Reproduction commands

```bash
uv run python -m ml.scripts.eval.run_eval_suite
uv run python -m ml.scripts.eval.evaluate_instance \
    --annotations ml/data/eval/annotations/niamoto-nc.yml \
    --data-dir test-instance/niamoto-nc/imports --compare
uv run python -m ml.scripts.data.build_gold_set
uv run python -m ml.scripts.train.train_header_model
uv run python -m ml.scripts.train.train_value_model
uv run python -m ml.scripts.train.train_fusion
```