Multi-Instance ML Evaluation and Improvement (2026-03-20)¶

Status: Experiment Audience: Team, AI agents Purpose: Summarize the first full evaluation round on multiple real datasets and the corrections applied during that session

Context¶

This session evaluated the ML stack on several real datasets, corrected annotation issues by inspecting actual values, enriched the alias registry, and retrained the models on an enriched gold set.

Evaluation setup¶

Ground truth in ml/data/eval/annotations/
Main scripts:
- ml/scripts/eval/evaluate_instance.py
- ml/scripts/eval/run_eval_suite.py
Timestamped JSON results in ml/data/eval/results/

Score progression¶

V1 — Initial annotations¶

Dataset	Columns	Role %	Concept %
niamoto-nc	57	61.4	45.6
niamoto-gb	27	88.9	66.7
guyadiv	61	85.2	63.9
GBIF NC	51	84.3	76.5
GBIF Gabon	45	86.7	77.8
GBIF inst. Gabon	41	82.9	75.6
silver	136	86.0	66.2
TOTAL	418	82.3	66.5

V2 — After taxonomy + `plot_name` correction¶

Main correction:

taxonomy.name -> taxonomy.species for true binomials
plot_name adjusted based on real values

Aggregate	Before	After	Delta
TOTAL concept %	66.5	71.1	+4.6

V3 — After value-level annotation verification¶

Examples:

canopy / undercanopy / understorey -> statistic.count
SPCD -> identifier.taxon
Mnemonic -> identifier.taxon
Author / auth_sp -> text.metadata
Vernacular_name -> taxonomy.vernacular_name

Aggregate	Before	After	Delta
TOTAL concept %	71.1	71.3	+0.2

V4 — After alias registry enrichment¶

New alias coverage added for 13 concepts, including:

measurement.trait
category.ecology
category.status
category.vegetation
category.method
environment.topography
measurement.canopy
identifier.collection
identifier.institution
location.admin_area
text.observer

Dataset	V3	V4	Delta
niamoto-nc	54.4	68.4	+14.0
niamoto-gb	74.1	74.1	0
guyadiv	65.6	65.6	0
GBIF NC	82.4	90.2	+7.8
GBIF Gabon	82.2	88.9	+6.7
GBIF inst.	80.5	85.4	+4.9
silver	69.9	73.5	+3.6
TOTAL	71.3	76.6	+5.3

Gold-set diagnosis¶

The main diagnosis at that stage was:

some weak concepts were absent or nearly absent from the gold set
measurement.diameter was overrepresented
measurement.basal_area -> measurement.diameter was a bad taxonomy merge and had to be fixed

Examples of missing or weakly covered concepts then:

measurement.trait
category.ecology
environment.topography
text.metadata
measurement.area
category.status

Actions taken¶

Gold set enrichment¶

Added NC_FULL_OCC_LABELS and NC_FULL_PLOTS_LABELS from the Niamoto New Caledonia instance into build_gold_set.py.

Gold set size moved from 2492 to 2525.

V5 — Retrain on the enriched gold set¶

Model	Before	After
header	0.7614	0.7467
values	0.3783	0.3935
fusion	0.6899	0.6876

V5 — Results after retrain¶

Dataset	V4	V5	Delta
niamoto-nc	68.4	87.7	+19.3
niamoto-gb	74.1	66.7	-7.4
guyadiv	65.6	65.6	0
GBIF NC	90.2	90.2	0
GBIF Gabon	88.9	88.9	0
GBIF inst.	85.4	87.8	+2.4
silver	73.5	69.1	-4.4
TOTAL	76.6	77.5	+0.9

Key observation:

niamoto-nc jumped strongly because the model now knew those columns
some broader datasets dipped slightly because the gold set had shifted and had not yet been rebalanced further

Session summary¶

Step	TOTAL concept %	Delta
V1 — Initial annotations	66.5	—
V2 — Taxonomy + `plot_name` correction	71.1	+4.6
V3 — Value verification	71.3	+0.2
V4 — Alias enrichment	76.6	+5.3
V5 — Gold set + retrain	77.5	+0.9
Total gain		+11.0 pts

ProductScore and GlobalScore after V5¶

Metric	Before	After	Delta
ProductScore	80.04	81.82	+1.78
GlobalScore	78.6	80.79	+2.19

The strongest gains landed in:

tropical_field
research_traits

which matched the families newly enriched in the gold set.

Other notable outcome¶

evaluate.py was switched from record-by-record fusion feature extraction to the existing batched path, which reduced ProductScore runtime from roughly 14 hours to roughly 42 minutes.

Remaining weak areas after V5¶

taxonomy.name
measurement.area
environment.temperature
measurement.biomass
text.metadata
identifier.taxon
category.status

Systematically wrong columns at that stage included:

acceptedTaxonKey
speciesKey
genericName
infraspecificEpithet
scientificNameAuthorship

Reproduction commands¶

uv run python -m ml.scripts.eval.run_eval_suite

uv run python -m ml.scripts.eval.evaluate_instance \
    --annotations ml/data/eval/annotations/niamoto-nc.yml \
    --data-dir test-instance/niamoto-nc/imports --compare

uv run python -m ml.scripts.data.build_gold_set
uv run python -m ml.scripts.train.train_header_model
uv run python -m ml.scripts.train.train_value_model
uv run python -m ml.scripts.train.train_fusion