Multi-Instance ML Evaluation and Improvement (2026-03-20)

Status: Experiment
Audience: Team, AI agents
Purpose: Summarize the first full evaluation round on multiple real datasets and the corrections applied during that session

Context

This session evaluated the ML stack on several real datasets, corrected annotation issues by inspecting actual values, enriched the alias registry, and retrained the models on an enriched gold set.

Evaluation setup

  • Ground truth in ml/data/eval/annotations/

  • Main scripts:

    • ml/scripts/eval/evaluate_instance.py

    • ml/scripts/eval/run_eval_suite.py

  • Timestamped JSON results in ml/data/eval/results/
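As a rough illustration of what the Role % and Concept % figures reported in this document measure, the sketch below computes both as plain accuracy over the annotated columns. This is an assumption: the dict layout (column name mapped to a role and a concept), the role names, and the function name score_instance are hypothetical, and the real annotation format and scoring logic of evaluate_instance.py are not shown here.

```python
from typing import Dict

# Hypothetical layout: column name -> {"role": ..., "concept": ...}
Labels = Dict[str, Dict[str, str]]


def score_instance(gold: Labels, predicted: Labels) -> Dict[str, float]:
    """Return role and concept accuracy (in %) over the annotated columns."""
    if not gold:
        raise ValueError("no annotated columns")
    role_hits = 0
    concept_hits = 0
    for column, truth in gold.items():
        guess = predicted.get(column, {})  # a missing prediction counts as wrong
        role_hits += guess.get("role") == truth["role"]
        concept_hits += guess.get("concept") == truth["concept"]
    n = len(gold)
    return {
        "role_pct": round(100 * role_hits / n, 1),
        "concept_pct": round(100 * concept_hits / n, 1),
    }


# Illustrative data only; column names and role names are made up.
gold = {
    "species": {"role": "taxonomy", "concept": "taxonomy.species"},
    "dbh": {"role": "measurement", "concept": "measurement.diameter"},
    "province": {"role": "location", "concept": "location.admin_area"},
    "auth_sp": {"role": "text", "concept": "text.metadata"},
}
predicted = {
    "species": {"role": "taxonomy", "concept": "taxonomy.name"},  # role ok, concept wrong
    "dbh": {"role": "measurement", "concept": "measurement.diameter"},
    "province": {"role": "location", "concept": "location.admin_area"},
    "auth_sp": {"role": "taxonomy", "concept": "taxonomy.name"},  # both wrong
}
scores = score_instance(gold, predicted)
print(scores)
```

Under this reading, a column can count toward Role % while still missing on Concept %, which would explain why Role % is the higher of the two in every row of the V1 table.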

Score progression

V1 — Initial annotations

| Dataset | Columns | Role % | Concept % |
|---|---|---|---|
| niamoto-nc | 57 | 61.4 | 45.6 |
| niamoto-gb | 27 | 88.9 | 66.7 |
| guyadiv | 61 | 85.2 | 63.9 |
| GBIF NC | 51 | 84.3 | 76.5 |
| GBIF Gabon | 45 | 86.7 | 77.8 |
| GBIF inst. Gabon | 41 | 82.9 | 75.6 |
| silver | 136 | 86.0 | 66.2 |
| TOTAL | 418 | 82.3 | 66.5 |
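The TOTAL row of the V1 table is consistent with a column-weighted mean of the per-dataset percentages rather than a simple average of the seven rows. The short check below reproduces it from the table values; reading TOTAL as a weighted mean is an inference from the numbers, not something the evaluation scripts are confirmed to do.

```python
# (dataset, annotated columns, role %, concept %) copied from the V1 table
V1_ROWS = [
    ("niamoto-nc", 57, 61.4, 45.6),
    ("niamoto-gb", 27, 88.9, 66.7),
    ("guyadiv", 61, 85.2, 63.9),
    ("GBIF NC", 51, 84.3, 76.5),
    ("GBIF Gabon", 45, 86.7, 77.8),
    ("GBIF inst. Gabon", 41, 82.9, 75.6),
    ("silver", 136, 86.0, 66.2),
]


def weighted_total(rows, pct_index):
    """Mean of one percentage column, weighted by the number of columns per dataset."""
    total_columns = sum(row[1] for row in rows)
    weighted_sum = sum(row[1] * row[pct_index] for row in rows)
    return round(weighted_sum / total_columns, 1)


total_columns = sum(row[1] for row in V1_ROWS)
role_total = weighted_total(V1_ROWS, 2)
concept_total = weighted_total(V1_ROWS, 3)
print(total_columns, role_total, concept_total)  # 418 82.3 66.5
```

Because silver contributes 136 of the 418 columns, its score moves the total roughly three times as much as any single smaller dataset.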

V2 — After taxonomy + plot_name correction

Main corrections:

  • taxonomy.name -> taxonomy.species for true binomials

  • plot_name adjusted based on real values
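One way to decide whether a column annotated taxonomy.name really holds true binomials is to measure how many of its values look like "Genus epithet". The sketch below is a heuristic written for illustration; the regular expression, the 0.9 threshold, and the function names are assumptions, and the session itself corrected annotations by inspecting values manually.

```python
import re
from typing import Iterable

# Capitalised genus followed by a lower-case specific epithet, e.g. "Agathis ovata".
# Illustrative heuristic only: it ignores hybrids, authorship and infraspecific ranks.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z][a-z-]+$")


def binomial_fraction(values: Iterable[str]) -> float:
    """Share of non-empty values that match the binomial pattern."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return 0.0
    hits = sum(1 for v in cleaned if BINOMIAL.match(v))
    return hits / len(cleaned)


def suggest_taxonomy_concept(values: Iterable[str], threshold: float = 0.9) -> str:
    """Suggest taxonomy.species when nearly all values are binomials."""
    if binomial_fraction(values) >= threshold:
        return "taxonomy.species"
    return "taxonomy.name"


species_like = ["Agathis ovata", "Araucaria columnaris", "Nothofagus aequilateralis"]
mixed_ranks = ["Araucariaceae", "Agathis", "Agathis ovata"]
print(suggest_taxonomy_concept(species_like))
print(suggest_taxonomy_concept(mixed_ranks))
```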

| Aggregate | Before | After | Delta |
|---|---|---|---|
| TOTAL concept % | 66.5 | 71.1 | +4.6 |

V3 — After value-level annotation verification

Examples:

  • canopy / undercanopy / understorey -> statistic.count

  • SPCD -> identifier.taxon

  • Mnemonic -> identifier.taxon

  • Author / auth_sp -> text.metadata

  • Vernacular_name -> taxonomy.vernacular_name
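Value-level verification of this kind can be supported by a small profile of each column's values, so that an annotator sees at a glance whether a column holds counts, codes, or free text. The helper below is a sketch under assumptions: the function name, the returned fields, and the sample values are all hypothetical and not part of the project's scripts.

```python
from typing import Dict, Iterable


def _parses_as(text: str, caster) -> bool:
    """True when caster(text) succeeds (caster is int or float)."""
    try:
        caster(text)
        return True
    except ValueError:
        return False


def profile_values(values: Iterable[object], sample_size: int = 5) -> Dict[str, object]:
    """Summarise a column's non-empty values to help check its annotation."""
    cleaned = [str(v).strip() for v in values if v is not None and str(v).strip()]
    n = len(cleaned)
    distinct = sorted(set(cleaned))
    return {
        "non_empty": n,
        "distinct": len(distinct),
        "integer_fraction": sum(_parses_as(v, int) for v in cleaned) / n if n else 0.0,
        "numeric_fraction": sum(_parses_as(v, float) for v in cleaned) / n if n else 0.0,
        "sample": distinct[:sample_size],
    }


# Made-up values for illustration.
canopy_profile = profile_values(["12", "7", "0", "31", ""])
author_profile = profile_values(["Guillaumin", "(Lamb.) Hook.", None])
print(canopy_profile)
print(author_profile)
```

A column whose values are all integers fits a count-like concept such as statistic.count, while a column with no numeric values at all points to a text or category concept.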

| Aggregate | Before | After | Delta |
|---|---|---|---|
| TOTAL concept % | 71.1 | 71.3 | +0.2 |

V4 — After alias registry enrichment

New alias coverage added for 13 concepts, including:

  • measurement.trait

  • category.ecology

  • category.status

  • category.vegetation

  • category.method

  • environment.topography

  • measurement.canopy

  • identifier.collection

  • identifier.institution

  • location.admin_area

  • text.observer
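To make the idea of alias coverage concrete, the sketch below shows one possible shape for an alias registry: each concept lists known header spellings, and lookups go through a normalised form of the header. The registry layout, the normalisation rule, and the specific aliases are assumptions for illustration; the project's actual registry format is not described in this document.

```python
import re
from typing import Dict, List, Optional

# Hypothetical registry: concept -> known header aliases (illustrative entries only).
ALIASES: Dict[str, List[str]] = {
    "identifier.institution": ["institutionCode", "institution_code"],
    "identifier.collection": ["collectionCode", "collection_code"],
    "location.admin_area": ["stateProvince", "province"],
    "text.observer": ["recordedBy", "observer"],
}


def normalize(header: str) -> str:
    """Lower-case the header and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", header.lower())


# Invert the registry once so each lookup is a single dict access.
LOOKUP: Dict[str, str] = {
    normalize(alias): concept
    for concept, aliases in ALIASES.items()
    for alias in aliases
}


def concept_for(header: str) -> Optional[str]:
    """Return the concept registered for this header, or None when it is unknown."""
    return LOOKUP.get(normalize(header))


print(concept_for("Institution Code"))
print(concept_for("recorded_by"))
print(concept_for("dbh"))
```

With a registry of this kind, adding aliases helps only datasets whose headers actually use them, which fits the V4 pattern of strong gains on the GBIF datasets and niamoto-nc and no change on niamoto-gb or guyadiv.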

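| Dataset | V3 | V4 | Delta |
|---|---|---|---|
| niamoto-nc | 54.4 | 68.4 | +14.0 |
| niamoto-gb | 74.1 | 74.1 | 0 |
| guyadiv | 65.6 | 65.6 | 0 |
| GBIF NC | 82.4 | 90.2 | +7.8 |
| GBIF Gabon | 82.2 | 88.9 | +6.7 |
| GBIF inst. | 80.5 | 85.4 | +4.9 |
| silver | 69.9 | 73.5 | +3.6 |
| TOTAL | 71.3 | 76.6 | +5.3 |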

Gold-set diagnosis

The main diagnosis at that stage was:

  • some weak concepts were absent or nearly absent from the gold set

  • measurement.diameter was overrepresented

  • measurement.basal_area -> measurement.diameter was a bad taxonomy merge and had to be fixed

Examples of concepts that were missing or weakly covered at that point:

  • measurement.trait

  • category.ecology

  • environment.topography

  • text.metadata

  • measurement.area

  • category.status
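A diagnosis like this can be sketched as a simple count over the gold-set labels: flag expected concepts with too few examples and concepts that take too large a share. The thresholds, the flat list-of-labels input, and the example counts below are hypothetical; they do not reflect the real gold set.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def diagnose_balance(
    labels: Iterable[str],
    expected: Iterable[str],
    min_count: int = 5,
    max_share: float = 0.25,
) -> Tuple[List[str], List[str]]:
    """Return (weakly covered concepts, overrepresented concepts)."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Counter returns 0 for concepts that never appear, so absent ones are flagged too.
    weak = sorted(c for c in expected if counts[c] < min_count)
    over = sorted(c for c, n in counts.items() if total and n / total > max_share)
    return weak, over


# Made-up label distribution for illustration (100 labels in total).
labels = (
    ["measurement.diameter"] * 50
    + ["taxonomy.species"] * 20
    + ["location.admin_area"] * 20
    + ["identifier.taxon"] * 8
    + ["measurement.trait"] * 2
)
expected = ["measurement.trait", "category.ecology", "taxonomy.species"]
weak, over = diagnose_balance(labels, expected)
print("weak:", weak)
print("over:", over)
```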

Actions taken

Gold set enrichment

Added NC_FULL_OCC_LABELS and NC_FULL_PLOTS_LABELS from the Niamoto New Caledonia instance into build_gold_set.py.

Gold set size grew from 2492 to 2525 (+33).

V5 — Retrain on the enriched gold set

| Model | Before | After |
|---|---|---|
| header | 0.7614 | 0.7467 |
| values | 0.3783 | 0.3935 |
| fusion | 0.6899 | 0.6876 |

V5 — Results after retrain

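| Dataset | V4 | V5 | Delta |
|---|---|---|---|
| niamoto-nc | 68.4 | 87.7 | +19.3 |
| niamoto-gb | 74.1 | 66.7 | -7.4 |
| guyadiv | 65.6 | 65.6 | 0 |
| GBIF NC | 90.2 | 90.2 | 0 |
| GBIF Gabon | 88.9 | 88.9 | 0 |
| GBIF inst. | 85.4 | 87.8 | +2.4 |
| silver | 73.5 | 69.1 | -4.4 |
| TOTAL | 76.6 | 77.5 | +0.9 |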

Key observations:

  • niamoto-nc jumped strongly (+19.3) because the model now knew those columns

  • two broader datasets dipped, niamoto-gb (-7.4) and silver (-4.4), because the gold set had shifted and had not yet been rebalanced further
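Regressions like these are easy to miss when only the total is tracked. The sketch below recomputes the per-dataset deltas from the V4 and V5 concept scores in the table above and lists the datasets that went down; the function names and the tolerance parameter are illustrative and are not part of run_eval_suite.py.

```python
from typing import Dict, List

# Concept % per dataset, copied from the V4 -> V5 table.
V4 = {
    "niamoto-nc": 68.4, "niamoto-gb": 74.1, "guyadiv": 65.6, "GBIF NC": 90.2,
    "GBIF Gabon": 88.9, "GBIF inst.": 85.4, "silver": 73.5,
}
V5 = {
    "niamoto-nc": 87.7, "niamoto-gb": 66.7, "guyadiv": 65.6, "GBIF NC": 90.2,
    "GBIF Gabon": 88.9, "GBIF inst.": 87.8, "silver": 69.1,
}


def deltas(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Per-dataset change in percentage points, rounded to one decimal."""
    return {name: round(after[name] - before[name], 1) for name in before}


def regressions(before: Dict[str, float], after: Dict[str, float],
                tolerance: float = 0.0) -> List[str]:
    """Datasets whose score dropped by more than `tolerance` points."""
    return sorted(name for name, d in deltas(before, after).items() if d < -tolerance)


changes = deltas(V4, V5)
dropped = regressions(V4, V5)
print(changes)
print(dropped)  # ['niamoto-gb', 'silver']
```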

Session summary

| Step | TOTAL concept % | Delta |
|---|---|---|
| V1: Initial annotations | 66.5 | |
| V2: Taxonomy + plot_name correction | 71.1 | +4.6 |
| V3: Value verification | 71.3 | +0.2 |
| V4: Alias enrichment | 76.6 | +5.3 |
| V5: Gold set + retrain | 77.5 | +0.9 |
| Total gain | | +11.0 pts |

ProductScore and GlobalScore after V5

| Metric | Before | After | Delta |
|---|---|---|---|
| ProductScore | 80.04 | 81.82 | +1.78 |
| GlobalScore | 78.6 | 80.79 | +2.19 |

The strongest gains landed in tropical_field and research_traits, which matched the families newly enriched in the gold set.

Other notable outcome

evaluate.py was switched from record-by-record fusion feature extraction to the existing batched path, which reduced ProductScore runtime from roughly 14 hours to roughly 42 minutes.
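That is roughly a 20x reduction (840 minutes down to 42). The sketch below illustrates the general pattern behind such a change: calling a batch-capable extractor once per chunk instead of once per record, so that fixed per-call overhead is paid far less often. The function names, the batch size, and the stand-in extractor are assumptions for illustration and are not taken from evaluate.py.

```python
from typing import Callable, Iterator, List, Sequence


def chunks(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_one_by_one(records: Sequence, extract_batch: Callable) -> List:
    """Record-by-record path: one extractor call per record."""
    return [extract_batch([record])[0] for record in records]


def extract_batched(records: Sequence, extract_batch: Callable,
                    batch_size: int = 256) -> List:
    """Batched path: one extractor call per chunk of records."""
    features: List = []
    for batch in chunks(records, batch_size):
        features.extend(extract_batch(batch))
    return features


class CountingExtractor:
    """Stand-in for a real feature extractor; it only counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, batch: Sequence) -> List[int]:
        self.calls += 1
        return [len(str(record)) for record in batch]


records = list(range(1000))
slow, fast = CountingExtractor(), CountingExtractor()
slow_features = extract_one_by_one(records, slow)
fast_features = extract_batched(records, fast)
print(slow.calls, fast.calls)  # 1000 4
```

Both paths return the same features; only the number of extractor calls differs, which is where the saving comes from when each call carries a fixed cost.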

Remaining weak areas after V5

  • taxonomy.name

  • measurement.area

  • environment.temperature

  • measurement.biomass

  • text.metadata

  • identifier.taxon

  • category.status

Systematically wrong columns at that stage included:

  • acceptedTaxonKey

  • speciesKey

  • genericName

  • infraspecificEpithet

  • scientificNameAuthorship
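Lists like the two above can be derived from a per-concept breakdown of the evaluation output. The sketch below assumes the results are available as (gold concept, predicted concept) pairs, which is a hypothetical format; the example pairs are made up and are not actual results.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def per_concept_accuracy(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, float, int]]:
    """Return (concept, accuracy, support) sorted from weakest to strongest."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for gold, predicted in pairs:
        totals[gold] += 1
        hits[gold] += gold == predicted
    rows = [(concept, hits[concept] / totals[concept], totals[concept]) for concept in totals]
    # Sort by accuracy first, then by name so ties come out in a stable order.
    return sorted(rows, key=lambda row: (row[1], row[0]))


# Illustrative pairs only.
pairs = [
    ("identifier.taxon", "taxonomy.species"),
    ("identifier.taxon", "identifier.taxon"),
    ("taxonomy.species", "taxonomy.species"),
    ("text.metadata", "taxonomy.name"),
]
ranking = per_concept_accuracy(pairs)
for concept, accuracy, support in ranking:
    print(f"{concept:20s} {accuracy:.2f} (n={support})")
```

Reporting the support next to each accuracy helps separate concepts that are genuinely hard from concepts that simply have very few annotated columns.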

Reproduction commands

# Run the full evaluation suite
uv run python -m ml.scripts.eval.run_eval_suite

# Evaluate a single instance (here niamoto-nc)
uv run python -m ml.scripts.eval.evaluate_instance \
    --annotations ml/data/eval/annotations/niamoto-nc.yml \
    --data-dir test-instance/niamoto-nc/imports --compare

# Rebuild the gold set, then retrain the header, value and fusion models
uv run python -m ml.scripts.data.build_gold_set
uv run python -m ml.scripts.train.train_header_model
uv run python -m ml.scripts.train.train_value_model
uv run python -m ml.scripts.train.train_fusion