# Multi-Instance ML Evaluation and Improvement (2026-03-20)

> Status: Experiment
> Audience: Team, AI agents
> Purpose: Summarize the first full evaluation round on multiple real datasets
> and the corrections applied during that session

## Context

This session evaluated the ML stack on several real datasets, corrected
annotation issues by inspecting actual values, enriched the alias registry, and
 retrained the models on an enriched gold set.

## Evaluation setup

- Ground truth in `ml/data/eval/annotations/`
- Main scripts:
  - `ml/scripts/eval/evaluate_instance.py`
  - `ml/scripts/eval/run_eval_suite.py`
- Timestamped JSON results in `ml/data/eval/results/`

## Score progression

### V1 — Initial annotations

| Dataset | Columns | Role % | Concept % |
|---------|:-------:|:------:|:---------:|
| niamoto-nc | 57 | 61.4 | 45.6 |
| niamoto-gb | 27 | 88.9 | 66.7 |
| guyadiv | 61 | 85.2 | 63.9 |
| GBIF NC | 51 | 84.3 | 76.5 |
| GBIF Gabon | 45 | 86.7 | 77.8 |
| GBIF inst. Gabon | 41 | 82.9 | 75.6 |
| silver | 136 | 86.0 | 66.2 |
| **TOTAL** | **418** | **82.3** | **66.5** |

### V2 — After taxonomy + `plot_name` correction

Main correction:

- `taxonomy.name -> taxonomy.species` for true binomials
- `plot_name` adjusted based on real values

| Aggregate | Before | After | Delta |
|-----------|:------:|:-----:|:-----:|
| **TOTAL concept %** | 66.5 | **71.1** | **+4.6** |

### V3 — After value-level annotation verification

Examples:

- `canopy` / `undercanopy` / `understorey` -> `statistic.count`
- `SPCD` -> `identifier.taxon`
- `Mnemonic` -> `identifier.taxon`
- `Author` / `auth_sp` -> `text.metadata`
- `Vernacular_name` -> `taxonomy.vernacular_name`

| Aggregate | Before | After | Delta |
|-----------|:------:|:-----:|:-----:|
| **TOTAL concept %** | 71.1 | **71.3** | **+0.2** |

### V4 — After alias registry enrichment

New alias coverage added for 13 concepts, including:

- `measurement.trait`
- `category.ecology`
- `category.status`
- `category.vegetation`
- `category.method`
- `environment.topography`
- `measurement.canopy`
- `identifier.collection`
- `identifier.institution`
- `location.admin_area`
- `text.observer`

| Dataset | V3 | V4 | Delta |
|---------|:--:|:--:|:-----:|
| niamoto-nc | 54.4 | **68.4** | **+14.0** |
| niamoto-gb | 74.1 | 74.1 | 0 |
| guyadiv | 65.6 | 65.6 | 0 |
| GBIF NC | 82.4 | **90.2** | **+7.8** |
| GBIF Gabon | 82.2 | **88.9** | **+6.7** |
| GBIF inst. | 80.5 | **85.4** | **+4.9** |
| silver | 69.9 | **73.5** | **+3.6** |
| **TOTAL** | **71.3** | **76.6** | **+5.3** |

## Gold-set diagnosis

The main diagnosis at that stage was:

- some weak concepts were absent or nearly absent from the gold set
- `measurement.diameter` was overrepresented
- `measurement.basal_area -> measurement.diameter` was a bad taxonomy merge and had to be fixed

Examples of missing or weakly covered concepts then:

- `measurement.trait`
- `category.ecology`
- `environment.topography`
- `text.metadata`
- `measurement.area`
- `category.status`

## Actions taken

### Gold set enrichment

Added `NC_FULL_OCC_LABELS` and `NC_FULL_PLOTS_LABELS` from the Niamoto New
Caledonia instance into `build_gold_set.py`.

Gold set size moved from `2492` to `2525`.

### V5 — Retrain on the enriched gold set

| Model | Before | After |
|-------|:------:|:-----:|
| header | 0.7614 | 0.7467 |
| values | 0.3783 | 0.3935 |
| fusion | 0.6899 | 0.6876 |

### V5 — Results after retrain

| Dataset | V4 | V5 | Delta |
|---------|:--:|:--:|:-----:|
| niamoto-nc | 68.4 | **87.7** | **+19.3** |
| niamoto-gb | 74.1 | 66.7 | -7.4 |
| guyadiv | 65.6 | 65.6 | 0 |
| GBIF NC | 90.2 | 90.2 | 0 |
| GBIF Gabon | 88.9 | 88.9 | 0 |
| GBIF inst. | 85.4 | **87.8** | +2.4 |
| silver | 73.5 | 69.1 | -4.4 |
| **TOTAL** | **76.6** | **77.5** | **+0.9** |

Key observation:

- `niamoto-nc` jumped strongly because the model now knew those columns
- some broader datasets dipped slightly because the gold set had shifted and had not yet been rebalanced further

## Session summary

| Step | TOTAL concept % | Delta |
|------|:---------------:|:-----:|
| V1 — Initial annotations | 66.5 | — |
| V2 — Taxonomy + `plot_name` correction | 71.1 | +4.6 |
| V3 — Value verification | 71.3 | +0.2 |
| V4 — Alias enrichment | 76.6 | +5.3 |
| V5 — Gold set + retrain | **77.5** | **+0.9** |
| **Total gain** | | **+11.0 pts** |

## ProductScore and GlobalScore after V5

| Metric | Before | After | Delta |
|--------|:------:|:-----:|:-----:|
| **ProductScore** | 80.04 | **81.82** | **+1.78** |
| **GlobalScore** | 78.6 | **80.79** | **+2.19** |

The strongest gains landed in:

- `tropical_field`
- `research_traits`

which matched the families newly enriched in the gold set.

## Other notable outcome

`evaluate.py` was switched from record-by-record fusion feature extraction to
the existing batched path, which reduced `ProductScore` runtime from roughly
14 hours to roughly 42 minutes.

## Remaining weak areas after V5

- `taxonomy.name`
- `measurement.area`
- `environment.temperature`
- `measurement.biomass`
- `text.metadata`
- `identifier.taxon`
- `category.status`

Systematically wrong columns at that stage included:

- `acceptedTaxonKey`
- `speciesKey`
- `genericName`
- `infraspecificEpithet`
- `scientificNameAuthorship`

## Reproduction commands

```bash
uv run python -m ml.scripts.eval.run_eval_suite

uv run python -m ml.scripts.eval.evaluate_instance \
    --annotations ml/data/eval/annotations/niamoto-nc.yml \
    --data-dir test-instance/niamoto-nc/imports --compare

uv run python -m ml.scripts.data.build_gold_set
uv run python -m ml.scripts.train.train_header_model
uv run python -m ml.scripts.train.train_value_model
uv run python -m ml.scripts.train.train_fusion
```