ML Detection — Training & Evaluation Guide¶
Status: Active · Audience: Team, AI agents · Purpose: Operational reference for data, training, evaluation, and improvement cycles
This guide explains how to build the gold set, train the three ML branches, evaluate the stack, and decide what kind of improvement is needed next.
Pipeline overview¶
```
ml/data/silver/ --> build_gold_set.py --> ml/data/gold_set.json
                                                  |
                                     +------------+------------+
                                     |                         |
                                     v                         v
                           train_header_model          train_value_model
                                     |                         |
                                     +------------+------------+
                                                  |
                                                  v
                                            train_fusion
                                                  |
                                                  v
                                         ml/models/*.joblib
                                                  |
                         column_aliases.yaml ---->|
                                                  v
                                  evaluate.py / run_eval_suite.py
                                                  |
                                                  v
                                     ml/data/eval/results/*.json
```
1. Source data¶
Silver data¶
ml/data/silver/ contains real ecological tabular sources used to enrich the
gold set:
- forest inventories
- GBIF exports
- trait datasets
- tropical field datasets
- standards-based tabular sources such as TAXREF, ETS, and sPlotOpen
These files are the raw material for training data construction.
Niamoto instance datasets¶
The tested instance datasets remain important because they represent the actual product target:
- test-instance/niamoto-nc/imports/
- test-instance/niamoto-gb/imports/
Evaluation annotations¶
Independent ground truth lives in ml/data/eval/annotations/.
This is distinct from the gold set:
- gold set = training data
- eval annotations = benchmark data
Do not treat them as interchangeable, even when some columns overlap.
2. Gold set¶
The gold set is the training dataset. Each entry represents one labelled column, with:
- column_name
- concept_coarse
- role
- sampled values
- dataset metadata
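Concretely, one entry can be pictured as a small record like the following. This is an illustrative sketch only: the field names are assumed from the list above, not read from the actual schema.

```python
# Hypothetical shape of one gold-set entry; field names are assumed
# from the fields listed above, not taken from the real gold_set.json.
entry = {
    "column_name": "dbh_cm",
    "concept_coarse": "measurement.diameter",
    "role": "measurement",
    "sampled_values": ["12.5", "30.1", "8.9"],
    "dataset": {"name": "my_dataset", "language": "en"},
}

print(sorted(entry))
```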
Build the gold set¶
```bash
uv run python -m ml.scripts.data.build_gold_set
```
Output:
ml/data/gold_set.json
Add a new source¶
In ml/scripts/data/build_gold_set.py:
1. Define a label dictionary:

```python
MY_LABELS = {
    "dbh": ("measurement.diameter", "measurement"),
    "species": ("taxonomy.species", "taxonomy"),
    "plot_id": ("identifier.plot", "identifier"),
}
```

2. Register the source in `SOURCES`:

```python
{
    "name": "my_dataset",
    "path": ML_ROOT / "data/silver/my_file.csv",
    "labels": MY_LABELS,
    "language": "en",
    "sample_rows": 1000,
}
```

3. Rebuild the gold set.
Concept taxonomy¶
Fine-grained concepts are merged into a coarser training taxonomy through
ml/scripts/data/concept_taxonomy.py.
Example:
- category.phenology -> category.ecology
- measurement.basal_area -> measurement.biomass
Always verify the merge logic before adding new fine concepts, because an incorrect merge can bias the whole stack.
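The merge can be thought of as a plain mapping applied to fine labels before training. A minimal sketch of the idea (the mapping entries mirror the example above; the function name is illustrative, not the actual API of `concept_taxonomy.py`):

```python
# Fine-to-coarse concept merging, sketched as a lookup table.
# Entries mirror the example above; unmapped concepts pass through unchanged.
CONCEPT_MERGES = {
    "category.phenology": "category.ecology",
    "measurement.basal_area": "measurement.biomass",
}

def to_coarse(concept: str) -> str:
    """Map a fine concept to its coarse training concept (identity if unmapped)."""
    return CONCEPT_MERGES.get(concept, concept)

print(to_coarse("measurement.basal_area"))  # measurement.biomass
print(to_coarse("taxonomy.species"))        # taxonomy.species (unchanged)
```

A wrong entry in such a table silently relabels every affected training column, which is why the merge logic deserves review before each new fine concept.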
3. Training¶
All three models train from ml/data/gold_set.json.
Header model¶
```bash
uv run python -m ml.scripts.train.train_header_model
```

- TF-IDF character n-grams + Logistic Regression
- strongest branch when headers are informative
- outputs `ml/models/header_model.joblib`
- local metric: macro-F1 on column names
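To see why character n-grams suit messy headers, here is a sketch of the feature idea only (not the actual training code): near-synonym headers share most of their character trigrams even when tokenization differs.

```python
def char_ngrams(header: str, n: int = 3) -> set:
    """Character n-grams of a lowercased header, the raw material
    behind TF-IDF character features (illustrative, not the real code)."""
    s = header.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

# Naming variants of the same concept keep a large trigram overlap,
# which is what makes the header branch robust.
overlap = char_ngrams("diameter_cm") & char_ngrams("diam_cm")
print(sorted(overlap))  # ['_cm', 'dia', 'iam']
```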
Value model¶
```bash
uv run python -m ml.scripts.train.train_value_model
```

- statistical and pattern features + HistGradientBoosting
- useful for anonymous or ambiguous headers
- outputs `ml/models/value_model.joblib`
- local metric: macro-F1 on value-derived features
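A rough picture of what value-derived features look like; a pure-Python sketch under the assumption that the real feature set includes simple statistics like these (it is certainly richer):

```python
def value_features(values: list) -> dict:
    """Simple statistical features over sampled column values,
    in the spirit of the value model's inputs (illustrative only)."""
    numeric = 0
    for v in values:
        try:
            float(v)
            numeric += 1
        except ValueError:
            pass
    return {
        "numeric_fraction": numeric / len(values),
        "mean_length": sum(len(v) for v in values) / len(values),
        "unique_fraction": len(set(values)) / len(values),
    }

feats = value_features(["12.5", "30.1", "8.9", "12.5"])
print(feats)  # {'numeric_fraction': 1.0, 'mean_length': 3.75, 'unique_fraction': 0.75}
```

Features of this kind let the model classify a column even when its header is `col_17`.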
Fusion model¶
```bash
uv run python -m ml.scripts.train.train_fusion
```

- combines header/value probabilities and meta-features
- outputs `ml/models/fusion_model.joblib`
- evaluated with leak-aware GroupKFold by dataset
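Conceptually, fusion consumes the per-concept probabilities of both branches and produces one combined distribution. A weighted-average sketch of that idea; the real model is a trained combiner, and the fixed weight here is arbitrary:

```python
def fuse(header_probs, value_probs, w_header=0.6):
    """Blend per-concept probabilities from the header and value branches.
    The real fusion model learns this combination; the fixed weight is
    purely illustrative."""
    concepts = set(header_probs) | set(value_probs)
    return {
        c: w_header * header_probs.get(c, 0.0) + (1 - w_header) * value_probs.get(c, 0.0)
        for c in concepts
    }

fused = fuse(
    {"measurement.diameter": 0.9, "identifier.plot": 0.1},
    {"measurement.diameter": 0.7, "taxonomy.species": 0.2},
)
print(max(fused, key=fused.get))  # measurement.diameter
```

Grouping the cross-validation folds by dataset (GroupKFold) keeps all columns of one dataset in the same fold, so the score is not inflated by near-duplicate columns leaking across the train/test split.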
Full retrain¶
```bash
uv run python -m ml.scripts.data.build_gold_set
uv run python -m ml.scripts.train.train_header_model
uv run python -m ml.scripts.train.train_value_model
uv run python -m ml.scripts.train.train_fusion
```
4. Alias registry¶
The alias registry is the high-precision fast path checked before ML.
File:
src/niamoto/core/imports/ml/column_aliases.yaml
Format:
```yaml
concept.subconcept:
  en: [alias1, alias2]
  fr: [alias_fr1, alias_fr2]
  dwc: [darwin_core_name]
```
Add an alias when:
- the header is genuinely unambiguous
- there is no cross-concept ambiguity
- the ML stack repeatedly misses a stable real-world header
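Internally, alias matching is essentially a normalized dictionary lookup. A minimal sketch of the idea; the real `AliasRegistry` reads the YAML file and may normalize headers differently:

```python
import re

# Flattened view of a registry like the YAML above (aliases illustrative).
ALIASES = {
    "dbh": "measurement.diameter",
    "diametre": "measurement.diameter",
}

def normalize(header):
    """Lowercase and drop non-alphanumeric characters before lookup."""
    return re.sub(r"[^a-z0-9]", "", header.lower())

def match(header):
    """Return the concept for a known alias, else None."""
    return ALIASES.get(normalize(header))

print(match("DBH"))          # measurement.diameter
print(match("unknown_col"))  # None
```

Because this path is a plain lookup, a new alias takes effect immediately, with no retraining.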
Quick check:
```bash
uv run python -c "
from niamoto.core.imports.ml.alias_registry import AliasRegistry
reg = AliasRegistry()
print(reg.match('my_column_name'))
"
```
Tests:
```bash
uv run pytest tests/core/imports/test_alias_registry.py -v
```
5. Evaluation¶
Annotated datasets¶
Current benchmark annotations live in ml/data/eval/annotations/.
Typical files:
- niamoto-nc.yml
- niamoto-gb.yml
- guyadiv.yml
- gbif_darwin_core.yml
- silver.yml

The YAML format is `column_name: role.concept`.
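Since each annotation is just a `column_name: role.concept` pair, entries split mechanically. A sketch of that split, assuming the first dotted segment is the role (a real loader would go through a YAML parser):

```python
def parse_annotation(column, label):
    """Split a 'role.concept' label into (column, role, concept).
    Assumes the first dotted segment is the role; illustrative only."""
    role, _, concept = label.partition(".")
    return column, role, concept

print(parse_annotation("dbh_cm", "measurement.diameter"))
```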
Full real-dataset suite¶
```bash
uv run python -m ml.scripts.eval.run_eval_suite
```
This runs the annotated dataset benchmark and writes timestamped JSON files to:
ml/data/eval/results/
Single dataset evaluation¶
```bash
uv run python -m ml.scripts.eval.evaluate_instance \
    --annotations ml/data/eval/annotations/niamoto-nc.yml \
    --data-dir test-instance/niamoto-nc/imports --compare
```
Other common variants:
```bash
uv run python -m ml.scripts.eval.evaluate_instance \
    --annotations ml/data/eval/annotations/gbif_darwin_core.yml \
    --csv ml/data/silver/gbif_targeted/new_caledonia/occurrences.csv

uv run python -m ml.scripts.eval.evaluate_instance \
    --annotations ml/data/eval/annotations/silver.yml \
    --data-dir ml/data/silver
```
Tier-only evaluation¶
```bash
uv run python -m ml.scripts.eval.run_eval_suite --tier 1
uv run python -m ml.scripts.eval.run_eval_suite --tier gbif
uv run python -m ml.scripts.eval.run_eval_suite --tier acceptance
```
Gold-set / holdout evaluation¶
Use evaluate.py for the internal benchmark built from the gold set and
holdout protocol:
```bash
uv run python -m ml.scripts.eval.evaluate --model values --metric macro-f1 --splits 5
uv run python -m ml.scripts.eval.evaluate --model fusion --metric macro-f1 --splits 5
uv run python -m ml.scripts.eval.evaluate --model all --metric product-score --splits 3
uv run python -m ml.scripts.eval.evaluate --model all --metric niamoto-score --splits 3
```
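Macro-F1 averages per-class F1 scores with equal weight, so rare concepts count as much as common ones. A self-contained sketch of the metric itself (not the project's evaluation code):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

print(round(macro_f1(["a", "a", "b"], ["a", "b", "b"]), 3))  # 0.667
```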
6. Improvement cycle¶
After an evaluation pass, identify:
- Weak concepts: low accuracy, possibly absent or underrepresented in the gold set
- Systematically wrong headers: likely alias candidates
- Top confusions: concept A repeatedly predicted as B
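Top confusions can be pulled out of evaluation output with a simple counter. A sketch, assuming you have collected (true, predicted) concept pairs from a results file:

```python
from collections import Counter

def top_confusions(pairs, k=3):
    """Count distinct (true, predicted) pairs where the prediction was wrong,
    most frequent first. `pairs` is illustrative input, not a real results format."""
    wrong = Counter((t, p) for t, p in pairs if t != p)
    return wrong.most_common(k)

pairs = [
    ("measurement.diameter", "measurement.diameter"),
    ("category.ecology", "taxonomy.species"),
    ("category.ecology", "taxonomy.species"),
    ("identifier.plot", "identifier.tree"),
]
print(top_confusions(pairs))
```

A confusion that appears across several datasets usually points at the gold set or the concept merge, not at a single noisy file.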
Choose the action¶
| Diagnosis | Action | Typical impact |
|---|---|---|
| Concept missing from gold set | Add labels in `build_gold_set.py` | Requires rebuild + retrain |
| Stable unambiguous header missed | Add alias in `column_aliases.yaml` | Immediate, no retrain |
| Concept present but confused | Inspect the merge logic in `concept_taxonomy.py` | Rebuild + retrain |
| Evaluation annotation is wrong | Fix the annotation file | Re-run eval only |
| Gold set overrepresentation bias | Rebalance or enrich the data | Retrain |
Verify annotations against real values¶
Before assuming the model is wrong, inspect the actual column values:
```bash
uv run python -c "
import pandas as pd
df = pd.read_csv('path/to/file.csv', nrows=10)
print(df['column_name'].head())
"
```
Header-based assumptions can be misleading if the values tell another story.
Protect benchmark integrity¶
Keep the separation clear:
- Gold set: training material
- Eval annotations: independent benchmark
If the same columns appear in both, interpret the scores carefully and keep the labels aligned.
Quick reference¶
```bash
# Full build -> train -> evaluate
uv run python -m ml.scripts.data.build_gold_set
uv run python -m ml.scripts.train.train_header_model
uv run python -m ml.scripts.train.train_value_model
uv run python -m ml.scripts.train.train_fusion
uv run python -m ml.scripts.eval.run_eval_suite

# Alias-only improvement path
uv run pytest tests/core/imports/test_alias_registry.py -v
uv run python -m ml.scripts.eval.run_eval_suite

# Internal benchmark only
uv run python -m ml.scripts.eval.evaluate --model all --metric niamoto-score --splits 3
```