ML Detection — Training & Evaluation Guide¶
Status: Active · Audience: Team, AI agents · Purpose: Operational reference for data, training, evaluation, and improvement cycles
This guide explains how to build the gold set, train the three ML branches, evaluate the stack, and decide what kind of improvement is needed next.
Pipeline overview¶
```
ml/data/silver/ --> build_gold_set.py --> ml/data/gold_set.json
                                                  |
                                     +------------+------------+
                                     |                         |
                                     v                         v
                           train_header_model          train_value_model
                                     |                         |
                                     +------------+------------+
                                                  |
                                                  v
                                            train_fusion
                                                  |
                                                  v
                                         ml/models/*.joblib
                                                  |
                         column_aliases.yaml ---->|
                                                  v
                                  evaluate.py / run_eval_suite.py
                                                  |
                                                  v
                                     ml/data/eval/results/*.json
```
1. Source data¶
Silver data¶
ml/data/silver/ contains real ecological tabular sources used to enrich the
gold set:
- forest inventories
- GBIF exports
- trait datasets
- tropical field datasets
- standards-based tabular sources such as TAXREF, ETS, and sPlotOpen
These files are the raw material for training data construction.
Niamoto instance datasets¶
The tested instance datasets remain important because they represent the actual product target:
- test-instance/niamoto-nc/imports/
- test-instance/niamoto-gb/imports/
Evaluation annotations¶
Independent ground truth lives in ml/data/eval/annotations/.
This is distinct from the gold set:
- gold set = training data
- eval annotations = benchmark data
Do not treat them as interchangeable, even when some columns overlap.
2. Gold set¶
The gold set is the training dataset. Each entry represents one labelled column, with:
- column_name
- concept_coarse
- role
- sampled values
- dataset metadata
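Concretely, one entry can be pictured as a small record like the following. This is an illustrative sketch only: the field names are assumed from the list above, not read from the actual schema.

```python
# Hypothetical shape of one gold-set entry; field names are assumed
# from the fields listed above, not taken from the real gold_set.json.
entry = {
    "column_name": "dbh_cm",
    "concept_coarse": "measurement.diameter",
    "role": "measurement",
    "sampled_values": ["12.5", "30.1", "8.9"],
    "dataset": {"name": "my_dataset", "language": "en"},
}

print(sorted(entry))
```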
Build the gold set¶
```bash
uv run python -m ml.scripts.data.build_gold_set
```
Output:
ml/data/gold_set.json
Add a new source¶
In ml/scripts/data/build_gold_set.py:
1. Define a label dictionary:

```python
MY_LABELS = {
    "dbh": ("measurement.diameter", "measurement"),
    "species": ("taxonomy.species", "taxonomy"),
    "plot_id": ("identifier.plot", "identifier"),
}
```

2. Register the source in `SOURCES`:

```python
{
    "name": "my_dataset",
    "path": ML_ROOT / "data/silver/my_file.csv",
    "labels": MY_LABELS,
    "language": "en",
    "sample_rows": 1000,
}
```

3. Rebuild the gold set.
Concept taxonomy¶
Fine-grained concepts are merged into a coarser training taxonomy through
ml/scripts/data/concept_taxonomy.py.
Example:
- category.phenology -> category.ecology
- measurement.basal_area -> measurement.biomass
Always verify the merge logic before adding new fine concepts, because an incorrect merge can bias the whole stack.
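The merge can be thought of as a plain mapping applied to fine labels before training. A minimal sketch of the idea (the mapping entries mirror the example above; the function name is illustrative, not the actual API of `concept_taxonomy.py`):

```python
# Fine-to-coarse concept merging, sketched as a lookup table.
# Entries mirror the example above; unmapped concepts pass through unchanged.
CONCEPT_MERGES = {
    "category.phenology": "category.ecology",
    "measurement.basal_area": "measurement.biomass",
}

def to_coarse(concept: str) -> str:
    """Map a fine concept to its coarse training concept (identity if unmapped)."""
    return CONCEPT_MERGES.get(concept, concept)

print(to_coarse("measurement.basal_area"))  # measurement.biomass
print(to_coarse("taxonomy.species"))        # taxonomy.species (unchanged)
```

A wrong entry in such a table silently relabels every affected training column, which is why the merge logic deserves review before each new fine concept.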
3. Training¶
All three models train from ml/data/gold_set.json.
Header model¶
```bash
uv run python -m ml.scripts.train.train_header_model
```

- TF-IDF character n-grams + Logistic Regression
- strongest branch when headers are informative
- outputs `ml/models/header_model.joblib`
- local metric: macro-F1 on column names
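To see why character n-grams suit messy headers, here is a sketch of the feature idea only (not the actual training code): near-synonym headers share most of their character trigrams even when tokenization differs.

```python
def char_ngrams(header: str, n: int = 3) -> set:
    """Character n-grams of a lowercased header, the raw material
    behind TF-IDF character features (illustrative, not the real code)."""
    s = header.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

# Naming variants of the same concept keep a large trigram overlap,
# which is what makes the header branch robust.
overlap = char_ngrams("diameter_cm") & char_ngrams("diam_cm")
print(sorted(overlap))  # ['_cm', 'dia', 'iam']
```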
Value model¶
```bash
uv run python -m ml.scripts.train.train_value_model
```

- statistical and pattern features + HistGradientBoosting
- useful for anonymous or ambiguous headers
- outputs `ml/models/value_model.joblib`
- local metric: macro-F1 on value-derived features
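A rough picture of what value-derived features look like; a pure-Python sketch under the assumption that the real feature set includes simple statistics like these (it is certainly richer):

```python
def value_features(values: list) -> dict:
    """Simple statistical features over sampled column values,
    in the spirit of the value model's inputs (illustrative only)."""
    numeric = 0
    for v in values:
        try:
            float(v)
            numeric += 1
        except ValueError:
            pass
    return {
        "numeric_fraction": numeric / len(values),
        "mean_length": sum(len(v) for v in values) / len(values),
        "unique_fraction": len(set(values)) / len(values),
    }

feats = value_features(["12.5", "30.1", "8.9", "12.5"])
print(feats)  # {'numeric_fraction': 1.0, 'mean_length': 3.75, 'unique_fraction': 0.75}
```

Features of this kind let the model classify a column even when its header is `col_17`.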
Fusion model¶
```bash
uv run python -m ml.scripts.train.train_fusion
```

- combines header/value probabilities and meta-features
- outputs `ml/models/fusion_model.joblib`
- evaluated with leak-aware GroupKFold by dataset
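Conceptually, fusion consumes the per-concept probabilities of both branches and produces one combined distribution. A weighted-average sketch of that idea; the real model is a trained combiner, and the fixed weight here is arbitrary:

```python
def fuse(header_probs, value_probs, w_header=0.6):
    """Blend per-concept probabilities from the header and value branches.
    The real fusion model learns this combination; the fixed weight is
    purely illustrative."""
    concepts = set(header_probs) | set(value_probs)
    return {
        c: w_header * header_probs.get(c, 0.0) + (1 - w_header) * value_probs.get(c, 0.0)
        for c in concepts
    }

fused = fuse(
    {"measurement.diameter": 0.9, "identifier.plot": 0.1},
    {"measurement.diameter": 0.7, "taxonomy.species": 0.2},
)
print(max(fused, key=fused.get))  # measurement.diameter
```

Grouping the cross-validation folds by dataset (GroupKFold) keeps all columns of one dataset in the same fold, so the score is not inflated by near-duplicate columns leaking across the train/test split.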
Full retrain¶
```bash
uv run python -m ml.scripts.data.build_gold_set
uv run python -m ml.scripts.train.train_header_model
uv run python -m ml.scripts.train.train_value_model
uv run python -m ml.scripts.train.train_fusion
```
4. Alias registry¶
The alias registry is the high-precision fast path checked before ML.
File:
src/niamoto/core/imports/ml/column_aliases.yaml
Format:
```yaml
concept.subconcept:
  en: [alias1, alias2]
  fr: [alias_fr1, alias_fr2]
  dwc: [darwin_core_name]
```
Add an alias when:
- the header is genuinely unambiguous
- there is no cross-concept ambiguity
- the ML stack repeatedly misses a stable real-world header
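Internally, alias matching is essentially a normalized dictionary lookup. A minimal sketch of the idea; the real `AliasRegistry` reads the YAML file and may normalize headers differently:

```python
import re

# Flattened view of a registry like the YAML above (aliases illustrative).
ALIASES = {
    "dbh": "measurement.diameter",
    "diametre": "measurement.diameter",
}

def normalize(header):
    """Lowercase and drop non-alphanumeric characters before lookup."""
    return re.sub(r"[^a-z0-9]", "", header.lower())

def match(header):
    """Return the concept for a known alias, else None."""
    return ALIASES.get(normalize(header))

print(match("DBH"))          # measurement.diameter
print(match("unknown_col"))  # None
```

Because this path is a plain lookup, a new alias takes effect immediately, with no retraining.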
Quick check:
```bash
uv run python -c "
from niamoto.core.imports.ml.alias_registry import AliasRegistry
reg = AliasRegistry()
print(reg.match('my_column_name'))
"
```
Tests:
```bash
uv run pytest tests/core/imports/test_alias_registry.py -v
```
5. Evaluation¶
Annotated datasets¶
Current benchmark annotations live in ml/data/eval/annotations/.
Typical files:
- niamoto-nc.yml
- niamoto-gb.yml
- guyadiv.yml
- gbif_darwin_core.yml
- silver.yml

The YAML format is `column_name: role.concept`.
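Since each annotation is just a `column_name: role.concept` pair, entries split mechanically. A sketch of that split, assuming the first dotted segment is the role (a real loader would go through a YAML parser):

```python
def parse_annotation(column, label):
    """Split a 'role.concept' label into (column, role, concept).
    Assumes the first dotted segment is the role; illustrative only."""
    role, _, concept = label.partition(".")
    return column, role, concept

print(parse_annotation("dbh_cm", "measurement.diameter"))
```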
Full real-dataset suite¶
```bash
uv run python -m ml.scripts.eval.run_eval_suite
```
This runs the annotated dataset benchmark and writes timestamped JSON files to:
ml/data/eval/results/
Single dataset evaluation¶
```bash
uv run python -m ml.scripts.eval.evaluate_instance \
    --annotations ml/data/eval/annotations/niamoto-nc.yml \
    --data-dir test-instance/niamoto-nc/imports --compare
```
Other common variants:
```bash
uv run python -m ml.scripts.eval.evaluate_instance \
    --annotations ml/data/eval/annotations/gbif_darwin_core.yml \
    --csv ml/data/silver/gbif_targeted/new_caledonia/occurrences.csv

uv run python -m ml.scripts.eval.evaluate_instance \
    --annotations ml/data/eval/annotations/silver.yml \
    --data-dir ml/data/silver
```
Tier-only evaluation¶
```bash
uv run python -m ml.scripts.eval.run_eval_suite --tier 1
uv run python -m ml.scripts.eval.run_eval_suite --tier gbif
uv run python -m ml.scripts.eval.run_eval_suite --tier acceptance
```
Gold-set / holdout evaluation¶
Use evaluate.py for the internal benchmark built from the gold set and
holdout protocol:
```bash
uv run python -m ml.scripts.eval.evaluate --model values --metric macro-f1 --splits 5
uv run python -m ml.scripts.eval.evaluate --model fusion --metric macro-f1 --splits 5
uv run python -m ml.scripts.eval.evaluate --model all --metric product-score --splits 3
uv run python -m ml.scripts.eval.evaluate --model all --metric niamoto-score --splits 3
```
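Macro-F1 averages per-class F1 scores with equal weight, so rare concepts count as much as common ones. A self-contained sketch of the metric itself (not the project's evaluation code):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

print(round(macro_f1(["a", "a", "b"], ["a", "b", "b"]), 3))  # 0.667
```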
6. Improvement cycle¶
After an evaluation pass, identify:
- Weak concepts: low accuracy, possibly absent or underrepresented in the gold set
- Systematically wrong headers: likely alias candidates
- Top confusions: concept A repeatedly predicted as B
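Top confusions can be pulled out of evaluation output with a simple counter. A sketch, assuming you have collected (true, predicted) concept pairs from a results file:

```python
from collections import Counter

def top_confusions(pairs, k=3):
    """Count distinct (true, predicted) pairs where the prediction was wrong,
    most frequent first. `pairs` is illustrative input, not a real results format."""
    wrong = Counter((t, p) for t, p in pairs if t != p)
    return wrong.most_common(k)

pairs = [
    ("measurement.diameter", "measurement.diameter"),
    ("category.ecology", "taxonomy.species"),
    ("category.ecology", "taxonomy.species"),
    ("identifier.plot", "identifier.tree"),
]
print(top_confusions(pairs))
```

A confusion that appears across several datasets usually points at the gold set or the concept merge, not at a single noisy file.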
Choose the action¶
| Diagnosis | Action | Typical impact |
|---|---|---|
| Concept missing from gold set | Add labels in `build_gold_set.py` | Requires rebuild + retrain |
| Stable unambiguous header missed | Add alias in `column_aliases.yaml` | Immediate, no retrain |
| Concept present but confused | Inspect the merge logic in `concept_taxonomy.py` | Rebuild + retrain |
| Evaluation annotation is wrong | Fix the annotation file | Re-run eval only |
| Gold set overrepresentation bias | Rebalance or enrich the data | Retrain |
Verify annotations against real values¶
Before assuming the model is wrong, inspect the actual column values:
```bash
uv run python -c "
import pandas as pd
df = pd.read_csv('path/to/file.csv', nrows=10)
print(df['column_name'].head())
"
```
Header-based assumptions can be misleading if the values tell another story.
Protect benchmark integrity¶
Keep the separation clear:
- Gold set: training material
- Eval annotations: independent benchmark
If the same columns appear in both, interpret the scores carefully and keep the labels aligned.
Quick reference¶
```bash
# Full build -> train -> evaluate
uv run python -m ml.scripts.data.build_gold_set
uv run python -m ml.scripts.train.train_header_model
uv run python -m ml.scripts.train.train_value_model
uv run python -m ml.scripts.train.train_fusion
uv run python -m ml.scripts.eval.run_eval_suite

# Alias-only improvement path
uv run pytest tests/core/imports/test_alias_registry.py -v
uv run python -m ml.scripts.eval.run_eval_suite

# Internal benchmark only
uv run python -m ml.scripts.eval.evaluate --model all --metric niamoto-score --splits 3
```