# ML Detection: Training & Evaluation Guide

> Status: Active
> Audience: Team, AI agents
> Purpose: Operational reference for data, training, evaluation, and
> improvement cycles

This guide explains how to build the gold set, train the three ML branches, evaluate the stack, and decide what kind of improvement is needed next.

## Pipeline overview

```text
ml/data/silver/ -> build_gold_set.py -> ml/data/gold_set.json
                                                  |
                                           +------+------+
                                           |             |
                                           v             v
                                 train_header_model  train_value_model
                                           |             |
                                           +------+------+
                                                  |
                                                  v
                                            train_fusion
                                                  |
                                                  v
                                         ml/models/*.joblib
                                                  |
                          column_aliases.yaml --->|
                                                  v
                                   evaluate.py / run_eval_suite.py
                                                  |
                                                  v
                                     ml/data/eval/results/*.json
```

## 1. Source data

### Silver data

`ml/data/silver/` contains real ecological tabular sources used to enrich the gold set:

- forest inventories
- GBIF exports
- trait datasets
- tropical field datasets
- standards-based tabular sources such as TAXREF, ETS, and sPlotOpen

These files are the raw material for training data construction.

### Niamoto instance datasets

The tested instance datasets remain important because they represent the actual product target:

- `test-instance/niamoto-nc/imports/`
- `test-instance/niamoto-gb/imports/`

### Evaluation annotations

Independent ground truth lives in `ml/data/eval/annotations/`.

This is distinct from the gold set:

- **gold set** = training data
- **eval annotations** = benchmark data

Do not treat them as interchangeable, even when some columns overlap.

## 2. Gold set

The gold set is the training dataset. Each entry represents one labelled column, with:

- `column_name`
- `concept_coarse`
- `role`
- sampled values
- dataset metadata

### Build the gold set

```bash
uv run python -m ml.scripts.data.build_gold_set
```

Output:

- `ml/data/gold_set.json`

### Add a new source

In `ml/scripts/data/build_gold_set.py`:

1. Define a label dictionary:

```python
MY_LABELS = {
    "dbh": ("measurement.diameter", "measurement"),
    "species": ("taxonomy.species", "taxonomy"),
    "plot_id": ("identifier.plot", "identifier"),
}
```

2. Register the source in `SOURCES`:

```python
{
    "name": "my_dataset",
    "path": ML_ROOT / "data/silver/my_file.csv",
    "labels": MY_LABELS,
    "language": "en",
    "sample_rows": 1000,
}
```

3. Rebuild the gold set.

### Concept taxonomy

Fine-grained concepts are merged into a coarser training taxonomy through `ml/scripts/data/concept_taxonomy.py`.

Example:

- `category.phenology` -> `category.ecology`
- `measurement.basal_area` -> `measurement.biomass`

Always verify the merge logic before adding new fine concepts, because an incorrect merge can bias the whole stack.

## 3. Training

All three models train from `ml/data/gold_set.json`.

### Header model

```bash
uv run python -m ml.scripts.train.train_header_model
```

- TF-IDF character n-grams + Logistic Regression
- strongest branch when headers are informative
- outputs `ml/models/header_model.joblib`
- local metric: macro-F1 on column names

### Value model

```bash
uv run python -m ml.scripts.train.train_value_model
```

- statistical and pattern features + HistGradientBoosting
- useful for anonymous or ambiguous headers
- outputs `ml/models/value_model.joblib`
- local metric: macro-F1 on value-derived features

### Fusion model

```bash
uv run python -m ml.scripts.train.train_fusion
```

- combines header/value probabilities and meta-features
- outputs `ml/models/fusion_model.joblib`
- evaluated with leak-aware GroupKFold by dataset

### Full retrain

```bash
uv run python -m ml.scripts.data.build_gold_set
uv run python -m ml.scripts.train.train_header_model
uv run python -m ml.scripts.train.train_value_model
uv run python -m ml.scripts.train.train_fusion
```

## 4. Alias registry

The alias registry is the high-precision fast path checked before ML.

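
The "checked before ML" ordering can be pictured with a minimal sketch. It assumes a flat lookup from normalized header to concept and a placeholder ML predictor. `ALIASES`, `normalize_header`, `detect_concept`, and `fake_ml_predict` are hypothetical names used for illustration only; they are not the `AliasRegistry` API, whose matching rules may differ.

```python
from typing import Callable, Tuple

# Hypothetical alias table: normalized header -> concept.
# The real registry is loaded from column_aliases.yaml.
ALIASES = {
    "dbh": "measurement.diameter",
    "plot_id": "identifier.plot",
}


def normalize_header(header: str) -> str:
    """Trim, lowercase, and map spaces and hyphens to underscores."""
    return header.strip().lower().replace(" ", "_").replace("-", "_")


def detect_concept(header: str, ml_predict: Callable[[str], str]) -> Tuple[str, str]:
    """Return (concept, source). Aliases win; ML is only the fallback."""
    key = normalize_header(header)
    if key in ALIASES:
        return ALIASES[key], "alias"
    return ml_predict(header), "ml"


def fake_ml_predict(header: str) -> str:
    """Stand-in for the trained stack (assumption: always predicts a species)."""
    return "taxonomy.species"


print(detect_concept(" DBH ", fake_ml_predict))            # ('measurement.diameter', 'alias')
print(detect_concept("scientific name", fake_ml_predict))  # ('taxonomy.species', 'ml')
```

Because an alias hit bypasses the models entirely, a wrong alias is never corrected downstream, which is why aliases should be reserved for unambiguous headers.
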
File:

- `src/niamoto/core/imports/ml/column_aliases.yaml`

Format:

```yaml
concept.subconcept:
  en: [alias1, alias2]
  fr: [alias_fr1, alias_fr2]
  dwc: [darwin_core_name]
```

Add an alias when:

- the header is genuinely unambiguous
- there is no cross-concept ambiguity
- the ML stack repeatedly misses a stable real-world header

Quick check:

```bash
uv run python -c "
from niamoto.core.imports.ml.alias_registry import AliasRegistry
reg = AliasRegistry()
print(reg.match('my_column_name'))
"
```

Tests:

```bash
uv run pytest tests/core/imports/test_alias_registry.py -v
```

## 5. Evaluation

### Annotated datasets

Current benchmark annotations live in `ml/data/eval/annotations/`.

Typical files:

- `niamoto-nc.yml`
- `niamoto-gb.yml`
- `guyadiv.yml`
- `gbif_darwin_core.yml`
- `silver.yml`

The YAML format is `column_name: role.concept`.

### Full real-dataset suite

```bash
uv run python -m ml.scripts.eval.run_eval_suite
```

This runs the annotated dataset benchmark and writes timestamped JSON files to:

- `ml/data/eval/results/`

### Single dataset evaluation

```bash
uv run python -m ml.scripts.eval.evaluate_instance \
  --annotations ml/data/eval/annotations/niamoto-nc.yml \
  --data-dir test-instance/niamoto-nc/imports --compare
```

Other common variants:

```bash
uv run python -m ml.scripts.eval.evaluate_instance \
  --annotations ml/data/eval/annotations/gbif_darwin_core.yml \
  --csv ml/data/silver/gbif_targeted/new_caledonia/occurrences.csv

uv run python -m ml.scripts.eval.evaluate_instance \
  --annotations ml/data/eval/annotations/silver.yml \
  --data-dir ml/data/silver
```

### Tier-only evaluation

```bash
uv run python -m ml.scripts.eval.run_eval_suite --tier 1
uv run python -m ml.scripts.eval.run_eval_suite --tier gbif
uv run python -m ml.scripts.eval.run_eval_suite --tier acceptance
```

### Gold-set / holdout evaluation

Use `evaluate.py` for the internal benchmark built from the gold set and holdout protocol:

```bash
uv run python -m ml.scripts.eval.evaluate --model values --metric macro-f1 --splits 5
uv run python -m ml.scripts.eval.evaluate --model fusion --metric macro-f1 --splits 5
uv run python -m ml.scripts.eval.evaluate --model all --metric product-score --splits 3
uv run python -m ml.scripts.eval.evaluate --model all --metric niamoto-score --splits 3
```

## 6. Improvement cycle

After an evaluation pass, identify:

1. **Weak concepts**: low accuracy, possibly absent or underrepresented in the gold set
2. **Systematically wrong headers**: likely alias candidates
3. **Top confusions**: concept A repeatedly predicted as B

### Choose the action

| Diagnosis | Action | Typical impact |
|-----------|--------|----------------|
| Concept missing from gold set | Add labels in `build_gold_set.py` | Requires rebuild + retrain |
| Stable unambiguous header missed | Add alias in `column_aliases.yaml` | Immediate, no retrain |
| Concept present but confused | Inspect `concept_taxonomy.py` or feature space | Rebuild + retrain |
| Evaluation annotation is wrong | Fix `ml/data/eval/annotations/` | Re-run eval only |
| Gold set overrepresentation bias | Rebalance or enrich the data | Retrain |

### Verify annotations against real values

Before assuming the model is wrong, inspect the actual column values:

```bash
uv run python -c "
import pandas as pd
df = pd.read_csv('path/to/file.csv', nrows=10)
print(df['column_name'].head())
"
```

Header-based assumptions can be misleading if the values tell another story.

### Protect benchmark integrity

Keep the separation clear:

- **Gold set**: training material
- **Eval annotations**: independent benchmark

If the same columns appear in both, interpret the scores carefully and keep the labels aligned.

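
One way to check that overlapping labels stay aligned is sketched below. It assumes the gold set is a list of entries carrying `column_name` and `concept_coarse`, that annotations are a plain `column_name -> label` mapping, and that both sides use the same label vocabulary. The helper name `find_label_conflicts` and the sample data are hypothetical. If the annotations use fine-grained concepts, they would need to go through the taxonomy merge before comparison.

```python
from typing import Dict, List, Tuple


def find_label_conflicts(
    gold_entries: List[Dict[str, str]],
    annotations: Dict[str, str],
) -> Tuple[List[str], List[Tuple[str, str, str]]]:
    """Return (shared columns, conflicts) between gold set and eval annotations.

    Each conflict is (column, gold_label, eval_label).
    Assumption: column names are unique within the gold set; if a name
    repeats across datasets, only the last entry is compared here.
    """
    gold_labels = {e["column_name"]: e["concept_coarse"] for e in gold_entries}
    shared = sorted(set(gold_labels) & set(annotations))
    conflicts = [
        (col, gold_labels[col], annotations[col])
        for col in shared
        if gold_labels[col] != annotations[col]
    ]
    return shared, conflicts


# Illustrative data only; the mismatch on "species" is deliberate.
gold = [
    {"column_name": "dbh", "concept_coarse": "measurement.diameter"},
    {"column_name": "species", "concept_coarse": "taxonomy.species"},
    {"column_name": "plot_id", "concept_coarse": "identifier.plot"},
]
eval_annotations = {
    "dbh": "measurement.diameter",
    "species": "identifier.plot",
}

shared, conflicts = find_label_conflicts(gold, eval_annotations)
print("shared columns:", shared)
print("conflicts:", conflicts)
```

A non-empty `shared` list means benchmark scores on those columns partly reflect training exposure; a non-empty `conflicts` list means one of the two label sources needs fixing before the scores can be trusted.
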
## Quick reference

```bash
# Full build -> train -> evaluate
uv run python -m ml.scripts.data.build_gold_set
uv run python -m ml.scripts.train.train_header_model
uv run python -m ml.scripts.train.train_value_model
uv run python -m ml.scripts.train.train_fusion
uv run python -m ml.scripts.eval.run_eval_suite

# Alias-only improvement path
uv run pytest tests/core/imports/test_alias_registry.py -v
uv run python -m ml.scripts.eval.run_eval_suite

# Internal benchmark only
uv run python -m ml.scripts.eval.evaluate --model all --metric niamoto-score --splits 3
```
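
The full build -> train -> evaluate sequence can also be chained from Python so that a failing step stops the cycle before later steps run on stale artifacts. This is a minimal sketch, not a script shipped with the project: `FULL_CYCLE`, `build_commands`, and `run_cycle` are hypothetical names, and the module paths are the ones listed above.

```python
import subprocess
from typing import Callable, List, Sequence

PREFIX = ["uv", "run", "python", "-m"]

# Module order matters: the gold set feeds the header and value models,
# and the fusion model is trained after both.
FULL_CYCLE = [
    "ml.scripts.data.build_gold_set",
    "ml.scripts.train.train_header_model",
    "ml.scripts.train.train_value_model",
    "ml.scripts.train.train_fusion",
    "ml.scripts.eval.run_eval_suite",
]


def build_commands(modules: Sequence[str]) -> List[List[str]]:
    """Turn module names into argument lists for subprocess."""
    return [PREFIX + [module] for module in modules]


def run_cycle(modules: Sequence[str], runner: Callable = subprocess.run) -> List[List[str]]:
    """Run each step in order and return the commands that completed.

    With the default runner, check=True raises CalledProcessError on the
    first non-zero exit code, so later steps are skipped.
    """
    executed = []
    for cmd in build_commands(modules):
        runner(cmd, check=True)
        executed.append(cmd)
    return executed


# Dry run: print the commands without executing them.
for cmd in build_commands(FULL_CYCLE):
    print(" ".join(cmd))
```

Calling `run_cycle(FULL_CYCLE)` from the repository root would execute the steps for real; the `runner` parameter exists only so the sequence can be exercised without launching processes.
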