# ML Detection Branch Architecture > Status: Active > Audience: Team, AI agents, curious developers > Purpose: Architecture reference for the hybrid pipeline and its decision > criteria ## Purpose This document describes what the `feat/ml-detection-improvement` branch aims to achieve, the architecture adopted, and how `autoresearch` should be used in Niamoto. The subject is no longer just "detect a column type". The goal is to produce detection good enough to auto-configure an import, build a `semantic_profile`, and propose useful affordances and suggestions without relying on an LLM. ## Product Objective The product objective is not academic perfection on fine-grained concepts. The system must primarily: - recognise the correct **role** of a column; - recognise a few **critical concepts** that change product behaviour; - perform well on new, multilingual, and partially anonymous datasets; - avoid high-confidence false positives; - feed a usable `semantic_profile` for transformer/widget suggestions. In practice, a confusion between `measurement.height` and `measurement.diameter` is less serious than a confusion between `identifier.plot` and `statistic.count`. ## Adopted Architecture The branch has converged on a local, compact, and explainable hybrid pipeline: 1. **Exact aliases** 2. **Header branch** 3. **Values branch** 4. **Fusion** 5. **Product semantic projection** ### 1. Exact Aliases Aliases provide a high-precision fast path for known column names. They remain essential, but must stay conservative: - an ambiguous alias should be disabled; - an exact alias must not bypass the classifier if doing so creates false positives at confidence 1.0. References: - `src/niamoto/core/imports/ml/alias_registry.py` - `src/niamoto/core/imports/ml/column_aliases.yaml` ### 2. Header Branch The `header` branch classifies the column name from a normalised enriched text. It is the best-performing branch when the header is informative. Technology: - TF-IDF char n-grams - Logistic Regression References: - `ml/scripts/train/train_header_model.py` - `src/niamoto/core/imports/ml/header_features.py` ### 3. Values Branch The `values` branch learns from statistics and patterns extracted from the values: - numerical distributions; - simple regexes; - booleans, dates, coordinates; - signals from encoded/categorical columns. It is less accurate alone than `header`, but it is decisive for: - anonymous headers; - ambiguous cases; - certain concepts that are strongly detectable by pattern. References: - `ml/scripts/train/train_value_model.py` - `src/niamoto/core/imports/ml/value_features.py` ### 4. Fusion Fusion combines the two branches in a shared concept space. It must not be a simple implicit average: - it receives the aligned probabilities from both branches; - it uses confidence and disagreement meta-features; - it can integrate targeted guardrails for frequent errors. Fusion is the right layer for correcting cases where one branch becomes too dominant on a particular domain. References: - `ml/scripts/train/train_fusion.py` - `src/niamoto/core/imports/ml/fusion_features.py` ### 5. Product Semantic Projection The real product output is not just a raw concept. The current branch projects detection towards: - a `role` - a `concept` - affordances and suggestions This layer is what aligns detection with the Niamoto product. References: - `src/niamoto/core/imports/ml/semantic_profile.py` - `src/niamoto/core/imports/ml/affordance_matcher.py` - `src/niamoto/core/imports/profiler.py` ## Why This Architecture This architecture is suited to the real constraints of the project: - limited annotated data relative to the number of concepts; - high heterogeneity of datasets; - multilingual; - need for explainability; - local execution; - short training cost; - product value closer to the correct role and correct suggestion than to the perfect fine-grained concept. A larger end-to-end approach would be more fragile here than a compact hybrid system with targeted rules. ## What We Are Really Trying to Improve The branch does not aim to maximise a simple classification score. It aims to improve: - the correct auto-configuration rate; - robustness on new datasets; - handling of anonymous columns; - the quality of output suggestions; - the ability to abstain or remain cautious on hard cases. ## Retained Evaluation Ground Truth The final metric targeted by the branch is the `NiamotoOfflineScore`, computed in `ml/scripts/eval/evaluation.py` and exposed by `ml/scripts/eval/evaluate.py`. The score combines: - `role_macro_f1` - `critical_concept_macro_f1` - `anonymous_role_macro_f1` - `pair_consistency` - `confidence_quality` - `dataset_outcome` The important holdouts are: - languages: `fr`, `es`, `de`, `zh` - families: `dwc_gbif`, `forest_inventory`, `tropical_field`, `research_traits` - anonymous columns ## Role of Autoresearch `autoresearch` must not decide the architecture. It must locally optimise an already well-framed system. Expected role: - propose bounded variants; - evaluate quickly; - keep improvements; - reject regressions; - accelerate tuning. What it must not do: - change the product ground truth; - optimise a proxy score at the expense of guardrails; - introduce unvalidated aggressive rules; - silently degrade a hard holdout to gain elsewhere. ## Recommended Autoresearch Programmes Three loop levels are useful: - `ml/programmes/niamoto-header-model.md` - `ml/programmes/niamoto-values-model.md` - `ml/programmes/niamoto-fusion.md` The `fusion` programme now plays the role of the **full-stack** programme. ## Current Guardrails Results observed on the branch indicate that certain domains must be treated as explicit guardrails: - `forest_inventory` - `tropical_field` - `fr` The largest risk identified at this stage is over-prediction of ill-suited concepts such as `statistic.count` on business-coded columns. ## Recommended Direction The right direction is not "more model". The right direction is: - better train/runtime consistency; - better evaluation; - better fusion; - cautious targeted rules; - better use of dataset-level context. The most promising part of the branch remains the alignment: `detection -> semantic_profile -> affordances -> suggestions` and not merely the optimisation of an isolated classifier.