ML Detection Branch Architecture
Status: Active
Audience: Team, AI agents, curious developers
Purpose: Architecture reference for the hybrid pipeline and its decision criteria
Purpose
This document describes what the feat/ml-detection-improvement branch aims to
achieve, the architecture adopted, and how autoresearch should be used in
Niamoto.
The subject is no longer just “detect a column type”. The goal is to produce
detection good enough to auto-configure an import, build a semantic_profile,
and propose useful affordances and suggestions without relying on an LLM.
Product Objective
The product objective is not academic perfection on fine-grained concepts. The system must primarily:
recognise the correct role of a column;
recognise a few critical concepts that change product behaviour;
perform well on new, multilingual, and partially anonymous datasets;
avoid high-confidence false positives;
feed a usable semantic_profile for transformer/widget suggestions.
In practice, a confusion between measurement.height and measurement.diameter
is less serious than a confusion between identifier.plot and statistic.count.
Adopted Architecture
The branch has converged on a local, compact, and explainable hybrid pipeline:
1. Exact aliases
2. Header branch
3. Values branch
4. Fusion
5. Product semantic projection
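As a rough sketch of how the five stages compose (all names here are hypothetical, not the actual Niamoto API), exact aliases short-circuit the learned branches, which otherwise feed fusion and then the product projection:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detection:
    """Hypothetical carrier for one column's detection state."""
    column: str
    concept: Optional[str] = None
    confidence: float = 0.0
    source: str = "none"  # stage that produced the decision


def detect_column(name: str, values: list, aliases: dict) -> Detection:
    det = Detection(column=name)
    key = name.strip().lower()
    # 1. Exact aliases: conservative, high-precision fast path.
    if key in aliases:
        # Confidence is capped below 1.0 per the conservatism rule below.
        det.concept, det.confidence, det.source = aliases[key], 0.99, "alias"
        return det
    # 2./3. The header and values branches would score the column here,
    # 4. fusion would combine their probabilities, and
    # 5. semantic projection would map the concept to role/affordances.
    return det
```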
1. Exact Aliases
Aliases provide a high-precision fast path for known column names. They remain essential, but must stay conservative:
an ambiguous alias should be disabled;
an exact alias must not bypass the classifier if doing so creates false positives at confidence 1.0.
References:
src/niamoto/core/imports/ml/alias_registry.py
src/niamoto/core/imports/ml/column_aliases.yaml
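A minimal sketch of the conservatism rules above; the alias table, ambiguity set, and cap value are illustrative, not the actual registry contents:

```python
# Hypothetical alias registry: ambiguous names are never resolved by
# alias alone, and even an exact hit carries a capped confidence so
# downstream layers can still veto a false positive.
ALIASES = {
    "dbh": "measurement.diameter",
    "plot_id": "identifier.plot",
}
AMBIGUOUS = {"code", "type", "value"}  # disabled: too many meanings
ALIAS_CONFIDENCE_CAP = 0.98  # avoid irreversible confidence-1.0 errors


def alias_lookup(header: str):
    """Return (concept, confidence) on a safe exact hit, else None."""
    key = header.strip().lower()
    if key in AMBIGUOUS or key not in ALIASES:
        return None
    return ALIASES[key], ALIAS_CONFIDENCE_CAP
```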
2. Header Branch
The header branch classifies the column name from a normalised enriched text.
It is the best-performing branch when the header is informative.
Technology:
TF-IDF char n-grams
Logistic Regression
References:
ml/scripts/train/train_header_model.py
src/niamoto/core/imports/ml/header_features.py
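Assuming a scikit-learn setup, the branch can be sketched as a TF-IDF char n-gram vectoriser feeding a logistic regression. The toy headers and labels below are purely illustrative; the real training data and normalisation live in the referenced scripts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy multilingual training set (illustrative only).
headers = ["plot_id", "parcelle_id", "dbh_cm", "diametre",
           "height_m", "hauteur"]
labels = ["identifier.plot", "identifier.plot",
          "measurement.diameter", "measurement.diameter",
          "measurement.height", "measurement.height"]

# Char n-grams make the model robust to spelling/language variants.
header_model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=True),
    LogisticRegression(max_iter=1000),
)
header_model.fit(headers, labels)
```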
3. Values Branch
The values branch learns from statistics and patterns extracted from the
values:
numerical distributions;
simple regexes;
booleans, dates, coordinates;
signals from encoded/categorical columns.
It is less accurate alone than header, but it is decisive for:
anonymous headers;
ambiguous cases;
certain concepts that are strongly detectable by pattern.
References:
ml/scripts/train/train_value_model.py
src/niamoto/core/imports/ml/value_features.py
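An illustrative subset of value-branch features: numeric distribution statistics plus simple pattern shares. Feature names and regexes are hypothetical; the actual feature set lives in value_features.py:

```python
import re
import statistics

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # simple ISO-date pattern


def value_features(values: list) -> dict:
    """Compute per-column features from raw string values."""
    nums = []
    for v in values:
        try:
            nums.append(float(v))
        except ValueError:
            pass
    n = len(values) or 1
    return {
        "numeric_share": len(nums) / n,
        "mean": statistics.fmean(nums) if nums else 0.0,
        "stdev": statistics.pstdev(nums) if len(nums) > 1 else 0.0,
        "date_share": sum(bool(DATE_RE.match(v)) for v in values) / n,
        "bool_share": sum(v.lower() in {"true", "false"} for v in values) / n,
        # Low distinct_ratio is a signal for encoded/categorical columns.
        "distinct_ratio": len(set(values)) / n,
    }
```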
4. Fusion
Fusion combines the two branches in a shared concept space. It must not be a simple implicit average:
it receives the aligned probabilities from both branches;
it uses confidence and disagreement meta-features;
it can integrate targeted guardrails for frequent errors.
Fusion is the right layer for correcting cases where one branch becomes too dominant on a particular domain.
References:
ml/scripts/train/train_fusion.py
src/niamoto/core/imports/ml/fusion_features.py
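The feature assembly for fusion can be sketched as follows. The trained fusion model itself is elided, and every feature name here is hypothetical:

```python
def fusion_features(header_probs: dict, value_probs: dict) -> dict:
    """Align both branches' probabilities in a shared concept space and
    add confidence/disagreement meta-features."""
    concepts = sorted(set(header_probs) | set(value_probs))
    feats = {}
    for c in concepts:
        feats[f"header::{c}"] = header_probs.get(c, 0.0)
        feats[f"value::{c}"] = value_probs.get(c, 0.0)
    top_h = max(header_probs, key=header_probs.get)
    top_v = max(value_probs, key=value_probs.get)
    feats["header_conf"] = header_probs[top_h]
    feats["value_conf"] = value_probs[top_v]
    # Disagreement is exactly the signal that prevents fusion from being
    # a simple implicit average of the two branches.
    feats["branches_disagree"] = float(top_h != top_v)
    return feats
```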
5. Product Semantic Projection
The real product output is not just a raw concept. The current branch projects detection towards:
a role;
a concept;
affordances and suggestions.
This layer is what aligns detection with the Niamoto product.
References:
src/niamoto/core/imports/ml/semantic_profile.py
src/niamoto/core/imports/ml/affordance_matcher.py
src/niamoto/core/imports/profiler.py
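A minimal sketch of the projection, assuming hypothetical concept-to-role and role-to-affordance tables; the real mappings live in the referenced modules:

```python
from dataclasses import dataclass, field

# Illustrative mapping tables (not the actual Niamoto taxonomy).
CONCEPT_ROLE = {
    "identifier.plot": "identifier",
    "measurement.diameter": "measurement",
    "statistic.count": "statistic",
}
ROLE_AFFORDANCES = {
    "identifier": ["join_key"],
    "measurement": ["histogram_widget", "summary_stats"],
    "statistic": ["aggregate_transform"],
}


@dataclass
class SemanticProfile:
    concept: str
    role: str
    affordances: list = field(default_factory=list)


def project(concept: str) -> SemanticProfile:
    """Project a fine-grained concept onto the product-level profile."""
    role = CONCEPT_ROLE.get(concept, "unknown")
    return SemanticProfile(concept, role, ROLE_AFFORDANCES.get(role, []))
```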
Why This Architecture
This architecture is suited to the real constraints of the project:
limited annotated data relative to the number of concepts;
high heterogeneity of datasets;
multilingual datasets;
need for explainability;
local execution;
low training cost;
product value closer to the correct role and correct suggestion than to the perfect fine-grained concept.
A larger end-to-end approach would be more fragile here than a compact hybrid system with targeted rules.
What We Are Really Trying to Improve
The branch does not aim to maximise a simple classification score. It aims to improve:
the correct auto-configuration rate;
robustness on new datasets;
handling of anonymous columns;
the quality of output suggestions;
the ability to abstain or remain cautious on hard cases.
Retained Evaluation Ground Truth
The final metric targeted by the branch is the NiamotoOfflineScore, computed
in ml/scripts/eval/evaluation.py and exposed by
ml/scripts/eval/evaluate.py.
The score combines:
role_macro_f1
critical_concept_macro_f1
anonymous_role_macro_f1
pair_consistency
confidence_quality
dataset_outcome
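Assuming a simple weighted combination (the weights below are placeholders; the authoritative formula lives in ml/scripts/eval/evaluation.py), the score might be assembled as:

```python
# Placeholder weights summing to 1.0; the real values are defined in the
# evaluation scripts, not here.
WEIGHTS = {
    "role_macro_f1": 0.25,
    "critical_concept_macro_f1": 0.20,
    "anonymous_role_macro_f1": 0.15,
    "pair_consistency": 0.10,
    "confidence_quality": 0.15,
    "dataset_outcome": 0.15,
}


def offline_score(components: dict) -> float:
    """Combine all component metrics into one NiamotoOfflineScore-like
    scalar; every component must be present."""
    assert set(components) == set(WEIGHTS), "all components required"
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)
```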
The important holdouts are:
languages: fr, es, de, zh;
families: dwc_gbif, forest_inventory, tropical_field, research_traits;
anonymous columns.
Role of Autoresearch
autoresearch must not decide the architecture. It must locally optimise an
already well-framed system.
Expected role:
propose bounded variants;
evaluate quickly;
keep improvements;
reject regressions;
accelerate tuning.
What it must not do:
change the product ground truth;
optimise a proxy score at the expense of guardrails;
introduce unvalidated aggressive rules;
silently degrade a hard holdout to gain elsewhere.
Recommended Autoresearch Programmes
Three loop levels are useful:
ml/programmes/niamoto-header-model.md
ml/programmes/niamoto-values-model.md
ml/programmes/niamoto-fusion.md
The fusion programme now plays the role of the full-stack programme.
Current Guardrails
Results observed on the branch indicate that certain domains must be treated as explicit guardrails:
forest_inventory
tropical_field
fr
The largest risk identified at this stage is over-prediction of ill-suited
concepts such as statistic.count on business-coded columns.
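A cautious, targeted guardrail against that over-prediction might look as follows; the heuristic and thresholds are purely illustrative, not the branch's actual rule:

```python
def looks_business_coded(values: list) -> bool:
    """Heuristic: few distinct, gappy integer levels over enough rows
    suggests business codes rather than genuine counts."""
    distinct = sorted(set(values))
    few_levels = len(distinct) <= 10 and len(values) >= 20
    gappy = few_levels and (distinct[-1] - distinct[0] + 1) > 2 * len(distinct)
    return few_levels and gappy


def apply_count_guardrail(concept: str, confidence: float, values: list):
    """Demote high-confidence statistic.count predictions on columns that
    look business-coded, forcing caution without erasing the prediction."""
    if concept == "statistic.count" and looks_business_coded(values):
        return concept, min(confidence, 0.4)
    return concept, confidence
```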
Recommended Direction
The right direction is not “more model”. The right direction is:
better train/runtime consistency;
better evaluation;
better fusion;
cautious targeted rules;
better use of dataset-level context.
The most promising part of the branch remains the alignment:
detection -> semantic_profile -> affordances -> suggestions
and not merely the optimisation of an isolated classifier.