ML Detection Branch Architecture

Status: Active
Audience: Team, AI agents, curious developers
Purpose: Architecture reference for the hybrid pipeline and its decision criteria

Purpose

This document describes what the feat/ml-detection-improvement branch aims to achieve, the architecture adopted, and how autoresearch should be used in Niamoto.

The task is no longer just to “detect a column type”. The goal is to produce detections good enough to auto-configure an import, build a semantic_profile, and propose useful affordances and suggestions without relying on an LLM.

Product Objective

The product objective is not academic perfection on fine-grained concepts. The system must primarily:

  • recognise the correct role of a column;

  • recognise a few critical concepts that change product behaviour;

  • perform well on new, multilingual, and partially anonymous datasets;

  • avoid high-confidence false positives;

  • feed a usable semantic_profile for transformer/widget suggestions.

In practice, a confusion between measurement.height and measurement.diameter is less serious than a confusion between identifier.plot and statistic.count.

Adopted Architecture

The branch has converged on a local, compact, and explainable hybrid pipeline:

  1. Exact aliases

  2. Header branch

  3. Values branch

  4. Fusion

  5. Product semantic projection

1. Exact Aliases

Aliases provide a high-precision fast path for known column names. They remain essential, but must stay conservative:

  • an ambiguous alias should be disabled;

  • an exact alias must not bypass the classifier if doing so creates false positives at confidence 1.0.

References:

  • src/niamoto/core/imports/ml/alias_registry.py

  • src/niamoto/core/imports/ml/column_aliases.yaml
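The conservative alias policy above can be sketched as a simple lookup. This is a minimal, hypothetical illustration, not the actual `alias_registry.py` API: the names `AMBIGUOUS`, `ALIASES`, and `alias_lookup` are invented, and the capped confidence value is an assumption.

```python
# Hypothetical sketch of a conservative exact-alias fast path.
AMBIGUOUS = {"code"}  # aliases disabled because they map to several concepts

ALIASES = {
    "plot_id": ("identifier.plot", 0.99),   # confidence deliberately capped below 1.0
    "dbh": ("measurement.diameter", 0.99),
}

def alias_lookup(header: str):
    """Return (concept, confidence) for an exact, unambiguous alias, else None."""
    key = header.strip().lower()
    if key in AMBIGUOUS:
        return None  # ambiguous aliases fall through to the classifier
    return ALIASES.get(key)

print(alias_lookup("Plot_ID"))  # ('identifier.plot', 0.99)
print(alias_lookup("code"))     # None -> classifier decides
```

Keeping ambiguous aliases out of the fast path is what prevents the confidence-1.0 false positives the section warns about.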

2. Header Branch

The header branch classifies the column name from a normalised, enriched text representation of the header. It is the best-performing branch when the header is informative.

Technology:

  • TF-IDF char n-grams

  • Logistic Regression

References:

  • ml/scripts/train/train_header_model.py

  • src/niamoto/core/imports/ml/header_features.py
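A minimal sketch of the header branch's technology stack (TF-IDF character n-grams feeding a logistic regression), assuming scikit-learn. The toy headers and labels below are illustrative; the real model is trained by `train_header_model.py` on the project's annotated data, and the n-gram range shown is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; the real branch trains on annotated, multilingual headers.
headers = ["plot_id", "parcelle_id", "dbh_cm", "diametre",
           "tree_height_m", "hauteur"]
labels = ["identifier.plot", "identifier.plot",
          "measurement.diameter", "measurement.diameter",
          "measurement.height", "measurement.height"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-grams
    LogisticRegression(max_iter=1000),
)
model.fit(headers, labels)
print(model.predict(["plot_code"]))
```

Character n-grams are what make the branch robust to multilingual and slightly varied headers ("diametre" and "dbh_cm" share no whole word, but each shares fragments with its class).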

3. Values Branch

The values branch learns from statistics and patterns extracted from the values:

  • numerical distributions;

  • simple regexes;

  • booleans, dates, coordinates;

  • signals from encoded/categorical columns.

It is less accurate on its own than the header branch, but it is decisive for:

  • anonymous headers;

  • ambiguous cases;

  • certain concepts that are strongly detectable by pattern.

References:

  • ml/scripts/train/train_value_model.py

  • src/niamoto/core/imports/ml/value_features.py
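The kinds of statistics and patterns listed above can be sketched as a small feature extractor. This is an illustrative, stdlib-only approximation of what `value_features.py` might compute; the feature names and regexes are invented for the example.

```python
import re
import statistics

def value_features(values):
    """Extract simple distribution and pattern signals from a column sample."""
    n = len(values)
    nums = [float(v) for v in values if re.fullmatch(r"-?\d+(\.\d+)?", str(v))]
    feats = {
        "numeric_ratio": len(nums) / n,          # numerical distribution signal
        "distinct_ratio": len(set(values)) / n,  # categorical / encoded signal
        "bool_like": all(str(v).lower() in {"true", "false", "0", "1"}
                         for v in values),
        "date_like_ratio": sum(bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(v)))
                               for v in values) / n,
    }
    if nums:  # only meaningful when the column is (partly) numeric
        feats["mean"] = statistics.mean(nums)
        feats["stdev"] = statistics.pstdev(nums)
    return feats

f = value_features(["12.5", "30.1", "7.9", "15.0"])
print(f["numeric_ratio"])  # 1.0
```

Because these features ignore the header entirely, they are exactly what survives when columns are anonymous.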

4. Fusion

Fusion combines the two branches in a shared concept space. It must not be a simple implicit average:

  • it receives the aligned probabilities from both branches;

  • it uses confidence and disagreement meta-features;

  • it can integrate targeted guardrails for frequent errors.

Fusion is the right layer for correcting cases where one branch becomes too dominant on a particular domain.

References:

  • ml/scripts/train/train_fusion.py

  • src/niamoto/core/imports/ml/fusion_features.py
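A sketch of what "not a simple implicit average" means: the fusion step receives aligned probability vectors and derives confidence and disagreement meta-features before combining them. The real fusion layer is a trained model in `train_fusion.py`; the confidence-weighting scheme below is an invented stand-in to make the idea concrete.

```python
import numpy as np

def fuse(p_header, p_values):
    """Combine two aligned probability vectors using confidence meta-features.

    Illustrative only: the real layer learns the combination and can add
    targeted guardrails for frequent errors.
    """
    p_header, p_values = np.asarray(p_header), np.asarray(p_values)
    conf_h, conf_v = p_header.max(), p_values.max()         # branch confidences
    disagreement = 0.5 * np.abs(p_header - p_values).sum()  # total variation distance
    w = conf_h / (conf_h + conf_v)                          # confidence-based weight
    fused = w * p_header + (1 - w) * p_values
    fused = fused / fused.sum()                             # renormalise
    meta = {"conf_header": conf_h, "conf_values": conf_v,
            "disagreement": disagreement}
    return fused, meta

probs, meta = fuse([0.8, 0.15, 0.05], [0.3, 0.6, 0.1])
print(probs.argmax(), round(meta["disagreement"], 2))
```

The disagreement meta-feature is what lets the fusion layer abstain or damp confidence when one branch becomes too dominant on a particular domain.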

5. Product Semantic Projection

The real product output is not just a raw concept. The current branch projects detection towards:

  • a role

  • a concept

  • affordances and suggestions

This layer is what aligns detection with the Niamoto product.

References:

  • src/niamoto/core/imports/ml/semantic_profile.py

  • src/niamoto/core/imports/ml/affordance_matcher.py

  • src/niamoto/core/imports/profiler.py
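The projection from a raw concept to a product-level role, affordances, and suggestions can be sketched as a mapping layer. The tables and function below are purely illustrative; the actual mappings live in `semantic_profile.py` and `affordance_matcher.py`.

```python
# Hypothetical concept -> role and role -> affordance tables (illustrative only).
CONCEPT_TO_ROLE = {
    "identifier.plot": "identifier",
    "measurement.diameter": "measurement",
    "measurement.height": "measurement",
    "statistic.count": "statistic",
}
ROLE_AFFORDANCES = {
    "identifier": ["join_key", "group_by"],
    "measurement": ["histogram", "summary_stats"],
    "statistic": ["bar_chart"],
}

def semantic_profile(concept: str, confidence: float) -> dict:
    """Project a detected concept onto the product-facing semantic profile."""
    role = CONCEPT_TO_ROLE.get(concept, "unknown")
    return {
        "role": role,
        "concept": concept,
        "confidence": confidence,
        "affordances": ROLE_AFFORDANCES.get(role, []),
    }

profile = semantic_profile("measurement.diameter", 0.87)
print(profile["role"], profile["affordances"])
```

Note how a confusion within a role (height vs diameter) leaves the projected affordances unchanged, which is why the document treats cross-role confusions as the serious ones.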

Why This Architecture

This architecture is suited to the real constraints of the project:

  • limited annotated data relative to the number of concepts;

  • high heterogeneity of datasets;

  • multilingual headers and values;

  • need for explainability;

  • local execution;

  • low training cost;

  • product value closer to the correct role and correct suggestion than to the perfect fine-grained concept.

A larger end-to-end approach would be more fragile here than a compact hybrid system with targeted rules.

What We Are Really Trying to Improve

The branch does not aim to maximise a simple classification score. It aims to improve:

  • the correct auto-configuration rate;

  • robustness on new datasets;

  • handling of anonymous columns;

  • the quality of output suggestions;

  • the ability to abstain or remain cautious on hard cases.

Retained Evaluation Ground Truth

The final metric targeted by the branch is the NiamotoOfflineScore, computed in ml/scripts/eval/evaluation.py and exposed by ml/scripts/eval/evaluate.py.

The score combines:

  • role_macro_f1

  • critical_concept_macro_f1

  • anonymous_role_macro_f1

  • pair_consistency

  • confidence_quality

  • dataset_outcome
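The combination can be sketched as a weighted sum over the listed components. The weights below are placeholders, not the real ones: the actual formula lives in `ml/scripts/eval/evaluation.py` and is not documented here.

```python
# Placeholder weights (illustrative); the real values are defined in evaluation.py.
WEIGHTS = {
    "role_macro_f1": 0.25,
    "critical_concept_macro_f1": 0.20,
    "anonymous_role_macro_f1": 0.15,
    "pair_consistency": 0.10,
    "confidence_quality": 0.15,
    "dataset_outcome": 0.15,
}

def offline_score(components: dict) -> float:
    """Weighted combination of the evaluation components (sketch)."""
    assert set(components) == set(WEIGHTS), "missing or extra components"
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

score = offline_score({k: 0.8 for k in WEIGHTS})
print(round(score, 3))  # 0.8 when every component is 0.8
```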

The important holdouts are:

  • languages: fr, es, de, zh

  • families: dwc_gbif, forest_inventory, tropical_field, research_traits

  • anonymous columns

Role of Autoresearch

autoresearch must not decide the architecture. It must locally optimise an already well-framed system.

Expected role:

  • propose bounded variants;

  • evaluate quickly;

  • keep improvements;

  • reject regressions;

  • accelerate tuning.

What it must not do:

  • change the product ground truth;

  • optimise a proxy score at the expense of guardrails;

  • introduce unvalidated aggressive rules;

  • silently degrade a hard holdout to gain elsewhere.
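The expected behaviour and the prohibitions above can be sketched as a bounded accept/reject loop: a variant is kept only if it improves the overall score without regressing any guarded holdout. Everything below (`tune`, the toy `propose`/`evaluate`) is a hypothetical illustration, not autoresearch's actual interface.

```python
import copy

def tune(baseline, propose, evaluate, steps=10):
    """Bounded hill-climbing sketch: keep improvements, reject regressions.

    evaluate() returns (overall_score, {holdout_name: holdout_score}); a
    variant that degrades any holdout is rejected even if its score is higher.
    """
    best = copy.deepcopy(baseline)
    best_score, best_holdouts = evaluate(best)
    for _ in range(steps):
        variant = propose(best)
        score, holdouts = evaluate(variant)
        regressed = any(holdouts[h] < best_holdouts[h] for h in best_holdouts)
        if score > best_score and not regressed:
            best, best_score, best_holdouts = variant, score, holdouts
    return best, best_score

# Toy demo: the optimum is c == 3, and the guarded holdout stays constant.
def propose(cfg):
    return {"c": cfg["c"] + 1}

def evaluate(cfg):
    return 1 - abs(cfg["c"] - 3) * 0.1, {"fr": 0.9}

best, best_score = tune({"c": 0}, propose, evaluate, steps=5)
print(best, round(best_score, 2))  # {'c': 3} 1.0
```

The `regressed` check is the point: it encodes "never silently degrade a hard holdout to gain elsewhere".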

Current Guardrails

Results observed on the branch indicate that certain domains must be treated as explicit guardrails:

  • forest_inventory

  • tropical_field

  • fr

The largest risk identified at this stage is over-prediction of ill-suited concepts such as statistic.count on business-coded columns.