ML Detection Branch Architecture

Status: Active
Audience: Team, AI agents, curious developers
Purpose: Architecture reference for the hybrid pipeline and its decision criteria

Purpose

This document describes what the feat/ml-detection-improvement branch aims to achieve, the architecture adopted, and how autoresearch should be used in Niamoto.

The task is no longer just to “detect a column type”. The goal is to produce detections good enough to auto-configure an import, build a semantic_profile, and propose useful affordances and suggestions without relying on an LLM.

Product Objective

The product objective is not academic perfection on fine-grained concepts. The system must primarily:

  • recognise the correct role of a column;

  • recognise a few critical concepts that change product behaviour;

  • perform well on new, multilingual, and partially anonymous datasets;

  • avoid high-confidence false positives;

  • feed a usable semantic_profile for transformer/widget suggestions.

In practice, a confusion between measurement.height and measurement.diameter is less serious than a confusion between identifier.plot and statistic.count.

Adopted Architecture

The branch has converged on a local, compact, and explainable hybrid pipeline:

  1. Exact aliases

  2. Header branch

  3. Values branch

  4. Fusion

  5. Product semantic projection

1. Exact Aliases

Aliases provide a high-precision fast path for known column names. They remain essential, but must stay conservative:

  • an ambiguous alias should be disabled;

  • an exact alias must not bypass the classifier if doing so creates false positives at confidence 1.0.

References:

  • src/niamoto/core/imports/ml/alias_registry.py

  • src/niamoto/core/imports/ml/column_aliases.yaml
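The conservative alias policy above can be sketched as a simple lookup. This is a minimal, hypothetical illustration, not the actual `alias_registry.py` API: the names `AMBIGUOUS`, `ALIASES`, and `alias_lookup` are invented, and the capped confidence value is an assumption.

```python
# Hypothetical sketch of a conservative exact-alias fast path.
AMBIGUOUS = {"code"}  # aliases disabled because they map to several concepts

ALIASES = {
    "plot_id": ("identifier.plot", 0.99),   # confidence deliberately capped below 1.0
    "dbh": ("measurement.diameter", 0.99),
}

def alias_lookup(header: str):
    """Return (concept, confidence) for an exact, unambiguous alias, else None."""
    key = header.strip().lower()
    if key in AMBIGUOUS:
        return None  # ambiguous aliases fall through to the classifier
    return ALIASES.get(key)

print(alias_lookup("Plot_ID"))  # ('identifier.plot', 0.99)
print(alias_lookup("code"))     # None -> classifier decides
```

Keeping ambiguous aliases out of the fast path is what prevents the confidence-1.0 false positives the section warns about.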

2. Header Branch

The header branch classifies the column name from a normalised, enriched text representation of the header. It is the best-performing branch when the header is informative.

Technology:

  • TF-IDF char n-grams

  • Logistic Regression

References:

  • ml/scripts/train/train_header_model.py

  • src/niamoto/core/imports/ml/header_features.py
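A minimal sketch of the header branch's technology stack (TF-IDF character n-grams feeding a logistic regression), assuming scikit-learn. The toy headers and labels below are illustrative; the real model is trained by `train_header_model.py` on the project's annotated data, and the n-gram range shown is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; the real branch trains on annotated, multilingual headers.
headers = ["plot_id", "parcelle_id", "dbh_cm", "diametre",
           "tree_height_m", "hauteur"]
labels = ["identifier.plot", "identifier.plot",
          "measurement.diameter", "measurement.diameter",
          "measurement.height", "measurement.height"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-grams
    LogisticRegression(max_iter=1000),
)
model.fit(headers, labels)
print(model.predict(["plot_code"]))
```

Character n-grams are what make the branch robust to multilingual and slightly varied headers ("diametre" and "dbh_cm" share no whole word, but each shares fragments with its class).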

3. Values Branch

The values branch learns from statistics and patterns extracted from the values:

  • numerical distributions;

  • simple regexes;

  • booleans, dates, coordinates;

  • signals from encoded/categorical columns.

It is less accurate on its own than the header branch, but it is decisive for:

  • anonymous headers;

  • ambiguous cases;

  • certain concepts that are strongly detectable by pattern.

References:

  • ml/scripts/train/train_value_model.py

  • src/niamoto/core/imports/ml/value_features.py
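The kinds of statistics and patterns listed above can be sketched as a small feature extractor. This is an illustrative, stdlib-only approximation of what `value_features.py` might compute; the feature names and regexes are invented for the example.

```python
import re
import statistics

def value_features(values):
    """Extract simple distribution and pattern signals from a column sample."""
    n = len(values)
    nums = [float(v) for v in values if re.fullmatch(r"-?\d+(\.\d+)?", str(v))]
    feats = {
        "numeric_ratio": len(nums) / n,          # numerical distribution signal
        "distinct_ratio": len(set(values)) / n,  # categorical / encoded signal
        "bool_like": all(str(v).lower() in {"true", "false", "0", "1"}
                         for v in values),
        "date_like_ratio": sum(bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(v)))
                               for v in values) / n,
    }
    if nums:  # only meaningful when the column is (partly) numeric
        feats["mean"] = statistics.mean(nums)
        feats["stdev"] = statistics.pstdev(nums)
    return feats

f = value_features(["12.5", "30.1", "7.9", "15.0"])
print(f["numeric_ratio"])  # 1.0
```

Because these features ignore the header entirely, they are exactly what survives when columns are anonymous.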

4. Fusion

Fusion combines the two branches in a shared concept space. It must not be a simple implicit average:

  • it receives the aligned probabilities from both branches;

  • it uses confidence and disagreement meta-features;

  • it can integrate targeted guardrails for frequent errors.

Fusion is the right layer for correcting cases where one branch becomes too dominant on a particular domain.

References:

  • ml/scripts/train/train_fusion.py

  • src/niamoto/core/imports/ml/fusion_features.py
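A sketch of what "not a simple implicit average" means: the fusion step receives aligned probability vectors and derives confidence and disagreement meta-features before combining them. The real fusion layer is a trained model in `train_fusion.py`; the confidence-weighting scheme below is an invented stand-in to make the idea concrete.

```python
import numpy as np

def fuse(p_header, p_values):
    """Combine two aligned probability vectors using confidence meta-features.

    Illustrative only: the real layer learns the combination and can add
    targeted guardrails for frequent errors.
    """
    p_header, p_values = np.asarray(p_header), np.asarray(p_values)
    conf_h, conf_v = p_header.max(), p_values.max()         # branch confidences
    disagreement = 0.5 * np.abs(p_header - p_values).sum()  # total variation distance
    w = conf_h / (conf_h + conf_v)                          # confidence-based weight
    fused = w * p_header + (1 - w) * p_values
    fused = fused / fused.sum()                             # renormalise
    meta = {"conf_header": conf_h, "conf_values": conf_v,
            "disagreement": disagreement}
    return fused, meta

probs, meta = fuse([0.8, 0.15, 0.05], [0.3, 0.6, 0.1])
print(probs.argmax(), round(meta["disagreement"], 2))
```

The disagreement meta-feature is what lets the fusion layer abstain or damp confidence when one branch becomes too dominant on a particular domain.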

5. Product Semantic Projection

The real product output is not just a raw concept. The current branch projects detection towards:

  • a role

  • a concept

  • affordances and suggestions

This layer is what aligns detection with the Niamoto product.

References:

  • src/niamoto/core/imports/ml/semantic_profile.py

  • src/niamoto/core/imports/ml/affordance_matcher.py

  • src/niamoto/core/imports/profiler.py
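The projection from a raw concept to a product-level role, affordances, and suggestions can be sketched as a mapping layer. The tables and function below are purely illustrative; the actual mappings live in `semantic_profile.py` and `affordance_matcher.py`.

```python
# Hypothetical concept -> role and role -> affordance tables (illustrative only).
CONCEPT_TO_ROLE = {
    "identifier.plot": "identifier",
    "measurement.diameter": "measurement",
    "measurement.height": "measurement",
    "statistic.count": "statistic",
}
ROLE_AFFORDANCES = {
    "identifier": ["join_key", "group_by"],
    "measurement": ["histogram", "summary_stats"],
    "statistic": ["bar_chart"],
}

def semantic_profile(concept: str, confidence: float) -> dict:
    """Project a detected concept onto the product-facing semantic profile."""
    role = CONCEPT_TO_ROLE.get(concept, "unknown")
    return {
        "role": role,
        "concept": concept,
        "confidence": confidence,
        "affordances": ROLE_AFFORDANCES.get(role, []),
    }

profile = semantic_profile("measurement.diameter", 0.87)
print(profile["role"], profile["affordances"])
```

Note how a confusion within a role (height vs diameter) leaves the projected affordances unchanged, which is why the document treats cross-role confusions as the serious ones.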

Why This Architecture

This architecture is suited to the real constraints of the project:

  • limited annotated data relative to the number of concepts;

  • high heterogeneity of datasets;

  • multilingual headers and values;

  • need for explainability;

  • local execution;

  • low training cost;

  • product value closer to the correct role and correct suggestion than to the perfect fine-grained concept.

A larger end-to-end approach would be more fragile here than a compact hybrid system with targeted rules.

What We Are Really Trying to Improve

The branch does not aim to maximise a simple classification score. It aims to improve:

  • the correct auto-configuration rate;

  • robustness on new datasets;

  • handling of anonymous columns;

  • the quality of output suggestions;

  • the ability to abstain or remain cautious on hard cases.

Retained Evaluation Ground Truth

The final metric targeted by the branch is the NiamotoOfflineScore, computed in ml/scripts/eval/evaluation.py and exposed by ml/scripts/eval/evaluate.py.

The score combines:

  • role_macro_f1

  • critical_concept_macro_f1

  • anonymous_role_macro_f1

  • pair_consistency

  • confidence_quality

  • dataset_outcome
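The combination can be sketched as a weighted sum over the listed components. The weights below are placeholders, not the real ones: the actual formula lives in `ml/scripts/eval/evaluation.py` and is not documented here.

```python
# Placeholder weights (illustrative); the real values are defined in evaluation.py.
WEIGHTS = {
    "role_macro_f1": 0.25,
    "critical_concept_macro_f1": 0.20,
    "anonymous_role_macro_f1": 0.15,
    "pair_consistency": 0.10,
    "confidence_quality": 0.15,
    "dataset_outcome": 0.15,
}

def offline_score(components: dict) -> float:
    """Weighted combination of the evaluation components (sketch)."""
    assert set(components) == set(WEIGHTS), "missing or extra components"
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

score = offline_score({k: 0.8 for k in WEIGHTS})
print(round(score, 3))  # 0.8 when every component is 0.8
```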

The important holdouts are:

  • languages: fr, es, de, zh

  • families: dwc_gbif, forest_inventory, tropical_field, research_traits

  • anonymous columns

Role of Autoresearch

autoresearch must not decide the architecture. It must locally optimise an already well-framed system.

Expected role:

  • propose bounded variants;

  • evaluate quickly;

  • keep improvements;

  • reject regressions;

  • accelerate tuning.

What it must not do:

  • change the product ground truth;

  • optimise a proxy score at the expense of guardrails;

  • introduce unvalidated aggressive rules;

  • silently degrade a hard holdout to gain elsewhere.
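The expected behaviour and the prohibitions above can be sketched as a bounded accept/reject loop: a variant is kept only if it improves the overall score without regressing any guarded holdout. Everything below (`tune`, the toy `propose`/`evaluate`) is a hypothetical illustration, not autoresearch's actual interface.

```python
import copy

def tune(baseline, propose, evaluate, steps=10):
    """Bounded hill-climbing sketch: keep improvements, reject regressions.

    evaluate() returns (overall_score, {holdout_name: holdout_score}); a
    variant that degrades any holdout is rejected even if its score is higher.
    """
    best = copy.deepcopy(baseline)
    best_score, best_holdouts = evaluate(best)
    for _ in range(steps):
        variant = propose(best)
        score, holdouts = evaluate(variant)
        regressed = any(holdouts[h] < best_holdouts[h] for h in best_holdouts)
        if score > best_score and not regressed:
            best, best_score, best_holdouts = variant, score, holdouts
    return best, best_score

# Toy demo: the optimum is c == 3, and the guarded holdout stays constant.
def propose(cfg):
    return {"c": cfg["c"] + 1}

def evaluate(cfg):
    return 1 - abs(cfg["c"] - 3) * 0.1, {"fr": 0.9}

best, best_score = tune({"c": 0}, propose, evaluate, steps=5)
print(best, round(best_score, 2))  # {'c': 3} 1.0
```

The `regressed` check is the point: it encodes "never silently degrade a hard holdout to gain elsewhere".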

Current Guardrails

Results observed on the branch indicate that certain domains must be treated as explicit guardrails:

  • forest_inventory

  • tropical_field

  • fr

The largest risk identified at this stage is over-prediction of ill-suited concepts such as statistic.count on business-coded columns.