Automatic column detection¶

Status: Active Audience: Team, AI agents, curious developers Purpose: Conceptual overview of the ML detection system and its current state

What it does¶

You import a forest inventory CSV file into Niamoto. Instead of manually configuring each column (“this is a diameter, this is a species, this is coordinates”), Niamoto automatically detects the content and proposes a complete dashboard: diameter histogram, distribution map, breakdown by family.

You only need to adjust if necessary.

Why it is necessary¶

Every team names its columns differently:

What it is	French Guiana	France IFN	FIA (US)	Spain	Anonymous
Diameter	`diam`	`C13`	`DIA`	`dap`	`X1`
Height	`haut`	`HTOT`	`HT`	`altura`	`col_2`
Species	`espece`	`ESPAR`	`SPCD`	`especie`	`X5`
Latitude	`lat`	`YL`	`LAT`	`latitud`	`col_3`

Without automatic detection, every user must manually configure their columns before being able to visualise anything. This is a barrier to adoption.

How it works¶

The system detects the role of each column — that is, what can be done with it:

Detected role	What Niamoto proposes
Numeric measurement	Histogram, statistical summary, scatter plot
Taxonomy	Breakdown by family/genus, sunburst
Geographic coordinates	Interactive map
Temporal data	Timeline, year filter
Category	Bar chart, donut chart
Identifier	Join key between tables

Two complementary signals are combined to achieve this:

The column name — diametre and diametro share the same letter sequences. A character n-gram model naturally groups them together, even across related languages.
The values — a diameter follows a log-normal distribution between 5 and 300, coordinates lie between -90 and 90, a species name follows the “Genus species” format. When the column name is anonymous (X1), the values take over.

Both are fused into a final prediction. The user can then refine each transformer/widget pair in the GUI.

Training data¶

The model is trained on 2,540 labelled columns from:

104 real datasets: IFN France, FIA US, GBIF variants, GUYADIV, tested Niamoto instances, TAXREF, ETS extensions, sPlotOpen, and several tropical or research-oriented ecological datasets
6 continents, 8 languages (EN, FR, ES, PT, DE, ID + anonymous headers)
61 concepts organised into roles: taxonomy, location, measurements, environment, statistics, temporal, categories, identifiers

All detection runs locally with scikit-learn (~3 MB of dependencies). No network required, no LLM.

Contributing¶

To improve detection for a poorly recognised column type:

Add aliases in src/niamoto/core/imports/ml/column_aliases.yaml — no ML needed, just a YAML file. Example: add "circonference" as an alias for measurement.diameter in French.
Add training data in ml/scripts/data/build_gold_set.py — label the columns from a new dataset and reference it in the source list.
Retrain: uv run python -m ml.scripts.train.train_header_model && uv run python -m ml.scripts.train.train_value_model

Current scores¶

Model	Macro-F1	What this means
Header (column name)	0.77	The strongest branch when names are informative
Values (statistical values)	0.38	Values alone remain ambiguous, but they are improving on anonymous and numeric cases
Fusion (header + values)	ProductScore 80.84 / GlobalScore 82.76	Combined signal from both branches on the current offline benchmark

The header score is the most important because in the majority of cases columns have informative names. The values model kicks in when the name is anonymous or ambiguous.

Known limitations¶

Very rare columns (< 5 examples in the gold set) are grouped under generic categories
Confidence calibration is not yet in place — the model cannot yet say “I am 85% confident”
The values model remains weak at distinguishing two measurement types from each other (diameter vs height) — but this is not blocking since the “measurement” role is sufficient to suggest a histogram

Technical architecture¶

Imported CSV
     │
     ├── Column name ──→ TF-IDF char n-grams ──→ LogisticRegression
     │                                                   │
     ├── Values ──→ 37 statistical features ──→ HistGradientBoosting
     │                                                   │
     └── Fusion ──→ LogReg calibrated on probabilities from both branches
                          │
                   Detected role + confidence
                          │
                   Suggested transformer/widget pairs

Academic References¶

Project	Year	Approach	Features	Performance	Ecological Relevance	Status
Sherlock	2019	Deep NN	1,588	F1: 0.89	Low (generic types)	Abandoned
Sato	2020	Hybrid DL + Topic	1,588+	F1: 0.92	Low	Inactive
Pythagoras	2024	GNN	Graph-based	F1: 0.94	Medium (numeric)	Active
GAIT	2024	GNN variants	Multi-graph	F1: 0.93	Medium	Active
GitTables	2023	Dataset	N/A	Benchmark	High (diverse)	Active

Niamoto’s approach differs from these academic systems: it uses a lightweight hybrid pipeline (TF-IDF + HistGradientBoosting + Fusion) optimized for ecological data, running fully offline with scikit-learn (~3 MB).

Key files¶

File	Purpose
`src/niamoto/core/imports/ml/alias_registry.py`	Name → concept matching via multilingual aliases
`src/niamoto/core/imports/ml/column_aliases.yaml`	25 concepts × 8 languages
`ml/scripts/eval/evaluation.py`	Evaluation harness (GroupKFold, holdouts)
`ml/scripts/data/concept_taxonomy.py`	Fusion of 111 fine concepts → 61 concepts
`src/niamoto/core/imports/profiler.py`	DataProfiler with `ml_mode=auto/off/force`
`ml/scripts/data/build_gold_set.py`	Gold set construction (88 sources)
`ml/scripts/train/train_header_model.py`	Header branch training
`ml/scripts/train/train_value_model.py`	Values branch training
`ml/scripts/eval/evaluate.py`	CLI metric for evaluation
`ml/data/gold_set.json`	2,231 labelled columns