Automatic column detection

Status: Active
Audience: Team, AI agents, curious developers
Purpose: Conceptual overview of the ML detection system and its current state

What it does

You import a forest inventory CSV file into Niamoto. Instead of manually configuring each column (“this is a diameter, this is a species, this is coordinates”), Niamoto automatically detects the content and proposes a complete dashboard: diameter histogram, distribution map, breakdown by family.

You only need to adjust the proposal where it does not fit your data.

Why it is necessary

Every team names its columns differently:

| What it is | French Guiana | France IFN | FIA (US) | Spain | Anonymous |
|---|---|---|---|---|---|
| Diameter | diam | C13 | DIA | dap | X1 |
| Height | haut | HTOT | HT | altura | col_2 |
| Species | espece | ESPAR | SPCD | especie | X5 |
| Latitude | lat | YL | LAT | latitud | col_3 |

Without automatic detection, every user must manually configure their columns before being able to visualise anything. This is a barrier to adoption.

How it works

The system detects the role of each column — that is, what can be done with it:

| Detected role | What Niamoto proposes |
|---|---|
| Numeric measurement | Histogram, statistical summary, scatter plot |
| Taxonomy | Breakdown by family/genus, sunburst |
| Geographic coordinates | Interactive map |
| Temporal data | Timeline, year filter |
| Category | Bar chart, donut chart |
| Identifier | Join key between tables |

Two complementary signals are combined to achieve this:

  1. The column name: diametre and diametro share the same letter sequences. A character n-gram model naturally groups them together, even across related languages.

  2. The values: a diameter follows a log-normal distribution between 5 and 300, latitudes lie between -90 and 90 (longitudes between -180 and 180), and a species name follows the “Genus species” format. When the column name is anonymous (X1), the values take over.

Both are fused into a final prediction. The user can then refine each transformer/widget pair in the GUI.
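To make the first signal concrete, here is a minimal sketch of character n-gram similarity between headers. It is an illustration only: it uses raw n-gram counts and cosine similarity from the standard library, whereas the header branch described in this document uses TF-IDF weighting with scikit-learn. The function names and the n-gram range (2 to 4) are assumptions, not taken from the Niamoto code.

```python
from collections import Counter
from math import sqrt


def char_ngrams(name: str, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count character n-grams of a lower-cased header padded with spaces."""
    padded = f" {name.lower()} "
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity(x: str, y: str) -> float:
    """Similarity of two column names based on shared letter sequences."""
    return cosine(char_ngrams(x), char_ngrams(y))


if __name__ == "__main__":
    # Related spellings share most n-grams; unrelated names share few or none.
    print(similarity("diametre", "diametro"))
    print(similarity("diametre", "altura"))
```

Because the comparison works on letter sequences rather than whole words, no dictionary of known headers is needed for close spellings to land near each other.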

Training data

The model is trained on 2,540 labelled columns from:

  • 104 real datasets: IFN France, FIA US, GBIF variants, GUYADIV, tested Niamoto instances, TAXREF, ETS extensions, sPlotOpen, and several tropical or research-oriented ecological datasets

  • 6 continents, 8 languages (EN, FR, ES, PT, DE, ID + anonymous headers)

  • 61 concepts organised into roles: taxonomy, location, measurements, environment, statistics, temporal, categories, identifiers

All detection runs locally with scikit-learn (~3 MB of dependencies). No network required, no LLM.

Contributing

To improve detection for a poorly recognised column type:

  1. Add aliases in src/niamoto/core/imports/ml/column_aliases.yaml — no ML needed, just a YAML file. Example: add "circonference" as an alias for measurement.diameter in French.

  2. Add training data in ml/scripts/data/build_gold_set.py — label the columns from a new dataset and reference it in the source list.

  3. Retrain: uv run python -m ml.scripts.train.train_header_model && uv run python -m ml.scripts.train.train_value_model
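Step 1 above relies on alias matching. The sketch below shows how such a lookup might work, assuming headers are normalised (lower-cased, accents stripped) before comparison. The schema of column_aliases.yaml and the API of alias_registry.py are not shown in this document, so the ALIASES dictionary, the concept name measurement.height and every function name here are hypothetical.

```python
import unicodedata

# Hypothetical alias table. The real entries live in column_aliases.yaml,
# whose schema may differ from this flat mapping.
ALIASES = {
    "measurement.diameter": ["diam", "dia", "dap", "diametre", "circonference"],
    "measurement.height": ["haut", "htot", "ht", "altura"],  # hypothetical concept name
}


def normalise(header: str) -> str:
    """Lower-case, strip accents and unify separators."""
    decomposed = unicodedata.normalize("NFKD", header)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.strip().lower().replace("-", "_").replace(" ", "_")


def build_index(aliases: dict) -> dict:
    """Invert the concept -> aliases table into a normalised alias -> concept index."""
    index = {}
    for concept, names in aliases.items():
        for name in names:
            index[normalise(name)] = concept
    return index


def match_alias(header: str, index: dict):
    """Return the concept for a header, or None when no alias matches."""
    return index.get(normalise(header))


INDEX = build_index(ALIASES)

if __name__ == "__main__":
    print(match_alias("Circonférence", INDEX))
    print(match_alias("X1", INDEX))
```

An exact lookup of this kind only covers headers that someone has already listed, which is why anonymous names such as X1 fall through to the ML branches.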

Current scores

| Model | Score (macro-F1 unless noted) | What this means |
|---|---|---|
| Header (column name) | 0.77 | The strongest branch when names are informative |
| Values (statistical values) | 0.38 | Values alone remain ambiguous, but they are improving on anonymous and numeric cases |
| Fusion (header + values) | ProductScore 80.84 / GlobalScore 82.76 | Combined signal from both branches on the current offline benchmark |

The header score is the most important because in the majority of cases columns have informative names. The values model kicks in when the name is anonymous or ambiguous.
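For readers unfamiliar with the metric, the sketch below computes macro-F1 from its standard definition: the unweighted mean of per-class F1 scores. It is not the project's evaluation harness, and the labels are made up for illustration. The example shows why macro-F1 can be much lower than accuracy when a rare class is missed entirely.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in truth or prediction."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative labels: 8 diameter columns, 2 height columns,
    # and a model that predicts "diameter" for everything.
    truth = ["diameter"] * 8 + ["height"] * 2
    predicted = ["diameter"] * 10
    accuracy = sum(t == p for t, p in zip(truth, predicted)) / len(truth)
    print(accuracy, macro_f1(truth, predicted))
```

Here accuracy is 0.8 while macro-F1 is 4/9 (about 0.44), because the "height" class contributes an F1 of 0 and counts as much as the frequent class.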

Known limitations

  • Very rare columns (< 5 examples in the gold set) are grouped under generic categories

  • Confidence calibration is not yet in place — the model cannot yet say “I am 85% confident”

  • The values model remains weak at distinguishing two measurement types from each other (diameter vs height) — but this is not blocking since the “measurement” role is sufficient to suggest a histogram
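The first limitation above can be pictured as a preprocessing step on the training labels. This is a sketch under assumptions: the role.concept naming follows the measurement.diameter example earlier in this document, but the ".other" suffix, the function name and the example label are hypothetical, and the actual merging logic in concept_taxonomy.py may work differently.

```python
from collections import Counter

MIN_EXAMPLES = 5  # threshold quoted in the limitation above


def collapse_rare(labels, min_examples=MIN_EXAMPLES):
    """Replace labels with fewer than min_examples occurrences by '<role>.other'."""
    counts = Counter(labels)
    collapsed = []
    for label in labels:
        if counts[label] < min_examples:
            role = label.split(".", 1)[0]
            collapsed.append(f"{role}.other")  # hypothetical generic category name
        else:
            collapsed.append(label)
    return collapsed


if __name__ == "__main__":
    # "measurement.bark_thickness" is a made-up rare concept for illustration.
    labels = ["measurement.diameter"] * 6 + ["measurement.bark_thickness"] * 2
    print(Counter(collapse_rare(labels)))
```

Collapsing keeps the role intact, so a rare measurement still triggers the same widget suggestions as a common one.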

Technical architecture

Imported CSV
     │
     ├── Column name ──→ TF-IDF char n-grams ──→ LogisticRegression
     │                                                   │
     ├── Values ──→ 37 statistical features ──→ HistGradientBoosting
     │                                                   │
     └── Fusion ──→ LogReg calibrated on probabilities from both branches
                          │
                   Detected role + confidence
                          │
                   Suggested transformer/widget pairs
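To give a feel for the values branch in the diagram, the sketch below computes a handful of statistical features from raw cell values. The document states that 37 features are used but does not list them, so the seven features here, their names and the regular expression for the “Genus species” format are all assumptions chosen for illustration.

```python
import math
import re

# Assumed pattern for a binomial name: capitalised genus, lower-case epithet.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z]+$")


def value_features(values):
    """Return a fixed set of illustrative features for one column of string cells."""
    cells = [v.strip() for v in values if v and v.strip()]
    numbers = []
    for cell in cells:
        try:
            number = float(cell)
        except ValueError:
            continue
        if math.isfinite(number):
            numbers.append(number)

    n = len(cells)
    features = {
        "numeric_ratio": len(numbers) / n if n else 0.0,
        "binomial_ratio": sum(bool(BINOMIAL.match(c)) for c in cells) / n if n else 0.0,
        "unique_ratio": len(set(cells)) / n if n else 0.0,
        "min": min(numbers) if numbers else 0.0,
        "max": max(numbers) if numbers else 0.0,
        "in_latitude_range": float(bool(numbers) and all(-90 <= x <= 90 for x in numbers)),
        "log_mean": 0.0,
    }
    # Mean of log-values, relevant for roughly log-normal measurements.
    positives = [x for x in numbers if x > 0]
    if positives:
        features["log_mean"] = sum(math.log(x) for x in positives) / len(positives)
    return features


if __name__ == "__main__":
    print(value_features(["12.5", "30", "48.1"]))
    print(value_features(["Araucaria columnaris", "Agathis lanceolata"]))
```

A feature vector of this kind never looks at the header, which is what lets the values branch produce a prediction for columns named X1 or col_2.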

Academic references

| Project | Year | Approach | Features | Performance | Ecological relevance | Status |
|---|---|---|---|---|---|---|
| Sherlock | 2019 | Deep NN | 1,588 | F1: 0.89 | Low (generic types) | Abandoned |
| Sato | 2020 | Hybrid DL + Topic | 1,588+ | F1: 0.92 | Low | Inactive |
| Pythagoras | 2024 | GNN | Graph-based | F1: 0.94 | Medium (numeric) | Active |
| GAIT | 2024 | GNN variants | Multi-graph | F1: 0.93 | Medium | Active |
| GitTables | 2023 | Dataset | N/A | Benchmark | High (diverse) | Active |

Niamoto’s approach differs from these academic systems: it uses a lightweight hybrid pipeline (TF-IDF + HistGradientBoosting + fusion) optimised for ecological data, running fully offline with scikit-learn (~3 MB).

Key files

| File | Purpose |
|---|---|
| src/niamoto/core/imports/ml/alias_registry.py | Name → concept matching via multilingual aliases |
| src/niamoto/core/imports/ml/column_aliases.yaml | 25 concepts × 8 languages |
| ml/scripts/eval/evaluation.py | Evaluation harness (GroupKFold, holdouts) |
| ml/scripts/data/concept_taxonomy.py | Fusion of 111 fine concepts → 61 concepts |
| src/niamoto/core/imports/profiler.py | DataProfiler with ml_mode=auto/off/force |
| ml/scripts/data/build_gold_set.py | Gold set construction (88 sources) |
| ml/scripts/train/train_header_model.py | Header branch training |
| ml/scripts/train/train_value_model.py | Values branch training |
| ml/scripts/eval/evaluate.py | CLI metric for evaluation |
| ml/data/gold_set.json | 2,231 labelled columns |