Automatic column detection
Status: Active
Audience: Team, AI agents, curious developers
Purpose: Conceptual overview of the ML detection system and its current state
What it does
You import a forest inventory CSV file into Niamoto. Instead of manually configuring each column (“this is a diameter, this is a species, this is coordinates”), Niamoto automatically detects the content and proposes a complete dashboard: diameter histogram, distribution map, breakdown by family.
You only need to adjust the proposal where necessary.
Why it is necessary
Every team names its columns differently:
| What it is | French Guiana | France IFN | FIA (US) | Spain | Anonymous |
|---|---|---|---|---|---|
| Diameter | | | | | |
| Height | | | | | |
| Species | | | | | |
| Latitude | | | | | |
Without automatic detection, every user must manually configure their columns before being able to visualise anything. This is a barrier to adoption.
How it works
The system detects the role of each column — that is, what can be done with it:
| Detected role | What Niamoto proposes |
|---|---|
| Numeric measurement | Histogram, statistical summary, scatter plot |
| Taxonomy | Breakdown by family/genus, sunburst |
| Geographic coordinates | Interactive map |
| Temporal data | Timeline, year filter |
| Category | Bar chart, donut chart |
| Identifier | Join key between tables |
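The role-to-proposal table above can be read as a simple lookup. The sketch below is only an illustration: the role keys and the `suggest` helper are hypothetical names, not Niamoto's actual API; the widget labels are copied from the table.

```python
# Hypothetical role identifiers mapped to the proposals listed in the table.
SUGGESTIONS = {
    "numeric_measurement": ["histogram", "statistical summary", "scatter plot"],
    "taxonomy": ["breakdown by family/genus", "sunburst"],
    "geographic_coordinates": ["interactive map"],
    "temporal": ["timeline", "year filter"],
    "category": ["bar chart", "donut chart"],
    "identifier": ["join key between tables"],
}


def suggest(role: str) -> list[str]:
    """Return the proposals for a detected role, or an empty list if unknown."""
    return SUGGESTIONS.get(role, [])


# Example: roles detected for three columns of an imported CSV (made-up names).
detected = {"dbh": "numeric_measurement", "espece": "taxonomy", "lat": "geographic_coordinates"}
dashboard = {column: suggest(role) for column, role in detected.items()}
```

An unknown role yields an empty list, so the dashboard simply proposes nothing for that column rather than failing.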
Two complementary signals are combined to achieve this:
- The column name: `diametre` and `diametro` share most of their letter sequences. A character n-gram model naturally groups them together, even across related languages.
- The values: a diameter follows a log-normal distribution between 5 and 300, latitudes lie between -90 and 90, and a species name follows the “Genus species” format. When the column name is anonymous (`X1`), the values take over.

Both are fused into a final prediction. The user can then refine each transformer/widget pair in the GUI.
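To make the two signals concrete, here is a minimal standard-library sketch. It uses plain character-trigram overlap (Jaccard) rather than the TF-IDF weighting of the real header branch, and three toy value signals rather than the full feature set; the function names are hypothetical.

```python
import re


def char_ngrams(name: str, n: int = 3) -> set[str]:
    """Set of character n-grams of a lower-cased column name."""
    name = name.lower()
    return {name[i:i + n] for i in range(len(name) - n + 1)}


def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity between the n-gram sets of two names."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


# "Genus species": one capitalised word followed by one lower-case word.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z]+$")


def value_signals(values: list) -> dict[str, float]:
    """A few illustrative signals computed from the column values alone."""
    numbers = [v for v in values if isinstance(v, (int, float))]
    texts = [v for v in values if isinstance(v, str)]
    total = len(values) or 1
    return {
        "frac_numeric": len(numbers) / total,
        "frac_in_latitude_range": sum(-90 <= v <= 90 for v in numbers) / total,
        "frac_binomial": sum(bool(BINOMIAL.match(t)) for t in texts) / total,
    }


print(round(ngram_overlap("diametre", "diametro"), 3))  # 0.714 (5 shared trigrams out of 7)
print(ngram_overlap("diametre", "height"))              # 0.0
```

The two spellings share five of seven distinct trigrams, while an unrelated name shares none. For an anonymous header such as `X1`, only `value_signals` carries information.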
Training data
The model is trained on 2,540 labelled columns from:
- 104 real datasets: IFN France, FIA US, GBIF variants, GUYADIV, tested Niamoto instances, TAXREF, ETS extensions, sPlotOpen, and several tropical or research-oriented ecological datasets
- 6 continents, 8 languages (EN, FR, ES, PT, DE, ID + anonymous headers)
- 61 concepts organised into roles: taxonomy, location, measurements, environment, statistics, temporal, categories, identifiers
All detection runs locally with scikit-learn (~3 MB of dependencies). No network required, no LLM.
Contributing
To improve detection for a poorly recognised column type:
1. Add aliases in `src/niamoto/core/imports/ml/column_aliases.yaml`: no ML needed, just a YAML file. Example: add `"circonference"` as an alias for `measurement.diameter` in French.
2. Add training data in `ml/scripts/data/build_gold_set.py`: label the columns from a new dataset and reference it in the source list.
3. Retrain: `uv run python -m ml.scripts.train.train_header_model && uv run python -m ml.scripts.train.train_value_model`
Current scores
| Model | Macro-F1 | What this means |
|---|---|---|
| Header (column name) | 0.77 | The strongest branch when names are informative |
| Values (statistical values) | 0.38 | Values alone remain ambiguous, but they are improving on anonymous and numeric cases |
| Fusion (header + values) | ProductScore 80.84 / GlobalScore 82.76 | Combined signal from both branches on the current offline benchmark |
The header score is the most important because in the majority of cases columns have informative names. The values model kicks in when the name is anonymous or ambiguous.
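For readers unfamiliar with the metric, macro-F1 is the unweighted mean of the per-class F1 scores, so a rare role counts as much as a frequent one. The sketch below computes it by hand on a made-up five-column example; scikit-learn's `f1_score(..., average="macro")` computes the same quantity.

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: the single "location" column is missed entirely.
y_true = ["measurement", "measurement", "measurement", "taxonomy", "location"]
y_pred = ["measurement", "measurement", "taxonomy", "taxonomy", "measurement"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                           # 0.6
print(round(macro_f1(y_true, y_pred), 3)) # 0.444
```

Accuracy is 0.6 here, but macro-F1 drops to about 0.44 because the missed `location` class contributes a zero. This is why macro-F1 is a stricter measure when some concepts have few examples.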
Known limitations
- Very rare columns (< 5 examples in the gold set) are grouped under generic categories
- Confidence calibration is not yet in place: the model cannot yet say “I am 85% confident”
- The values model remains weak at distinguishing two measurement types from each other (diameter vs height), but this is not blocking since the “measurement” role is sufficient to suggest a histogram
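The first limitation can be pictured as a label-merging step applied before training. The following is a sketch under assumptions: the `role.other` fallback name and the `measurement.crown_width` label are hypothetical, and the real grouping rules may be hand-curated rather than count-based.

```python
from collections import Counter

MIN_EXAMPLES = 5  # threshold taken from the limitation stated above


def group_rare(labels: list[str], min_examples: int = MIN_EXAMPLES) -> list[str]:
    """Replace labels seen fewer than `min_examples` times by '<role>.other'."""
    counts = Counter(labels)
    grouped = []
    for label in labels:
        if counts[label] >= min_examples:
            grouped.append(label)
        else:
            role = label.split(".", 1)[0]  # 'measurement.crown_width' -> 'measurement'
            grouped.append(f"{role}.other")
    return grouped


labels = ["measurement.diameter"] * 6 + ["measurement.crown_width"] * 2
print(Counter(group_rare(labels)))
```

The rare label keeps its role, so the column can still receive role-level suggestions such as a histogram even though its fine-grained concept is lost.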
Technical architecture
Imported CSV
│
├── Column name ──→ TF-IDF char n-grams ──→ LogisticRegression
│ │
├── Values ──→ 37 statistical features ──→ HistGradientBoosting
│ │
└── Fusion ──→ LogReg calibrated on probabilities from both branches
│
Detected role + confidence
│
Suggested transformer/widget pairs
Academic References
| Project | Year | Approach | Features | Performance | Ecological Relevance | Status |
|---|---|---|---|---|---|---|
| Sherlock | 2019 | Deep NN | 1,588 | F1: 0.89 | Low (generic types) | Abandoned |
| Sato | 2020 | Hybrid DL + Topic | 1,588+ | F1: 0.92 | Low | Inactive |
| Pythagoras | 2024 | GNN | Graph-based | F1: 0.94 | Medium (numeric) | Active |
| GAIT | 2024 | GNN variants | Multi-graph | F1: 0.93 | Medium | Active |
| GitTables | 2023 | Dataset | N/A | Benchmark | High (diverse) | Active |
Niamoto’s approach differs from these academic systems: it uses a lightweight hybrid pipeline (TF-IDF header model + HistGradientBoosting values model + fusion) optimised for ecological data, running fully offline with scikit-learn (~3 MB).
Key files
| File | Purpose |
|---|---|
| | Name → concept matching via multilingual aliases |
| | 25 concepts × 8 languages |
| | Evaluation harness (GroupKFold, holdouts) |
| | Fusion of 111 fine concepts → 61 concepts |
| | DataProfiler with |
| | Gold set construction (88 sources) |
| | Header branch training |
| | Values branch training |
| | CLI metric for evaluation |
| | 2,231 labelled columns |