# Automatic column detection

> Status: Active
> Audience: Team, AI agents, curious developers
> Purpose: Conceptual overview of the ML detection system and its current
> state

## What it does

You import a forest inventory CSV file into Niamoto. Instead of manually configuring each column ("this is a diameter, this is a species, these are coordinates"), Niamoto **automatically detects** the content and proposes a complete dashboard: diameter histogram, distribution map, breakdown by family.

You only need to adjust the proposal where necessary.

## Why it is necessary

Every team names its columns differently:

| What it is | French Guiana | France IFN | FIA (US) | Spain | Anonymous |
|------------|---------------|------------|----------|-------|-----------|
| Diameter | `diam` | `C13` | `DIA` | `dap` | `X1` |
| Height | `haut` | `HTOT` | `HT` | `altura` | `col_2` |
| Species | `espece` | `ESPAR` | `SPCD` | `especie` | `X5` |
| Latitude | `lat` | `YL` | `LAT` | `latitud` | `col_3` |

Without automatic detection, every user must manually configure their columns before being able to visualise anything. This is a barrier to adoption.

## How it works

The system detects the **role** of each column — that is, what can be done with it:

| Detected role | What Niamoto proposes |
|--------------|----------------------|
| Numeric measurement | Histogram, statistical summary, scatter plot |
| Taxonomy | Breakdown by family/genus, sunburst |
| Geographic coordinates | Interactive map |
| Temporal data | Timeline, year filter |
| Category | Bar chart, donut chart |
| Identifier | Join key between tables |
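
As an illustration, the table above can be read as a simple lookup from a detected role to candidate widgets. The sketch below is hypothetical: the role keys, the widget names and the `suggest_widgets` helper are placeholders chosen for this example, not Niamoto's actual identifiers.

```python
# Hypothetical lookup mirroring the table above; keys and widget names are
# illustrative placeholders, not Niamoto's real identifiers.
ROLE_WIDGETS: dict[str, list[str]] = {
    "numeric_measurement": ["histogram", "statistical_summary", "scatter_plot"],
    "taxonomy": ["family_genus_breakdown", "sunburst"],
    "geographic_coordinates": ["interactive_map"],
    "temporal": ["timeline", "year_filter"],
    "category": ["bar_chart", "donut_chart"],
    "identifier": [],  # used as a join key between tables, so no widget
}


def suggest_widgets(role: str) -> list[str]:
    """Return the candidate widgets for a detected role (empty if unknown)."""
    return list(ROLE_WIDGETS.get(role, []))


print(suggest_widgets("taxonomy"))  # ['family_genus_breakdown', 'sunburst']
```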

Two complementary signals are combined to achieve this:

1. **The column name**: `diametre` and `diametro` share most of their character sequences, so a character n-gram model places them close together, even across related languages.

2. **The values**: a diameter typically follows a roughly log-normal distribution between 5 and 300, latitudes lie between -90 and 90 (longitudes between -180 and 180), and a species name follows the "Genus species" format. When the column name is anonymous (`X1`), the values take over.

Both are fused into a final prediction. The user can then refine each transformer/widget pair in the GUI.
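
To make the two signals concrete, here is a minimal standard-library sketch. It is not the trained model: it only shows why character n-grams bring `diametre` and `diametro` together, and how simple value checks can hint at a role. All function names and the regular expression are assumptions made for this example.

```python
import re


def char_ngrams(name: str, n: int = 3) -> set[str]:
    """Return the character n-grams of a lower-cased, space-padded name."""
    padded = f" {name.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str) -> float:
    """Jaccard overlap between the character n-gram sets of two names."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# "Genus species": capitalised genus followed by a lower-case epithet.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z-]+$")


def fraction_binomial(values: list[str]) -> float:
    """Share of values that look like a "Genus species" name."""
    return sum(bool(BINOMIAL.match(v)) for v in values) / len(values)


def fraction_in_range(values: list[float], low: float, high: float) -> float:
    """Share of numeric values that fall inside [low, high]."""
    return sum(low <= v <= high for v in values) / len(values)


print(ngram_similarity("diametre", "diametro"))  # 0.6
print(ngram_similarity("diametre", "espece"))    # 0.0
print(fraction_binomial(["Araucaria columnaris", "Agathis ovata", "42"]))
print(fraction_in_range([-22.3, -21.5, 166.4], -90, 90))
```

In the real system these hints are learned from labelled columns rather than hand-written, but the intuition is the same: similar names overlap in n-grams, and value patterns remain available when the name carries no information.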

## Training data

The model is trained on **2,540 labelled columns** from:

- **104 real datasets**: IFN France, FIA US, GBIF variants, GUYADIV, tested
  Niamoto instances, TAXREF, ETS extensions, sPlotOpen, and several tropical or
  research-oriented ecological datasets
- **6 continents**, **8 languages** (EN, FR, ES, PT, DE, ID + anonymous headers)
- **61 concepts** organised into roles: taxonomy, location, measurements, environment, statistics, temporal, categories, identifiers

All detection runs locally with scikit-learn (~3 MB of dependencies). No network required, no LLM.

## Contributing

To improve detection for a poorly recognised column type:

1. **Add aliases** in `src/niamoto/core/imports/ml/column_aliases.yaml` — no ML needed, just a YAML file. Example: add `"circonference"` as an alias for `measurement.diameter` in French.

2. **Add training data** in `ml/scripts/data/build_gold_set.py` — label the columns from a new dataset and reference it in the source list.

3. **Retrain**: `uv run python -m ml.scripts.train.train_header_model && uv run python -m ml.scripts.train.train_value_model`
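
To show why adding an alias needs no retraining, here is a hedged sketch of an alias lookup. The `ALIASES` structure, `normalise` and `match_alias` are hypothetical and do not reflect the actual schema of `column_aliases.yaml` or the API of `alias_registry.py`.

```python
import unicodedata
from typing import Optional

# Hypothetical in-memory view of an alias table: concept -> known aliases.
ALIASES: dict[str, list[str]] = {
    "measurement.diameter": ["diam", "dbh", "dap", "circonference"],
    "measurement.height": ["haut", "altura", "height"],
}


def normalise(header: str) -> str:
    """Lower-case the header, strip accents and unify separators."""
    decomposed = unicodedata.normalize("NFKD", header)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.strip().lower().replace("-", "_").replace(" ", "_")


# Reverse index: normalised alias -> concept.
LOOKUP = {
    normalise(alias): concept
    for concept, aliases in ALIASES.items()
    for alias in aliases
}


def match_alias(header: str) -> Optional[str]:
    """Return the concept for a header, or None if no alias matches."""
    return LOOKUP.get(normalise(header))


print(match_alias("Circonférence"))  # measurement.diameter
print(match_alias("X1"))             # None
```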

## Current scores

| Model | Macro-F1 | What this means |
|-------|----------|-----------------|
| Header (column name) | 0.77 | The strongest branch when names are informative |
| Values (statistical values) | 0.38 | Values alone remain ambiguous, but they are improving on anonymous and numeric cases |
| Fusion (header + values) | ProductScore 80.84 / GlobalScore 82.76 | Combined signal from both branches on the current offline benchmark |

The header score is the most important because in the majority of cases columns have informative names. The values model kicks in when the name is anonymous or ambiguous.
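
For readers unfamiliar with the metric, macro-F1 is the unweighted mean of per-class F1 scores, so a rare concept counts as much as a frequent one. The sketch below is a plain-Python illustration with made-up labels. It is not the project's evaluation harness.

```python
from collections import Counter


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up example: 8 diameter columns, 2 latitude columns, one latitude missed.
y_true = ["diameter"] * 8 + ["latitude"] * 2
y_pred = ["diameter"] * 8 + ["diameter", "latitude"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.804
```

Accuracy on this toy example is 0.9, yet macro-F1 is about 0.80 because the single error falls on the rare class.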

## Known limitations

- Very rare columns (< 5 examples in the gold set) are grouped under generic categories
- Confidence calibration is not yet in place — the model cannot yet say "I am 85% confident"
- The values model remains weak at distinguishing two measurement types from each other (diameter vs height). This is not blocking, since the "measurement" role is sufficient to suggest a histogram.
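
The grouping of rare columns mentioned in the first limitation can be sketched as follows. This is an assumption-laden illustration: it supposes labels of the form `role.concept` (as in `measurement.diameter`) and takes the part before the dot as the generic category. The actual grouping rules may differ, and `measurement.bark_thickness` is a made-up label.

```python
from collections import Counter

MIN_EXAMPLES = 5  # threshold quoted in the list above


def collapse_rare_labels(labels: list[str], min_examples: int = MIN_EXAMPLES) -> list[str]:
    """Replace labels seen fewer than `min_examples` times by their role prefix."""
    counts = Counter(labels)
    return [
        label if counts[label] >= min_examples else label.split(".", 1)[0]
        for label in labels
    ]


labels = ["measurement.diameter"] * 6 + ["measurement.bark_thickness"] * 2
print(Counter(collapse_rare_labels(labels)))
```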

## Technical architecture

```
Imported CSV
     │
     ├── Column name ──→ TF-IDF char n-grams ──→ LogisticRegression
     │                                                   │
     ├── Values ──→ 37 statistical features ──→ HistGradientBoosting
     │                                                   │
     └── Fusion ──→ LogReg calibrated on probabilities from both branches
                          │
                   Detected role + confidence
                          │
                   Suggested transformer/widget pairs
```

## Academic references

| Project | Year | Approach | Features | Performance | Ecological Relevance | Status |
|---------|------|----------|----------|-------------|---------------------|--------|
| **Sherlock** | 2019 | Deep NN | 1,588 | F1: 0.89 | Low (generic types) | Abandoned |
| **Sato** | 2020 | Hybrid DL + Topic | 1,588+ | F1: 0.92 | Low | Inactive |
| **Pythagoras** | 2024 | GNN | Graph-based | F1: 0.94 | Medium (numeric) | Active |
| **GAIT** | 2024 | GNN variants | Multi-graph | F1: 0.93 | Medium | Active |
| **GitTables** | 2023 | Dataset | N/A | Benchmark | High (diverse) | Active |

Niamoto's approach differs from these academic systems: it uses a lightweight hybrid pipeline (TF-IDF + HistGradientBoosting + Fusion) optimized for ecological data, running fully offline with scikit-learn (~3 MB).

## Key files

| File | Purpose |
|------|---------|
| `src/niamoto/core/imports/ml/alias_registry.py` | Name → concept matching via multilingual aliases |
| `src/niamoto/core/imports/ml/column_aliases.yaml` | 25 concepts × 8 languages |
| `ml/scripts/eval/evaluation.py` | Evaluation harness (GroupKFold, holdouts) |
| `ml/scripts/data/concept_taxonomy.py` | Merges 111 fine-grained concepts into 61 concepts |
| `src/niamoto/core/imports/profiler.py` | DataProfiler with `ml_mode=auto/off/force` |
| `ml/scripts/data/build_gold_set.py` | Gold set construction (88 sources) |
| `ml/scripts/train/train_header_model.py` | Header branch training |
| `ml/scripts/train/train_value_model.py` | Values branch training |
| `ml/scripts/eval/evaluate.py` | Command-line entry point for computing evaluation metrics |
| `ml/data/gold_set.json` | 2,231 labelled columns |