Automatic Semantic Column Detection via Machine Learning

Niamoto — Generic Ecological Data Platform

ProductScore: 80.84
GlobalScore: 82.76
EvalSuite Concept: 84.7%
EvalSuite Role: 90.4%
Labeled Columns: 2,540
Concepts: 61
Header F1: 0.7530
Values F1: 0.3780
Fusion F1: 0.6388
Frozen Acceptance: 80.0%
Eval Datasets: 9

Latest retrain snapshot — 2026-03-21 19:40

After the acquisition wave (`TAXREF v18`, `ETS`, `sPlotOpen`) and full retrain, the dashboard now shows both benchmark families: historical holdouts (`ProductScore 80.84`, `GlobalScore 82.76`) and product evaluation (`EvalSuite 84.7% concept`, `90.4% role`). Product datasets are strong (`niamoto-nc 91.2%`, `niamoto-gb 100%`, `guyadiv 83.6%`), while the frozen coded-inventory benchmark `acceptance-fia-or` remains the main weakness at 63.6% concept.

01 — Pipeline Architecture

Hybrid 5-layer pipeline: exact aliases → header classification → values classification → fusion → semantic product projection. The goal is not academic perfection on fine-grained concepts, but correct auto-configuration of an import.

Unknown Column

1. EXACT ALIASES

High-precision fast-path (multilingual YAML)
If the name matches a known alias → immediate detection
↓ otherwise

2. HEADER

TF-IDF char n-grams + L1 LogReg
Analyzes the column name
F1 = 0.7530

3. VALUES

37 features + HistGradientBoosting
Statistically analyzes values
F1 = 0.3780

4. FUSION

LogReg + meta-features (disagreement, confidence, coded headers, cross-rank)
Combines probabilities, corrects inter-branch biases
F1 = 0.6388

5. SEMANTIC PROJECTION

role → concept → affordances → transformer/widget suggestions
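The cascade above can be condensed into a short sketch. The alias registry, the two branch classifiers, and the fusion step are passed in as plain callables; they are hypothetical stand-ins for Niamoto's real components, not its actual API:

```python
def detect_column(name, values, alias_registry, header_clf, values_clf, fusion_clf):
    """Resolve a column to (concept, confidence) through the five-layer cascade."""
    # 1. Exact aliases: O(1) lookup, confidence 1.0, bypasses the classifiers.
    concept = alias_registry.get(name.strip().lower())
    if concept is not None:
        return concept, 1.0
    # 2-3. Otherwise both ML branches run and emit per-concept probabilities.
    p_header = header_clf(name)      # dict: concept -> probability
    p_values = values_clf(values)    # dict: concept -> probability
    # 4. Fusion combines the two probability vectors (plus meta-features).
    return fusion_clf(p_header, p_values)
```

Layer 5 (semantic projection) then maps the winning concept to affordances and widget suggestions; it is omitted here because it is a lookup, not a classifier.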

Exact Aliases — High-precision fast-path

File: column_aliases.yaml — 25 concepts × 8 languages (en, fr, es, pt, de, id, dwc, la)
Exact matching: normalization → O(1) lookup. If found → confidence 1.0, bypass classifier.
Fuzzy matching: Levenshtein (rapidfuzz) ratio ≥ 80% if no exact match.
Safeguard: an ambiguous alias must be disabled. An exact alias must not create false positives at confidence 1.0.
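A minimal sketch of this fast-path, assuming a flat `normalized alias → concept` registry. The real pipeline uses rapidfuzz for the fuzzy fallback; the stdlib `difflib` ratio here is only a stand-in, and the normalization rule is illustrative:

```python
import difflib
import re

def normalize(header: str) -> str:
    """Lowercase and collapse separators so 'DD-Lat' and 'dd_lat' match."""
    return re.sub(r"[\s\-./]+", "_", header.strip().lower())

def resolve_alias(header, registry, fuzzy_threshold=0.80):
    """Exact O(1) lookup first; fuzzy ratio fallback only if no exact hit."""
    key = normalize(header)
    if key in registry:                       # exact match -> confidence 1.0
        return registry[key], 1.0
    best, best_ratio = None, 0.0
    for alias, concept in registry.items():   # fuzzy fallback, ratio >= 80%
        ratio = difflib.SequenceMatcher(None, key, alias).ratio()
        if ratio >= fuzzy_threshold and ratio > best_ratio:
            best, best_ratio = concept, ratio
    return (best, best_ratio) if best else (None, 0.0)
```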

Header Model — Hyperparameters

analyzer: char_wb — character n-grams respecting word boundaries
ngram_range: (2, 5) — character chunks of 2 to 5 (e.g., "di", "dia", "diam", "diame")
max_features: 5,000 — keeps the 5,000 most discriminative n-grams
sublinear_tf: True — uses log(1 + frequency) instead of raw frequency
C: 130.0 — inverse regularization strength (high value = confident model)
penalty: L1 (Lasso) — zeros out useless n-grams, automatic feature selection
solver: saga — fast algorithm compatible with L1
class_weight: balanced — rare concepts weigh as much as frequent ones
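These hyperparameters map directly onto a scikit-learn pipeline. A minimal sketch, trained on a toy six-column sample rather than the real gold set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the 2,540-column gold set.
header_clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5),
                    max_features=5000, sublinear_tf=True),
    LogisticRegression(C=130.0, penalty="l1", solver="saga",
                       class_weight="balanced", max_iter=5000),
)

X = ["diameter_cm", "diametre", "dbh", "species", "espece", "scientific_name"]
y = ["measurement.diameter"] * 3 + ["taxonomy.species"] * 3
header_clf.fit(X, y)
```

Because the features are character n-grams, the unseen Spanish header "diametro" shares most of its n-grams with the French and English diameter columns, so even this toy model classifies it correctly.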

Values Model — 37 statistical features

Numeric statistics (14): mean, std, min, max, skew, kurtosis, Q25, Q50, Q75, range, CV, neg_ratio, int_ratio, zero_ratio
Distribution (3): unique_ratio, null_ratio, entropy
Characters (6): mean_length, std_length, digit_ratio, alpha_ratio, space_ratio, mean_word_count
Regex patterns (4): pct_date_iso, pct_coordinate, pct_boolean, pct_uuid
Biological patterns (2): binomial_score (Genus species), family_suffix (aceae, idae, ales)
Numeric domains (6): mean_decimals, in_lat_range, in_lon_range, in_01_range, small_int_ratio, pct_starts_upper
Meta (2): is_numeric, n_values

Classifier: HistGradientBoostingClassifier(max_iter=500, max_depth=10, learning_rate=0.05, class_weight="balanced")
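A hand-picked subset of these features can be sketched as follows. This is illustrative only: the real extractor computes all 37, and the helper `value_features` is a hypothetical name:

```python
import re
import statistics

BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z]+$")   # "Genus species" pattern

def value_features(values):
    """Compute a few of the 37 value features for one column sample."""
    n = len(values)
    strings = [str(v) for v in values]
    nums = []
    for v in values:
        try:
            nums.append(float(v))
        except (TypeError, ValueError):
            pass
    is_numeric = n > 0 and len(nums) == n
    return {
        "n_values": n,
        "is_numeric": float(is_numeric),
        "unique_ratio": len(set(strings)) / n if n else 0.0,
        "mean": statistics.fmean(nums) if nums else 0.0,
        "in_lat_range": (sum(-90 <= x <= 90 for x in nums) / n
                         if is_numeric else 0.0),
        "binomial_score": (sum(bool(BINOMIAL.match(s)) for s in strings) / n
                           if n else 0.0),
    }
```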

Fusion Model — Intelligent branch combination

Header input: 61 probabilities (one per concept) from the header model
Values input: 61 probabilities from the values model
Basic meta (3): is_anonymous, null_ratio, unique_ratio
Enriched meta: max_proba, top1-top2 margin, entropy, header/values agreement/disagreement, statistic.count flags, code-like header detection
Cross-rank reciprocity (4) — NEW: header_top1_value_rank, value_top1_header_rank, top2_cross_match, both_weak
Targeted damping: if a code-like header plus a branch pushes statistic.count → confidence is damped before fusion
Classifier: LogisticRegression(C=1.0, solver="lbfgs", class_weight="balanced")

Cross-rank reciprocity features capture how well the two branches agree across their top predictions. They were a strong gain in the previous optimization cycle; after the 2026-03-21 retrain the current baseline is ProductScore 80.84.
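The four cross-rank features can be computed directly from the two branches' probability vectors. A sketch, assuming dict-shaped probabilities keyed by concept and an arbitrary weakness threshold of 0.5:

```python
def cross_rank_features(p_header, p_values, weak_threshold=0.5):
    """Cross-rank reciprocity features from the two branch probability dicts."""
    rank_h = sorted(p_header, key=p_header.get, reverse=True)  # concepts by header proba
    rank_v = sorted(p_values, key=p_values.get, reverse=True)  # concepts by values proba
    return {
        # Where does each branch's winner land in the other branch's ranking?
        "header_top1_value_rank": rank_v.index(rank_h[0]),
        "value_top1_header_rank": rank_h.index(rank_v[0]),
        # Does either branch's top-2 contain the other branch's top-1?
        "top2_cross_match": float(rank_h[0] in rank_v[:2] or rank_v[0] in rank_h[:2]),
        # Are both branches unsure at the same time?
        "both_weak": float(max(p_header.values()) < weak_threshold
                           and max(p_values.values()) < weak_threshold),
    }
```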

How column_aliases fits into the ML system

The YAML alias file is not a separate heuristic living next to the model. It is the first stage of the same semantic detection pipeline. If a header is known exactly, Niamoto resolves it immediately. If not, it falls through to ML.

Step 01 — CSV Upload

The user uploads a CSV such as occurrences.csv or plots.csv.

Niamoto reads column names plus a sample of values.

Step 02 — Alias Registry

column_aliases.yaml is loaded into an exact-match registry.

If a header like ddlat or tax_fam is known, prediction returns immediately with high confidence.

Step 03 — ML Classifier

If no exact alias matches, the column goes to the ML stack:

header model + values model + fusion.

Step 04 — Fallback Rules

Only after aliases and ML do a few structural rules apply.

Example: WKT values like POINT(...) can still be recognized as geometry.
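The WKT fallback can be sketched as a single regex over a value sample. The rule shown here is a hypothetical reconstruction of the idea, not Niamoto's exact rule:

```python
import re

# Matches the leading keyword of common WKT geometry literals.
WKT_RE = re.compile(
    r"^\s*(POINT|LINESTRING|POLYGON|MULTIPOINT|MULTILINESTRING|MULTIPOLYGON)\s*\(",
    re.IGNORECASE,
)

def looks_like_geometry(values, min_ratio=0.8):
    """True if at least min_ratio of the sampled values look like WKT."""
    if not values:
        return False
    hits = sum(bool(WKT_RE.match(str(v))) for v in values)
    return hits / len(values) >= min_ratio
```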

Step 05 — Product Output

The final semantic type drives auto-config, affordances, widgets, and import suggestions.

This is what the user sees in smart_config and profiling.

Step 06 — Training And Evaluation Loop

The same concepts are used to build ml/data/gold_set.json and to evaluate frozen acceptance benchmarks.

So aliases affect both runtime precision and the quality of the training/evaluation loop.

Why we edited aliases before touching the model

  • If a header is stable and known, exact aliases are more reliable than asking ML to rediscover it every time.
  • This is especially true for semi-standardized exports like tax_fam, ddlon, or nbe_stem.
  • Improving aliases is therefore an improvement to the pipeline itself, not a workaround outside it.

When aliases are not enough

  • If the header is unknown, coded, noisy, or ambiguous, the ML classifier must generalize from patterns and values.
  • If train and eval labels contradict each other, both aliases and ML become unreliable.
  • This is why benchmark hygiene and label alignment had to be fixed before adding more data.

02 — Gold Set & Data Sources

2,525 labeled columns (1,929 gold + 596 synthetic) from 94+ sources across diverse biomes — from boreal forests to tropical mangroves.

Data Sources

Source | Type | Country / Region | Continent | Language | Columns

03 — Concept Taxonomy

62 fine-grained concepts grouped into 61 coarse concepts, organized into 10 semantic roles. Grouping improves ML performance by merging rare concepts (< 5 examples).

Concept | Role | Examples | Example column names

04 — Evaluation Results

Latest eval suite snapshot after the 2026-03-21 acquisition retrain: 90.4% role, 84.7% concept on 9 datasets / 478 columns. The holdout charts below remain useful as historical diagnostics from the previous benchmark-hygiene phase.

4a — ProductScore Breakdown

Weighted composite score across 6 strategically chosen data families. Each family represents a distinct usage scenario for Niamoto.

4b — Holdouts & Diagnostics

Language holdouts: Train without any column from a language, then test only on that language. Simulates a botanist importing a CSV in a language the model has never seen.

Family holdouts: Train without an entire family of datasets, then test on that family. Reveals domains where the model generalizes poorly.

Structural diagnostics: Performance by column structural profile — standard English, field English, coded headers, GBIF core/extended.

Forest inventory sub-split: Detailed breakdown within the weakest family.

05 — Instance Evaluation

Latest real-world evaluation on actual Niamoto instances and GBIF exports after the acquisition retrain. 418 columns evaluated — 90.9% role accuracy, 85.4% concept accuracy. This excludes the two frozen acceptance datasets.

Historical detail: niamoto-nc — Alias-only vs ML (pre-acquisition snapshot)
Method | Role correct | Role % | Concept correct | Concept %
Alias only | 11/29 | 38% | 11/29 | 38%
ML pipeline | 16/29 | 55% | 14/29 | 48%

This table is kept as a historical baseline from the pre-acquisition phase. It illustrates why aliases alone were insufficient on opaque niamoto-nc headers before the 2026-03-21 retrain.

06 — Error Patterns

Error patterns updated after the acquisition wave retrain and full eval suite run on 2026-03-21.

ERROR-01 Partially fixed

Over-prediction of measurement.diameter

Fixed: diversity indices (shannon, pielou, simpson) and holdridge are now matched by alias. The basal_area → diameter merge bug in concept_taxonomy.py was corrected so that basal_area now maps to biomass.

Remaining: booleans (flower, fruit, in_um) and taxonomic counts (gymnospermae, monocotyledonae, dicotyledonae) still predicted as diameter. The values branch still has insufficient signal to distinguish boolean/count distributions from continuous measurements.

ERROR-02 Fixed

measurement.trait unrecognized → now 67%

Alias registry covers shannon, pielou, simpson, bark_thickness, leaf_ldmc, leaf_thickness. Gold set enriched with 8 trait examples from niamoto-nc. Detection: 0/12 → 8/12 (67%).

ERROR-03 Partially fixed

category.* poorly detected

Fixed: holdridge → category.vegetation via alias. strata → category.vegetation via alias. phenology, pheno → category.ecology via alias. Gold set enriched with bioclimate, phenology, stratum examples.

Remaining: booleans (in_forest, in_um, flower, fruit) still misclassified. The values branch treats True/False and 0/1 as numeric, not categorical.

ERROR-04 Partially fixed

taxonomy.name remains weak on non-binomial GBIF columns

Fixed: binomial species-name columns were real annotation issues and were aligned to taxonomy.species.

Remaining: truly non-binomial fields such as genericName, infraspecificEpithet, and scientificNameAuthorship are still systematically missed on GBIF exports. In the latest eval suite, taxonomy.name is still 0/3.

ERROR-05 Partially fixed

GBIF Darwin Core systematic errors: 9 → 5 remaining

Fixed by alias: catalogNumber → identifier.collection, basisOfRecord → category.basis, occurrenceStatus → category.status, taxonomicStatus → category.status. GBIF concept accuracy: 76% → 88-90%.

Remaining (5): acceptedTaxonKey, speciesKey (numeric keys → need identifier.taxon alias for compound names), genericName, infraspecificEpithet, scientificNameAuthorship.

ERROR-06 Open

Coded inventory generalization remains weak on frozen FIA benchmark

Current state: acceptance-fia-or reaches only 63.6% concept accuracy after retrain, while acceptance-niamoto-gb is at 100%.

Main misses: SPCD, CR, VOLCFNET, VOLBFNET, plus repeated confusions on measurement.biomass, identifier.taxon, and missing category.habitat. This is still the clearest gap between product-focused learning and out-of-train coded inventories.

07 — Training Evolution

Autoresearch-driven optimization, cross-rank reciprocity features, surrogate evaluation loop, and batch optimization.

7a — Autoresearch Progression

The autoresearch pattern

Inspired by Karpathy (2026): an agent modifies hyperparameters, evaluates, keeps improvements, rejects regressions, and loops.

Result: 24 header iterations (+41 cumulative points), 8 values iterations (+6.5 cumulative points). Each gain tracked by a git commit.
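The loop can be sketched as simple hill-climbing. The function names, mutation scheme, and parameters below are hypothetical, not the actual autoresearch harness:

```python
import random

def autoresearch(base_params, mutate, evaluate, iterations=24, seed=0):
    """Mutate -> evaluate -> keep improvements, discard regressions, loop."""
    rng = random.Random(seed)
    best_params = dict(base_params)
    best_score = evaluate(best_params)
    for _ in range(iterations):
        candidate = mutate(dict(best_params), rng)
        score = evaluate(candidate)
        if score > best_score:      # keep the gain (a git commit in the real loop)
            best_params, best_score = candidate, score
        # regressions are simply thrown away
    return best_params, best_score
```

With a scoring function such as ProductScore in place of the toy objective, each kept candidate corresponds to one committed iteration.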

7b — Cross-rank Reciprocity Features

4 new fusion features capturing branch agreement depth

Feature | Description
header_top1_value_rank | Rank of the header branch's top-1 prediction in the values branch's probability vector
value_top1_header_rank | Rank of the values branch's top-1 prediction in the header branch's probability vector
top2_cross_match | Whether either branch's top-2 contains the other branch's top-1
both_weak | Both branches have a low maximum probability
surrogate-fast: +0.92  |  surrogate-mid: +0.89  |  ProductScore: +0.79

7c — Surrogate Evaluation Loop

Multi-tier validation chain

Stage | Time | Purpose
Cache build | ~8 min | One-shot feature extraction for all columns
surrogate-fast | ~1.7 s/eval | Quick filter for unpromising changes
surrogate-mid | ~15 s/eval | GroupKFold with fewer folds
product-score | ~2 min | Full weighted family evaluation
niamoto-score | ~5 min | Complete offline score with all holdouts
Validation chain: fast → mid → product-score → niamoto-score. Each tier gates the next.

7d — Batch Optimization

Training time: 5h → 15min (20x speedup)

Batch fusion feature extraction replaced per-column computation. All feature vectors are computed in a single pass over pre-cached header and values probabilities. Training loop and cross-validation run on pre-computed matrices.

Result: identical model outputs, 20x faster iteration. Enables more autoresearch cycles per session.

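The equivalence can be illustrated with a toy meta-feature extractor: the batched version computes the same matrix as the per-column loop, in one vectorized pass. The feature set is simplified for illustration and is not the real fusion extractor:

```python
import numpy as np

def fusion_features_per_column(p_header_row, p_values_row):
    """Per-column meta-features: max proba, top1-top2 margin, branch agreement."""
    ph = np.sort(p_header_row)[::-1]
    pv = np.sort(p_values_row)[::-1]
    agree = float(np.argmax(p_header_row) == np.argmax(p_values_row))
    return np.array([ph[0], ph[0] - ph[1], pv[0], pv[0] - pv[1], agree])

def fusion_features_batch(P_header, P_values):
    """Same features for all columns at once, from pre-cached probability matrices."""
    sh = np.sort(P_header, axis=1)[:, ::-1]
    sv = np.sort(P_values, axis=1)[:, ::-1]
    agree = (P_header.argmax(axis=1) == P_values.argmax(axis=1)).astype(float)
    return np.column_stack([sh[:, 0], sh[:, 0] - sh[:, 1],
                            sv[:, 0], sv[:, 0] - sv[:, 1], agree])
```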

08 — Training & Improvement Workflow

End-to-end pipeline from raw data to evaluation. Each box is an independent step with its own script. Full guide: docs/05-ml-detection/training-guide.md

DATA SOURCES

ml/data/silver/
CSV files from inventories, GBIF, traits, etc.
+

LABEL MAPPINGS

build_gold_set.py
Column → concept mappings per dataset

GOLD SET

uv run python -m ml.scripts.data.build_gold_set
2,540 labeled columns × 61 concepts × value statistics
ml/data/gold_set.json

TRAIN HEADER

TF-IDF + LogReg
F1 = 0.7530

TRAIN VALUES

37 features + HGBT
F1 = 0.3780

TRAIN FUSION

LogReg + cross-rank
F1 = 0.6388

ALIAS REGISTRY

column_aliases.yaml
25+ concepts × 8 languages
No retraining needed
+

3 MODELS

ml/models/*.joblib
header (2.6 MB) + values (40 MB)
+ fusion (80 KB)

EVALUATION SUITE

uv run python -m ml.scripts.eval.run_eval_suite
9 datasets × 478 columns × verified annotations
84.7% concept accuracy

DIAGNOSIS & ITERATE

Concept absent → add to gold set & retrain
Column name unambiguous → add alias (no retrain)
Bad merge in taxonomy → fix concept_taxonomy.py
Wrong annotation → fix eval annotations

8a — Session Improvement Timeline (2026-03-20 → 2026-03-21)

Detail: Improvement steps
Step | Action | Concept % | Delta
V1 | Initial annotations (418 cols, 7 datasets) | 66.5% |
V2 | Fix annotations: taxonomy.name → species, plot_name context | 71.1% | +4.6
V3 | Verify annotations against actual values | 71.3% | +0.2
V4 | Expand alias registry (+13 concepts) | 76.6% | +5.3
V5 | Enrich gold set (niamoto-nc) + retrain + fix basal_area merge | 77.5% | +0.9
V6 | Acquire TAXREF v18 + ETS + sPlotOpen, rebuild gold set, full retrain | 85.4% | +7.9

09 — Methodology Justification

Each technical choice is justified by academic literature. Bracket numbers refer to the References section.

9.1 — Why character n-grams?

TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))

Ecological column names are short strings (1-3 words) sharing Latin and Greek roots across languages. "diametre" (FR) and "diametro" (ES) generate the same trigrams: dia, iam, ame. Character n-grams capture these shared morphemes without a bilingual dictionary.

For strings under 50 characters (the typical length of a column name), character n-grams outperform word-level methods because there are too few tokens for stable statistics [11, 14].
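The shared-morpheme claim is easy to verify. A quick sketch of within-word character n-grams (sklearn's char_wb additionally pads tokens with spaces, which this simplified version omits):

```python
def char_ngrams(word, n_min=2, n_max=5):
    """All character n-grams of length n_min..n_max inside one word."""
    return {word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)}

# French and Spanish spellings of "diameter" overlap heavily.
shared = char_ngrams("diametre") & char_ngrams("diametro")
```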

The approach is the canonical choice for short multilingual strings.
[10] Bojanowski et al. (2017). Enriching Word Vectors with Subword Information. arXiv:1607.04606
[11] Cavnar & Trenkle (1994). N-gram-based Text Categorization. 99.8% accuracy for language identification.
[14] Apple Research. Language Identification from Very Short Strings.

9.2 — Why TF-IDF + LogisticRegression L1 instead of Transformers?

LogisticRegression(C=130, penalty="l1", class_weight="balanced")

With 2,525 labeled columns, a BERT model (110M parameters) risks severe overfitting. TF-IDF + L1 LogReg is an "embarrassingly strong baseline" that produces interpretable coefficients — one can inspect exactly which n-grams drive each prediction.

The L1 penalty automatically selects useful features: out of 5,000 n-grams, only a few hundred receive non-zero weights. It is a built-in filter, not a black box.
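The sparsity claim can be illustrated by counting non-zero coefficients after an L1 fit. This toy sample stands in for the real gold set, and C is lowered here so the tiny problem regularizes visibly:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))
X = vec.fit_transform(["diameter_cm", "dbh_cm", "species", "scientific_name"])
clf = LogisticRegression(C=1.0, penalty="l1", solver="saga", max_iter=5000)
clf.fit(X, ["measurement.diameter", "measurement.diameter",
            "taxonomy.species", "taxonomy.species"])

# L1 zeros most n-gram weights; only discriminative ones survive.
nonzero = int(np.count_nonzero(clf.coef_))
total = clf.coef_.size
```

On the real model, the surviving coefficients can be mapped back through `vec.get_feature_names_out()` to see exactly which n-grams drive each concept.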

Criterion | TF-IDF + LR (our choice) | Fine-tuned BERT | Sentence Transformers
Data required | 1k+ | 10k-100k | 1k+
Training time | ~3 seconds | 10-60 min (GPU) | 5-30 min
Interpretability | L1 coefficients | Black box | Black box
Offline / no GPU | Yes | No (GPU) | Yes (22 MB)
Model size | ~3 MB | ~440 MB | ~90 MB
The choice is validated by the literature as scientifically robust, not as a compromise.
[9] Wang & Manning, ACL 2012. Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. ACL
[16] Shmueli, Statistical Science 2010. To Explain or to Predict? DOI

9.3 — Why HistGradientBoosting for values?

HistGradientBoostingClassifier(max_iter=500, max_depth=10, learning_rate=0.05)

The reference paper on tabular data (Grinsztajn et al., NeurIPS 2022) shows that tree-based models dominate neural networks on small-to-medium datasets (< 10k rows), especially when some features are uninformative — which is exactly our case (e.g., in_lat_range is useless for a non-geographic concept).

Our 37 features are a streamlined, interpretable version of Sherlock's 1,588 features [1], adapted to the size of our gold set.

HistGradientBoosting is the canonical choice for tabular data with < 10k samples.
[8] Grinsztajn, Oyallon & Varoquaux, NeurIPS 2022. Why do tree-based models still outperform deep learning on tabular data? arXiv:2207.08815
[1] Hulsebos et al., KDD 2019. Sherlock: A Deep Learning Approach to Semantic Data Type Detection. arXiv:1905.10688

9.4 — Why GroupKFold by data source?

GroupKFold(n_splits=5, groups=source_dataset)

Columns from the same dataset share naming conventions, data quality, and similar sampling protocols. A naive KFold would leak this shared information between train and test, artificially inflating scores.

This is the exact analogue of spatial block cross-validation used in species distribution modeling — observations from the same geographic block are kept together to avoid spatial autocorrelation.

Roberts et al. (2017) is the reference for structured ecological data: "standard random cross-validation on structured data leads to serious underestimation of predictive error."
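A usage sketch with toy column names and source groups, showing that no source ever appears on both sides of a split:

```python
from sklearn.model_selection import GroupKFold

# Toy columns and their source datasets (names illustrative).
columns = ["dbh", "diam_cm", "species", "ddlat", "ddlon", "tax_fam", "genre", "alt"]
groups = ["fia", "fia", "fia", "gbif", "gbif", "nc", "nc", "nc"]

gkf = GroupKFold(n_splits=3)
splits = list(gkf.split(columns, groups=groups))
```

Every fold holds out one entire source, so shared naming conventions within a dataset cannot leak from train to test.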

GroupKFold is the correct evaluation strategy for grouped ecological data.
[12] Roberts et al., Ecography 2017. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. DOI
[15] Dietterich, Neural Computation 1998. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms.

9.5 — Why macro-F1?

f1_score(y_true, y_pred, average="macro")

Macro-F1 computes the F1 for each concept separately, then takes the unweighted average. A rare concept (5 examples: wood_density) weighs as much as a frequent one (191 examples: species).

In ecology, rare concepts are often the most scientifically valuable (soil chemistry, phenological dates, conservation status). Accuracy or micro-F1 would mask catastrophic performance on these minority concepts.
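The failure mode is easy to reproduce on a toy imbalanced sample: accuracy stays high while macro-F1 collapses when the rare concept is always missed:

```python
from sklearn.metrics import accuracy_score, f1_score

# 9 frequent "species" columns, 1 rare "wood_density" column.
y_true = ["species"] * 9 + ["wood_density"]
y_pred = ["species"] * 10          # the rare concept is never detected

acc = accuracy_score(y_true, y_pred)                # 0.9 -- looks fine
macro = f1_score(y_true, y_pred, average="macro")   # ~0.47 -- punished
```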

Metric | Treatment of rare classes | Our case (61 concepts, imbalanced distribution)
Accuracy | Dominated by frequent classes | Can show ~70% even if all rare concepts are misclassified
Micro-F1 | Equivalent to accuracy | Same problem
Weighted-F1 | Proportional to class size | Masks poor performance on rare concepts
Macro-F1 | Equal weight for each class | Forces the model to also detect rare concepts well
Macro-F1 is the standard for imbalanced multi-class problems where minorities matter.
[13] Sokolova & Lapalme, Information Processing & Management 2009. A systematic analysis of performance measures for classification tasks.
[17] Opitz & Burst (2021). Macro F1 and Macro F1. arXiv:1911.03347

10 — Credible Alternatives

Comparison with published approaches in the literature. Our pipeline occupies a justified trade-off between the complexity of deep learning methods and the brittleness of manual rules.

Approach | Data required | Offline | Reported F1 | Complexity | Niamoto status | Reference
TF-IDF + LR + HistGBT (Niamoto) | 2.54k columns | Yes (scikit-learn) | 0.639 CV / 84.7% eval | Low | Implemented | This work
Sherlock (DNN, 1588 features) | 686k columns | Yes (Torch) | 0.89 (weighted) | High | Insufficient data | [1]
Sato (CRF + table context) | 686k columns | Yes (Torch) | 0.93 (weighted) | High | Insufficient data | [2]
DoDuo (fine-tuned BERT) | 10k+ columns | Yes (GPU) | 0.90+ | High | Insufficient data | [4]
LLM zero-shot (GPT-4 / Claude) | 0 | No (API) | 0.85+ (EN) | Low | Gold set enrichment | [5]
Sentence Transformers | 1k+ columns | Yes (22-90 MB) | ~0.80 (estimated) | Medium | Not planned | [10]
Magneto (SLM + LLM) | Variable | Yes (inference) | SotA | High | To evaluate | [7]
Regex / manual rules | 0 | Yes | 0.20-0.40 | Very low | Integrated as features | --
Why deep learning methods are not suited to our context

Gold set size: With 2,525 columns from 94+ sources, we are 2 to 3 orders of magnitude below the typical needs of Sherlock (686k) or BERT (10k minimum). Fine-tuning a 110M parameter model on 2.5k examples would produce overfitting.

Source heterogeneity: Our data spans 8 languages, 5 continents, and highly varied naming conventions. Academic benchmarks (VizNet, WikiTables) are primarily English.

Offline constraint: Niamoto is designed to work without internet (fieldwork, oceanographic vessels). No embedded GPU.

Estimated break-even: Per Grinsztajn et al. (2022), deep learning surpasses boosted trees beyond ~10,000-50,000 examples. With ~5,000+ labeled columns, it would become relevant to explore Sentence Transformers or a lightweight fine-tuned BERT.

11 — Bibliography

Column type detection
  1. Hulsebos, M., Hu, K., Bakker, M., et al. (2019). Sherlock: A Deep Learning Approach to Semantic Data Type Detection. KDD 2019. arXiv:1905.10688
  2. Zhang, D., Suhara, Y., Li, J., et al. (2020). Sato: Contextual Semantic Type Detection in Tables. VLDB 2020. arXiv:1911.06311
  3. Deng, X., Sun, H., Lees, A., et al. (2021). TURL: Table Understanding through Representation Learning. VLDB 2021. arXiv:2006.14806
  4. Suhara, Y., Li, J., Li, Y., et al. (2022). Annotating Columns with Pre-trained Language Models. SIGMOD 2022. arXiv:2104.01785
  5. Korini, K. & Bizer, C. (2023). Column Type Annotation using ChatGPT. VLDB Workshop 2023. arXiv:2306.00745
  6. RACOON (2024). Retrieval-Augmented Column Type Annotation with Knowledge Graphs. arXiv:2409.14556
  7. Liu, Y., et al. (2025). Magneto: Combining Small and Large Language Models for Schema Matching. VLDB 2025. arXiv:2412.08194
Modeling & features
  8. Grinsztajn, L., Oyallon, E. & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical tabular data? NeurIPS 2022. arXiv:2207.08815
  9. Wang, S. & Manning, C. (2012). Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. ACL 2012. ACL
  10. Bojanowski, P., Grave, E., Joulin, A. & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. TACL. arXiv:1607.04606
  11. Cavnar, W. & Trenkle, J. (1994). N-gram-based Text Categorization.
Evaluation & methodology
  12. Roberts, D.R., et al. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography. DOI:10.1111/ecog.02881
  13. Sokolova, M. & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management.
  14. Apple Machine Learning Research. Language Identification from Very Short Strings.
  15. Dietterich, T.G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation.
  16. Shmueli, G. (2010). To Explain or to Predict? Statistical Science. DOI:10.1214/10-STS330
  17. Opitz, J. & Burst, S. (2021). Macro F1 and Macro F1. arXiv:1911.03347
Niamoto — Generic Ecological Data Platform · Updated 2026-03-21 · Branch feat/ml-detection-improvement