Automatic Semantic Column Detection via Machine Learning

Niamoto — Generic Ecological Data Platform

ProductScore: 80.84
GlobalScore: 82.76
EvalSuite Concept: 84.7%
EvalSuite Role: 90.4%
Labeled Columns: 2,540
Concepts: 61
Header F1: 0.7530
Values F1: 0.3780
Fusion F1: 0.6388
Frozen Acceptance: 80.0%
Eval Datasets: 9

Latest retrain snapshot — 2026-03-21 19:40

After the acquisition wave (`TAXREF v18`, `ETS`, `sPlotOpen`) and full retrain, the dashboard now shows both benchmark families: historical holdouts (`ProductScore 80.84`, `GlobalScore 82.76`) and product evaluation (`EvalSuite 84.7% concept`, `90.4% role`). Product datasets are strong (`niamoto-nc 91.2%`, `niamoto-gb 100%`, `guyadiv 83.6%`), while the frozen coded-inventory benchmark `acceptance-fia-or` remains the main weakness at 63.6% concept.

01 — Pipeline Architecture

Hybrid 5-layer pipeline: exact aliases → header classification → values classification → fusion → semantic product projection. The goal is not academic perfection on fine-grained concepts, but correct auto-configuration of an import.

Unknown Column

1. EXACT ALIASES

High-precision fast-path (multilingual YAML)
If the name matches a known alias → immediate detection
↓ otherwise

2. HEADER

TF-IDF char n-grams + L1 LogReg
Analyzes the column name
F1 = 0.7530

3. VALUES

37 features + HistGradientBoosting
Statistically analyzes values
F1 = 0.3780

4. FUSION

LogReg + meta-features (disagreement, confidence, coded headers, cross-rank)
Combines probabilities, corrects inter-branch biases
F1 = 0.6388

5. SEMANTIC PROJECTION

role → concept → affordances → transformer/widget suggestions
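The cascade above can be condensed into a short sketch. The alias registry, the two branch classifiers, and the fusion step are passed in as plain callables; they are hypothetical stand-ins for Niamoto's real components, not its actual API:

```python
def detect_column(name, values, alias_registry, header_clf, values_clf, fusion_clf):
    """Resolve a column to (concept, confidence) through the five-layer cascade."""
    # 1. Exact aliases: O(1) lookup, confidence 1.0, bypasses the classifiers.
    concept = alias_registry.get(name.strip().lower())
    if concept is not None:
        return concept, 1.0
    # 2-3. Otherwise both ML branches run and emit per-concept probabilities.
    p_header = header_clf(name)      # dict: concept -> probability
    p_values = values_clf(values)    # dict: concept -> probability
    # 4. Fusion combines the two probability vectors (plus meta-features).
    return fusion_clf(p_header, p_values)
```

Layer 5 (semantic projection) then maps the winning concept to affordances and widget suggestions; it is omitted here because it is a lookup, not a classifier.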

Exact Aliases — High-precision fast-path

File: column_aliases.yaml — 25 concepts × 8 languages (en, fr, es, pt, de, id, dwc, la)
Exact matching: normalization → O(1) lookup. If found → confidence 1.0, bypass classifier.
Fuzzy matching: Levenshtein (rapidfuzz) ratio ≥ 80% if no exact match.
Safeguard: an ambiguous alias must be disabled. An exact alias must not create false positives at confidence 1.0.
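A minimal sketch of this fast-path, assuming a flat `normalized alias → concept` registry. The real pipeline uses rapidfuzz for the fuzzy fallback; the stdlib `difflib` ratio here is only a stand-in, and the normalization rule is illustrative:

```python
import difflib
import re

def normalize(header: str) -> str:
    """Lowercase and collapse separators so 'DD-Lat' and 'dd_lat' match."""
    return re.sub(r"[\s\-./]+", "_", header.strip().lower())

def resolve_alias(header, registry, fuzzy_threshold=0.80):
    """Exact O(1) lookup first; fuzzy ratio fallback only if no exact hit."""
    key = normalize(header)
    if key in registry:                       # exact match -> confidence 1.0
        return registry[key], 1.0
    best, best_ratio = None, 0.0
    for alias, concept in registry.items():   # fuzzy fallback, ratio >= 80%
        ratio = difflib.SequenceMatcher(None, key, alias).ratio()
        if ratio >= fuzzy_threshold and ratio > best_ratio:
            best, best_ratio = concept, ratio
    return (best, best_ratio) if best else (None, 0.0)
```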

Header Model — Hyperparameters

analyzer: char_wb — character n-grams respecting word boundaries
ngram_range: (2, 5) — character chunks of 2 to 5 (e.g., "di", "dia", "diam", "diame")
max_features: 5,000 — keeps the 5,000 most discriminative n-grams
sublinear_tf: True — uses log(1 + frequency) instead of raw frequency
C: 130.0 — inverse regularization strength (high value = confident model)
penalty: L1 (Lasso) — zeros out useless n-grams, automatic feature selection
solver: saga — fast algorithm compatible with L1
class_weight: balanced — rare concepts weigh as much as frequent ones
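These hyperparameters map directly onto a scikit-learn pipeline. A minimal sketch, trained on a toy six-column sample rather than the real gold set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the 2,540-column gold set.
header_clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5),
                    max_features=5000, sublinear_tf=True),
    LogisticRegression(C=130.0, penalty="l1", solver="saga",
                       class_weight="balanced", max_iter=5000),
)

X = ["diameter_cm", "diametre", "dbh", "species", "espece", "scientific_name"]
y = ["measurement.diameter"] * 3 + ["taxonomy.species"] * 3
header_clf.fit(X, y)
```

Because the features are character n-grams, the unseen Spanish header "diametro" shares most of its n-grams with the French and English diameter columns, so even this toy model classifies it correctly.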

Values Model — 37 statistical features

Numeric statistics (14): mean, std, min, max, skew, kurtosis, Q25, Q50, Q75, range, CV, neg_ratio, int_ratio, zero_ratio
Distribution (3): unique_ratio, null_ratio, entropy
Characters (6): mean_length, std_length, digit_ratio, alpha_ratio, space_ratio, mean_word_count
Regex patterns (4): pct_date_iso, pct_coordinate, pct_boolean, pct_uuid
Biological patterns (2): binomial_score (Genus species), family_suffix (aceae, idae, ales)
Numeric domains (6): mean_decimals, in_lat_range, in_lon_range, in_01_range, small_int_ratio, pct_starts_upper
Meta (2): is_numeric, n_values

Classifier: HistGradientBoostingClassifier(max_iter=500, max_depth=10, learning_rate=0.05, class_weight="balanced")
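A hand-picked subset of these features can be sketched as follows. This is illustrative only: the real extractor computes all 37, and the helper `value_features` is a hypothetical name:

```python
import re
import statistics

BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z]+$")   # "Genus species" pattern

def value_features(values):
    """Compute a few of the 37 value features for one column sample."""
    n = len(values)
    strings = [str(v) for v in values]
    nums = []
    for v in values:
        try:
            nums.append(float(v))
        except (TypeError, ValueError):
            pass
    is_numeric = n > 0 and len(nums) == n
    return {
        "n_values": n,
        "is_numeric": float(is_numeric),
        "unique_ratio": len(set(strings)) / n if n else 0.0,
        "mean": statistics.fmean(nums) if nums else 0.0,
        "in_lat_range": (sum(-90 <= x <= 90 for x in nums) / n
                         if is_numeric else 0.0),
        "binomial_score": (sum(bool(BINOMIAL.match(s)) for s in strings) / n
                           if n else 0.0),
    }
```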

Fusion Model — Intelligent branch combination

Header input: 61 probabilities (one per concept) from the header model
Values input: 61 probabilities from the values model
Basic meta (3): is_anonymous, null_ratio, unique_ratio
Enriched meta: max_proba, top1-top2 margin, entropy, header/values agreement/disagreement, statistic.count flags, code-like header detection
Cross-rank reciprocity (4) — NEW: header_top1_value_rank, value_top1_header_rank, top2_cross_match, both_weak
Targeted damping: if a code-like header plus a branch pushes statistic.count → confidence is damped before fusion
Classifier: LogisticRegression(C=1.0, solver="lbfgs", class_weight="balanced")

Cross-rank reciprocity features capture how well the two branches agree across their top predictions. They were a strong gain in the previous optimization cycle; after the 2026-03-21 retrain the current baseline is ProductScore 80.84.
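The four cross-rank features can be computed directly from the two branches' probability vectors. A sketch, assuming dict-shaped probabilities keyed by concept and an arbitrary weakness threshold of 0.5:

```python
def cross_rank_features(p_header, p_values, weak_threshold=0.5):
    """Cross-rank reciprocity features from the two branch probability dicts."""
    rank_h = sorted(p_header, key=p_header.get, reverse=True)  # concepts by header proba
    rank_v = sorted(p_values, key=p_values.get, reverse=True)  # concepts by values proba
    return {
        # Where does each branch's winner land in the other branch's ranking?
        "header_top1_value_rank": rank_v.index(rank_h[0]),
        "value_top1_header_rank": rank_h.index(rank_v[0]),
        # Does either branch's top-2 contain the other branch's top-1?
        "top2_cross_match": float(rank_h[0] in rank_v[:2] or rank_v[0] in rank_h[:2]),
        # Are both branches unsure at the same time?
        "both_weak": float(max(p_header.values()) < weak_threshold
                           and max(p_values.values()) < weak_threshold),
    }
```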

How column_aliases fits into the ML system

The YAML alias file is not a separate heuristic living next to the model. It is the first stage of the same semantic detection pipeline. If a header is known exactly, Niamoto resolves it immediately. If not, it falls through to ML.

Step 01 — CSV Upload

The user uploads a CSV such as occurrences.csv or plots.csv.

Niamoto reads column names plus a sample of values.

Step 02 — Alias Registry

column_aliases.yaml is loaded into an exact-match registry.

If a header like ddlat or tax_fam is known, prediction returns immediately with high confidence.

Step 03 — ML Classifier

If no exact alias matches, the column goes to the ML stack:

header model + values model + fusion.

Step 04 — Fallback Rules

Only after aliases and ML do a few structural rules apply.

Example: WKT values like POINT(...) can still be recognized as geometry.
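The WKT fallback can be sketched as a single regex over a value sample. The rule shown here is a hypothetical reconstruction of the idea, not Niamoto's exact rule:

```python
import re

# Matches the leading keyword of common WKT geometry literals.
WKT_RE = re.compile(
    r"^\s*(POINT|LINESTRING|POLYGON|MULTIPOINT|MULTILINESTRING|MULTIPOLYGON)\s*\(",
    re.IGNORECASE,
)

def looks_like_geometry(values, min_ratio=0.8):
    """True if at least min_ratio of the sampled values look like WKT."""
    if not values:
        return False
    hits = sum(bool(WKT_RE.match(str(v))) for v in values)
    return hits / len(values) >= min_ratio
```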

Step 05 — Product Output

The final semantic type drives auto-config, affordances, widgets, and import suggestions.

This is what the user sees in smart_config and profiling.

Step 06 — Training And Evaluation Loop

The same concepts are used to build ml/data/gold_set.json and to evaluate frozen acceptance benchmarks.

So aliases affect both runtime precision and the quality of the training/evaluation loop.

Why we edited aliases before touching the model

  • If a header is stable and known, exact aliases are more reliable than asking ML to rediscover it every time.
  • This is especially true for semi-standardized exports like tax_fam, ddlon, or nbe_stem.
  • Improving aliases is therefore an improvement to the pipeline itself, not a workaround outside it.

When aliases are not enough

  • If the header is unknown, coded, noisy, or ambiguous, the ML classifier must generalize from patterns and values.
  • If train and eval labels contradict each other, both aliases and ML become unreliable.
  • This is why benchmark hygiene and label alignment had to be fixed before adding more data.

02 — Gold Set & Data Sources

2,525 labeled columns (1,929 gold + 596 synthetic) from 94+ sources across diverse biomes — from boreal forests to tropical mangroves.

Data Sources

Source | Type | Country / Region | Continent | Language | Columns

03 — Concept Taxonomy

62 fine-grained concepts grouped into 61 coarse concepts, organized into 10 semantic roles. Grouping improves ML performance by merging rare concepts (< 5 examples).

Concept | Role | Examples | Example column names

04 — Evaluation Results

Latest eval suite snapshot after the 2026-03-21 acquisition retrain: 90.4% role, 84.7% concept on 9 datasets / 478 columns. The holdout charts below remain useful as historical diagnostics from the previous benchmark-hygiene phase.

4a — ProductScore Breakdown

Weighted composite score across 6 strategically chosen data families. Each family represents a distinct usage scenario for Niamoto.

4b — Holdouts & Diagnostics

Language holdouts: Train without any column from a language, then test only on that language. Simulates a botanist importing a CSV in a language the model has never seen.

Family holdouts: Train without an entire family of datasets, then test on that family. Reveals domains where the model generalizes poorly.

Structural diagnostics: Performance by column structural profile — standard English, field English, coded headers, GBIF core/extended.

Forest inventory sub-split: Detailed breakdown within the weakest family.

05 — Instance Evaluation

Latest real-world evaluation on actual Niamoto instances and GBIF exports after the acquisition retrain. 418 columns evaluated — 90.9% role accuracy, 85.4% concept accuracy. This excludes the two frozen acceptance datasets.

Historical detail: niamoto-nc — Alias-only vs ML (pre-acquisition snapshot)
Method | Role correct | Role % | Concept correct | Concept %
Alias only | 11/29 | 38% | 11/29 | 38%
ML pipeline | 16/29 | 55% | 14/29 | 48%

This table is kept as a historical baseline from the pre-acquisition phase. It illustrates why aliases alone were insufficient on opaque niamoto-nc headers before the 2026-03-21 retrain.

06 — Error Patterns

Error patterns updated after the acquisition wave retrain and full eval suite run on 2026-03-21.

ERROR-01 Partially fixed

Over-prediction of measurement.diameter

Fixed: diversity indices (shannon, pielou, simpson) and holdridge are now matched by alias. The basal_area → diameter merge bug in concept_taxonomy.py was corrected so that basal_area now maps to biomass.

Remaining: booleans (flower, fruit, in_um) and taxonomic counts (gymnospermae, monocotyledonae, dicotyledonae) still predicted as diameter. The values branch still has insufficient signal to distinguish boolean/count distributions from continuous measurements.

ERROR-02 Fixed

measurement.trait unrecognized → now 67%

Alias registry covers shannon, pielou, simpson, bark_thickness, leaf_ldmc, leaf_thickness. Gold set enriched with 8 trait examples from niamoto-nc. Detection: 0/12 → 8/12 (67%).

ERROR-03 Partially fixed

category.* poorly detected

Fixed: holdridge → category.vegetation via alias. strata → category.vegetation via alias. phenology, pheno → category.ecology via alias. Gold set enriched with bioclimate, phenology, stratum examples.

Remaining: booleans (in_forest, in_um, flower, fruit) still misclassified. The values branch treats True/False and 0/1 as numeric, not categorical.

ERROR-04 Partially fixed

taxonomy.name remains weak on non-binomial GBIF columns

Fixed: binomial species-name columns were real annotation issues and were aligned to taxonomy.species.

Remaining: truly non-binomial fields such as genericName, infraspecificEpithet, and scientificNameAuthorship are still systematically missed on GBIF exports. In the latest eval suite, taxonomy.name is still 0/3.

ERROR-05 Partially fixed

GBIF Darwin Core systematic errors: 9 → 5 remaining

Fixed by alias: catalogNumber → identifier.collection, basisOfRecord → category.basis, occurrenceStatus → category.status, taxonomicStatus → category.status. GBIF concept accuracy: 76% → 88-90%.

Remaining (5): acceptedTaxonKey, speciesKey (numeric keys → need identifier.taxon alias for compound names), genericName, infraspecificEpithet, scientificNameAuthorship.

ERROR-06 Open

Coded inventory generalization remains weak on frozen FIA benchmark

Current state: acceptance-fia-or reaches only 63.6% concept accuracy after retrain, while acceptance-niamoto-gb is at 100%.

Main misses: SPCD, CR, VOLCFNET, VOLBFNET, plus repeated confusions on measurement.biomass, identifier.taxon, and missing category.habitat. This is still the clearest gap between product-focused learning and out-of-train coded inventories.

07 — Training Evolution

Autoresearch-driven optimization, cross-rank reciprocity features, surrogate evaluation loop, and batch optimization.

7a — Autoresearch Progression

The autoresearch pattern

Inspired by Karpathy (2026): an agent modifies hyperparameters, evaluates, keeps improvements, rejects regressions, and loops.

Result: 24 header iterations (+41 cumulative points), 8 values iterations (+6.5 cumulative points). Each gain tracked by a git commit.
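The loop can be sketched as simple hill-climbing. The function names, mutation scheme, and parameters below are hypothetical, not the actual autoresearch harness:

```python
import random

def autoresearch(base_params, mutate, evaluate, iterations=24, seed=0):
    """Mutate -> evaluate -> keep improvements, discard regressions, loop."""
    rng = random.Random(seed)
    best_params = dict(base_params)
    best_score = evaluate(best_params)
    for _ in range(iterations):
        candidate = mutate(dict(best_params), rng)
        score = evaluate(candidate)
        if score > best_score:      # keep the gain (a git commit in the real loop)
            best_params, best_score = candidate, score
        # regressions are simply thrown away
    return best_params, best_score
```

With a scoring function such as ProductScore in place of the toy objective, each kept candidate corresponds to one committed iteration.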

7b — Cross-rank Reciprocity Features

4 new fusion features capturing branch agreement depth

Feature | Description
header_top1_value_rank | Rank of the header branch's top-1 prediction in the values branch's probability vector
value_top1_header_rank | Rank of the values branch's top-1 prediction in the header branch's probability vector
top2_cross_match | Whether either branch's top-2 contains the other branch's top-1
both_weak | Both branches have a low maximum probability
surrogate-fast: +0.92  |  surrogate-mid: +0.89  |  ProductScore: +0.79

7c — Surrogate Evaluation Loop

Multi-tier validation chain

Stage | Time | Purpose
Cache build | ~8 min | One-shot feature extraction for all columns
surrogate-fast | ~1.7 s/eval | Quick filter for unpromising changes
surrogate-mid | ~15 s/eval | GroupKFold with fewer folds
product-score | ~2 min | Full weighted family evaluation
niamoto-score | ~5 min | Complete offline score with all holdouts
Validation chain: fast → mid → product-score → niamoto-score. Each tier gates the next.

7d — Batch Optimization

Training time: 5h → 15min (20x speedup)

Batch fusion feature extraction replaced per-column computation. All feature vectors are computed in a single pass over pre-cached header and values probabilities. Training loop and cross-validation run on pre-computed matrices.

Result: identical model outputs, 20x faster iteration. Enables more autoresearch cycles per session.

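The equivalence can be illustrated with a toy meta-feature extractor: the batched version computes the same matrix as the per-column loop, in one vectorized pass. The feature set is simplified for illustration and is not the real fusion extractor:

```python
import numpy as np

def fusion_features_per_column(p_header_row, p_values_row):
    """Per-column meta-features: max proba, top1-top2 margin, branch agreement."""
    ph = np.sort(p_header_row)[::-1]
    pv = np.sort(p_values_row)[::-1]
    agree = float(np.argmax(p_header_row) == np.argmax(p_values_row))
    return np.array([ph[0], ph[0] - ph[1], pv[0], pv[0] - pv[1], agree])

def fusion_features_batch(P_header, P_values):
    """Same features for all columns at once, from pre-cached probability matrices."""
    sh = np.sort(P_header, axis=1)[:, ::-1]
    sv = np.sort(P_values, axis=1)[:, ::-1]
    agree = (P_header.argmax(axis=1) == P_values.argmax(axis=1)).astype(float)
    return np.column_stack([sh[:, 0], sh[:, 0] - sh[:, 1],
                            sv[:, 0], sv[:, 0] - sv[:, 1], agree])
```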

08 — Training & Improvement Workflow

End-to-end pipeline from raw data to evaluation. Each box is an independent step with its own script. Full guide: docs/05-ml-detection/training-guide.md

DATA SOURCES

ml/data/silver/
CSV files from inventories, GBIF, traits, etc.
+

LABEL MAPPINGS

build_gold_set.py
Column → concept mappings per dataset

GOLD SET

uv run python -m ml.scripts.data.build_gold_set
2,540 labeled columns × 61 concepts × value statistics
ml/data/gold_set.json

TRAIN HEADER

TF-IDF + LogReg
F1 = 0.7530

TRAIN VALUES

37 features + HGBT
F1 = 0.3780

TRAIN FUSION

LogReg + cross-rank
F1 = 0.6388

ALIAS REGISTRY

column_aliases.yaml
25+ concepts × 8 languages
No retraining needed
+

3 MODELS

ml/models/*.joblib
header (2.6 MB) + values (40 MB)
+ fusion (80 KB)

EVALUATION SUITE

uv run python -m ml.scripts.eval.run_eval_suite
9 datasets × 478 columns × verified annotations
84.7% concept accuracy

DIAGNOSIS & ITERATE

Concept absent → add to gold set & retrain
Column name unambiguous → add alias (no retrain)
Bad merge in taxonomy → fix concept_taxonomy.py
Wrong annotation → fix eval annotations

8a — Session Improvement Timeline (2026-03-20 → 2026-03-21)

Detail: Improvement steps
Step | Action | Concept % | Delta
V1 | Initial annotations (418 cols, 7 datasets) | 66.5% |
V2 | Fix annotations: taxonomy.name → species, plot_name context | 71.1% | +4.6
V3 | Verify annotations against actual values | 71.3% | +0.2
V4 | Expand alias registry (+13 concepts) | 76.6% | +5.3
V5 | Enrich gold set (niamoto-nc) + retrain + fix basal_area merge | 77.5% | +0.9
V6 | Acquire TAXREF v18 + ETS + sPlotOpen, rebuild gold set, full retrain | 85.4% | +7.9

09 — Methodology Justification

Each technical choice is justified by academic literature. Bracket numbers refer to the References section.

9.1 — Why character n-grams?

TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))

Ecological column names are short strings (1-3 words) sharing Latin and Greek roots across languages. "diametre" (FR) and "diametro" (ES) generate the same trigrams: dia, iam, ame. Character n-grams capture these shared morphemes without a bilingual dictionary.

For strings under 50 characters (the typical length of a column name), character n-grams outperform word-level methods because there are too few tokens for stable statistics [11, 14].
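The shared-morpheme claim is easy to verify. A quick sketch of within-word character n-grams (sklearn's char_wb additionally pads tokens with spaces, which this simplified version omits):

```python
def char_ngrams(word, n_min=2, n_max=5):
    """All character n-grams of length n_min..n_max inside one word."""
    return {word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)}

# French and Spanish spellings of "diameter" overlap heavily.
shared = char_ngrams("diametre") & char_ngrams("diametro")
```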

The approach is the canonical choice for short multilingual strings.
[10] Bojanowski et al. (2017). Enriching Word Vectors with Subword Information. arXiv:1607.04606
[11] Cavnar & Trenkle (1994). N-gram-based Text Categorization. 99.8% accuracy for language identification.
[14] Apple Research. Language Identification from Very Short Strings.

9.2 — Why TF-IDF + LogisticRegression L1 instead of Transformers?

LogisticRegression(C=130, penalty="l1", class_weight="balanced")

With 2,525 labeled columns, a BERT model (110M parameters) risks severe overfitting. TF-IDF + L1 LogReg is an "embarrassingly strong baseline" that produces interpretable coefficients — one can inspect exactly which n-grams drive each prediction.

The L1 penalty automatically selects useful features: out of 5,000 n-grams, only a few hundred receive non-zero weights. It is a built-in filter, not a black box.
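The sparsity claim can be illustrated by counting non-zero coefficients after an L1 fit. This toy sample stands in for the real gold set, and C is lowered here so the tiny problem regularizes visibly:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))
X = vec.fit_transform(["diameter_cm", "dbh_cm", "species", "scientific_name"])
clf = LogisticRegression(C=1.0, penalty="l1", solver="saga", max_iter=5000)
clf.fit(X, ["measurement.diameter", "measurement.diameter",
            "taxonomy.species", "taxonomy.species"])

# L1 zeros most n-gram weights; only discriminative ones survive.
nonzero = int(np.count_nonzero(clf.coef_))
total = clf.coef_.size
```

On the real model, the surviving coefficients can be mapped back through `vec.get_feature_names_out()` to see exactly which n-grams drive each concept.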

Criterion | TF-IDF + LR (our choice) | Fine-tuned BERT | Sentence Transformers
Data required | 1k+ | 10k-100k | 1k+
Training time | ~3 seconds | 10-60 min (GPU) | 5-30 min
Interpretability | L1 coefficients | Black box | Black box
Offline / no GPU | Yes | No (GPU) | Yes (22 MB)
Model size | ~3 MB | ~440 MB | ~90 MB
The choice is validated by the literature as scientifically robust, not as a compromise.
[9] Wang & Manning, ACL 2012. Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. ACL
[16] Shmueli, Statistical Science 2010. To Explain or to Predict? DOI

9.3 — Why HistGradientBoosting for values?

HistGradientBoostingClassifier(max_iter=500, max_depth=10, learning_rate=0.05)

The reference paper on tabular data (Grinsztajn et al., NeurIPS 2022) shows that tree-based models dominate neural networks on small-to-medium datasets (< 10k rows), especially when some features are uninformative — which is exactly our case (e.g., in_lat_range is useless for a non-geographic concept).

Our 37 features are a streamlined, interpretable version of Sherlock's 1,588 features [1], adapted to the size of our gold set.

HistGradientBoosting is the canonical choice for tabular data with < 10k samples.
[8] Grinsztajn, Oyallon & Varoquaux, NeurIPS 2022. Why do tree-based models still outperform deep learning on tabular data? arXiv:2207.08815
[1] Hulsebos et al., KDD 2019. Sherlock: A Deep Learning Approach to Semantic Data Type Detection. arXiv:1905.10688

9.4 — Why GroupKFold by data source?

GroupKFold(n_splits=5, groups=source_dataset)

Columns from the same dataset share naming conventions, data quality, and similar sampling protocols. A naive KFold would leak this shared information between train and test, artificially inflating scores.

This is the exact analogue of spatial block cross-validation used in species distribution modeling — observations from the same geographic block are kept together to avoid spatial autocorrelation.

Roberts et al. (2017) is the reference for structured ecological data: "standard random cross-validation on structured data leads to serious underestimation of predictive error."
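A usage sketch with toy column names and source groups, showing that no source ever appears on both sides of a split:

```python
from sklearn.model_selection import GroupKFold

# Toy columns and their source datasets (names illustrative).
columns = ["dbh", "diam_cm", "species", "ddlat", "ddlon", "tax_fam", "genre", "alt"]
groups = ["fia", "fia", "fia", "gbif", "gbif", "nc", "nc", "nc"]

gkf = GroupKFold(n_splits=3)
splits = list(gkf.split(columns, groups=groups))
```

Every fold holds out one entire source, so shared naming conventions within a dataset cannot leak from train to test.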

GroupKFold is the correct evaluation strategy for grouped ecological data.
[12] Roberts et al., Ecography 2017. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. DOI
[15] Dietterich, Neural Computation 1998. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms.

9.5 — Why macro-F1?

f1_score(y_true, y_pred, average="macro")

Macro-F1 computes the F1 for each concept separately, then takes the unweighted average. A rare concept (5 examples: wood_density) weighs as much as a frequent one (191 examples: species).

In ecology, rare concepts are often the most scientifically valuable (soil chemistry, phenological dates, conservation status). Accuracy or micro-F1 would mask catastrophic performance on these minority concepts.
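The failure mode is easy to reproduce on a toy imbalanced sample: accuracy stays high while macro-F1 collapses when the rare concept is always missed:

```python
from sklearn.metrics import accuracy_score, f1_score

# 9 frequent "species" columns, 1 rare "wood_density" column.
y_true = ["species"] * 9 + ["wood_density"]
y_pred = ["species"] * 10          # the rare concept is never detected

acc = accuracy_score(y_true, y_pred)                # 0.9 -- looks fine
macro = f1_score(y_true, y_pred, average="macro")   # ~0.47 -- punished
```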

Metric | Treatment of rare classes | Our case (61 concepts, imbalanced distribution)
Accuracy | Dominated by frequent classes | Can show ~70% even if all rare concepts are misclassified
Micro-F1 | Equivalent to accuracy | Same problem
Weighted-F1 | Proportional to class size | Masks poor performance on rare concepts
Macro-F1 | Equal weight for each class | Forces the model to also detect rare concepts well
Macro-F1 is the standard for imbalanced multi-class problems where minorities matter.
[13] Sokolova & Lapalme, Information Processing & Management 2009. A systematic analysis of performance measures for classification tasks.
[17] Opitz & Burst (2021). Macro F1 and Macro F1. arXiv:1911.03347

10 — Credible Alternatives

Comparison with published approaches in the literature. Our pipeline occupies a justified trade-off between the complexity of deep learning methods and the brittleness of manual rules.

Approach | Data required | Offline | Reported F1 | Complexity | Niamoto status | Reference
TF-IDF + LR + HistGBT (Niamoto) | 2.54k columns | Yes (scikit-learn) | 0.639 CV / 84.7% eval | Low | Implemented | This work
Sherlock (DNN, 1588 features) | 686k columns | Yes (Torch) | 0.89 (weighted) | High | Insufficient data | [1]
Sato (CRF + table context) | 686k columns | Yes (Torch) | 0.93 (weighted) | High | Insufficient data | [2]
DoDuo (fine-tuned BERT) | 10k+ columns | Yes (GPU) | 0.90+ | High | Insufficient data | [4]
LLM zero-shot (GPT-4 / Claude) | 0 | No (API) | 0.85+ (EN) | Low | Gold set enrichment | [5]
Sentence Transformers | 1k+ columns | Yes (22-90 MB) | ~0.80 (estimated) | Medium | Not planned | [10]
Magneto (SLM + LLM) | Variable | Yes (inference) | SotA | High | To evaluate | [7]
Regex / manual rules | 0 | Yes | 0.20-0.40 | Very low | Integrated as features | --
Why deep learning methods are not suited to our context

Gold set size: With 2,525 columns from 94+ sources, we are 2 to 3 orders of magnitude below the typical needs of Sherlock (686k) or BERT (10k minimum). Fine-tuning a 110M parameter model on 2.5k examples would produce overfitting.

Source heterogeneity: Our data spans 8 languages, 5 continents, and highly varied naming conventions. Academic benchmarks (VizNet, WikiTables) are primarily English.

Offline constraint: Niamoto is designed to work without internet (fieldwork, oceanographic vessels). No embedded GPU.

Estimated break-even: Per Grinsztajn et al. (2022), deep learning surpasses boosted trees beyond ~10,000-50,000 examples. With ~5,000+ labeled columns, it would become relevant to explore Sentence Transformers or a lightweight fine-tuned BERT.

11 — Bibliography

Column type detection
  1. Hulsebos, M., Hu, K., Bakker, M., et al. (2019). Sherlock: A Deep Learning Approach to Semantic Data Type Detection. KDD 2019. arXiv:1905.10688
  2. Zhang, D., Suhara, Y., Li, J., et al. (2020). Sato: Contextual Semantic Type Detection in Tables. VLDB 2020. arXiv:1911.06311
  3. Deng, X., Sun, H., Lees, A., et al. (2021). TURL: Table Understanding through Representation Learning. VLDB 2021. arXiv:2006.14806
  4. Suhara, Y., Li, J., Li, Y., et al. (2022). Annotating Columns with Pre-trained Language Models. SIGMOD 2022. arXiv:2104.01785
  5. Korini, K. & Bizer, C. (2023). Column Type Annotation using ChatGPT. VLDB Workshop 2023. arXiv:2306.00745
  6. RACOON (2024). Retrieval-Augmented Column Type Annotation with Knowledge Graphs. arXiv:2409.14556
  7. Liu, Y., et al. (2025). Magneto: Combining Small and Large Language Models for Schema Matching. VLDB 2025. arXiv:2412.08194
Modeling & features
  8. Grinsztajn, L., Oyallon, E. & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical tabular data? NeurIPS 2022. arXiv:2207.08815
  9. Wang, S. & Manning, C. (2012). Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. ACL 2012. ACL
  10. Bojanowski, P., Grave, E., Joulin, A. & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. TACL. arXiv:1607.04606
  11. Cavnar, W. & Trenkle, J. (1994). N-gram-based Text Categorization.
Evaluation & methodology
  12. Roberts, D.R., et al. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography. DOI:10.1111/ecog.02881
  13. Sokolova, M. & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management.
  14. Apple Machine Learning Research. Language Identification from Very Short Strings.
  15. Dietterich, T.G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation.
  16. Shmueli, G. (2010). To Explain or to Predict? Statistical Science. DOI:10.1214/10-STS330
  17. Opitz, J. & Burst, S. (2021). Macro F1 and Macro F1. arXiv:1911.03347
Niamoto — Generic Ecological Data Platform · Updated 2026-03-21 · Branch feat/ml-detection-improvement