Niamoto — Generic Ecological Data Platform
After the acquisition wave (`TAXREF v18`, `ETS`, `sPlotOpen`) and full retrain, the dashboard now shows both benchmark families: historical holdouts (`ProductScore 80.84`, `GlobalScore 82.76`) and product evaluation (`EvalSuite 84.7% concept`, `90.4% role`). Product datasets are strong (`niamoto-nc 91.2%`, `niamoto-gb 100%`, `guyadiv 83.6%`), while the frozen coded-inventory benchmark `acceptance-fia-or` remains the main weakness at 63.6% concept.
Hybrid 5-layer pipeline: exact aliases → header classification → values classification → fusion → semantic product projection. The goal is not academic perfection on fine-grained concepts, but correct auto-configuration of an import.
| Aspect | Detail |
|---|---|
| File | column_aliases.yaml — 25 concepts × 8 languages (en, fr, es, pt, de, id, dwc, la) |
| Exact matching | Normalization → O(1) lookup. If found → confidence 1.0, bypass classifier. |
| Fuzzy matching | Levenshtein (rapidfuzz) ratio ≥ 80% if no exact match. |
| Safeguard | An ambiguous alias must be disabled. An exact alias must not create false positives at confidence 1.0. |
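The first two layers can be sketched as follows. This is a minimal illustration, not the production code: the mini-registry is hypothetical, and stdlib `difflib` stands in for rapidfuzz's Levenshtein ratio.

```python
import difflib
import unicodedata

# Hypothetical mini-registry; the real one is loaded from column_aliases.yaml.
ALIASES = {
    "diameter": "measurement.diameter",
    "dbh": "measurement.diameter",
    "latitude": "location.latitude",
    "ddlat": "location.latitude",
}

def normalize(header: str) -> str:
    # Strip accents, lowercase, unify separators: the O(1) lookup key.
    s = unicodedata.normalize("NFKD", header).encode("ascii", "ignore").decode()
    return s.strip().lower().replace("-", "_").replace(" ", "_")

def resolve_alias(header: str, threshold: float = 0.80):
    key = normalize(header)
    if key in ALIASES:                     # exact match: confidence 1.0
        return ALIASES[key], 1.0
    # Fuzzy fallback (rapidfuzz in production; difflib ratio here).
    best = max(ALIASES, key=lambda a: difflib.SequenceMatcher(None, key, a).ratio())
    score = difflib.SequenceMatcher(None, key, best).ratio()
    if score >= threshold:
        return ALIASES[best], score
    return None, 0.0
```

For example, `resolve_alias("DDLAT")` resolves exactly at confidence 1.0, while an accented French header like "Diamètre" falls through to the fuzzy layer.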
| Parameter | Value |
|---|---|
| analyzer | char_wb — character n-grams respecting word boundaries |
| ngram_range | (2, 5) — character chunks of 2 to 5 (e.g., "di", "dia", "diam", "diame") |
| max_features | 5,000 — keeps the 5000 most discriminative n-grams |
| sublinear_tf | True — uses log(1 + frequency) instead of raw frequency |
| C | 130.0 — inverse regularization strength (high value = confident model) |
| penalty | L1 (Lasso) — zeros out useless n-grams, automatic feature selection |
| solver | saga — fast algorithm compatible with L1 |
| class_weight | balanced — rare concepts weigh as much as frequent ones |
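In scikit-learn terms, the header branch with the hyperparameters above amounts to roughly this pipeline. The training data here is a four-row toy stand-in for the gold set, so this is a sketch of the configuration, not the trained model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

header_model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5),
                    max_features=5000, sublinear_tf=True),
    LogisticRegression(C=130.0, penalty="l1", solver="saga",
                       class_weight="balanced", max_iter=5000),
)

# Toy training data: four headers, two concepts (the real gold set has 2,525 columns).
headers = ["diameter_cm", "diametre", "latitude", "ddlat"]
labels = ["measurement.diameter", "measurement.diameter",
          "location.latitude", "location.latitude"]
header_model.fit(headers, labels)
```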
| Feature group | Features |
|---|---|
| Numeric statistics (14) | mean, std, min, max, skew, kurtosis, Q25, Q50, Q75, range, CV, neg_ratio, int_ratio, zero_ratio |
| Distribution (3) | unique_ratio, null_ratio, entropy |
| Characters (6) | mean_length, std_length, digit_ratio, alpha_ratio, space_ratio, mean_word_count |
| Regex patterns (4) | pct_date_iso, pct_coordinate, pct_boolean, pct_uuid |
| Biological patterns (2) | binomial_score (Genus species), family_suffix (aceae, idae, ales) |
| Numeric domains (6) | mean_decimals, in_lat_range, in_lon_range, in_01_range, small_int_ratio, pct_starts_upper |
| Meta (2) | is_numeric, n_values |
Classifier: HistGradientBoostingClassifier(max_iter=500, max_depth=10, learning_rate=0.05, class_weight="balanced")
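A values-branch featurizer along these lines can be sketched as below. This covers only an illustrative subset of the 37 features (the exact formulas in the real extractor may differ).

```python
import numpy as np

def numeric_features(values: np.ndarray) -> dict:
    # Illustrative subset of the per-column statistics listed above;
    # the real extractor computes all 37 features.
    v = values[~np.isnan(values)]
    q25, q50, q75 = np.percentile(v, [25, 50, 75])
    mean, std = v.mean(), v.std()
    return {
        "mean": mean, "std": std, "min": v.min(), "max": v.max(),
        "skew": float(((v - mean) ** 3).mean() / std ** 3) if std else 0.0,
        "q25": q25, "q50": q50, "q75": q75,
        "range": v.max() - v.min(),
        "cv": std / mean if mean else 0.0,
        "neg_ratio": float((v < 0).mean()),
        "int_ratio": float((v == np.round(v)).mean()),
        "zero_ratio": float((v == 0).mean()),
        "unique_ratio": len(np.unique(v)) / len(v),
        "in_lat_range": float(((v >= -90) & (v <= 90)).mean()),
    }
```

A column of small integer measurements, for instance, scores int_ratio = 1.0 and lands fully inside the latitude range, which is exactly why in_lat_range alone is never decisive.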
| Component | Detail |
|---|---|
| Header input | 61 probabilities (one per concept) from header model |
| Values input | 61 probabilities from values model |
| Basic meta (3) | is_anonymous, null_ratio, unique_ratio |
| Enriched meta | max_proba, top1-top2 margin, entropy, header/values agreement/disagreement, statistic.count flags, code-like header detection |
| Cross-rank reciprocity (4) — NEW | header_top1_value_rank, value_top1_header_rank, top2_cross_match, both_weak |
| Targeted damping | If code-like header + a branch pushes statistic.count → damped confidence before fusion |
| Classifier | LogisticRegression(C=1.0, solver="lbfgs", class_weight="balanced") |
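A minimal sketch of the fusion input, assuming 61-dimensional probability vectors from each branch and only a small illustrative subset of the meta features (the real feature set also includes entropy, cross-rank reciprocity, and code-like header flags):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fusion_vector(p_header, p_values, is_anonymous, null_ratio, unique_ratio):
    # 61 header probas + 61 values probas + a few meta features.
    top_h = np.sort(p_header)[::-1]
    meta = np.array([
        float(is_anonymous), null_ratio, unique_ratio,
        max(p_header.max(), p_values.max()),                 # max_proba
        top_h[0] - top_h[1],                                 # top1-top2 margin
        float(np.argmax(p_header) == np.argmax(p_values)),   # branch agreement
    ])
    return np.concatenate([p_header, p_values, meta])

# Fusion classifier as configured above.
fusion_clf = LogisticRegression(C=1.0, solver="lbfgs", class_weight="balanced")
```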
Cross-rank reciprocity features capture how well the two branches agree across their top predictions. They were a strong gain in the previous optimization cycle; after the 2026-03-21 retrain the current baseline is ProductScore 80.84.
column_aliases fits into the ML system

The YAML alias file is not a separate heuristic living next to the model. It is the first stage of the same semantic detection pipeline. If a header is known exactly, Niamoto resolves it immediately. If not, it falls through to ML.
The user uploads a CSV such as occurrences.csv or plots.csv.
Niamoto reads column names plus a sample of values.
column_aliases.yaml is loaded into an exact-match registry.
If a header like ddlat or tax_fam is known, prediction returns immediately with high confidence.
If no exact alias matches, the column goes to the ML stack:
header model + values model + fusion.
Only after aliases and ML do a few structural rules apply.
Example: WKT values like POINT(...) can still be recognized as geometry.
The final semantic type drives auto-config, affordances, widgets, and import suggestions.
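The resolution order described in the steps above can be sketched as follows. `alias_lookup` and `ml_predict` are injected stand-ins for the alias registry and the ML stack, and "location.geometry" is a hypothetical concept name used purely for illustration.

```python
import re

# WKT geometry values such as POINT(...) are caught by a structural rule.
WKT_RE = re.compile(r"(POINT|LINESTRING|POLYGON)\s*\(")

def detect_semantic_type(header, sample, alias_lookup, ml_predict):
    concept = alias_lookup(header)
    if concept is not None:                     # 1. exact alias, confidence 1.0
        return concept, 1.0
    concept, conf = ml_predict(header, sample)  # 2. header + values + fusion
    if sample and all(WKT_RE.match(v) for v in sample):
        # 3. structural rule; "location.geometry" is a hypothetical label.
        return "location.geometry", max(conf, 0.9)
    return concept, conf
```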
This is what the user sees in smart_config and profiling.
The same concepts are used to build ml/data/gold_set.json and to evaluate frozen acceptance benchmarks.
So aliases affect both runtime precision and the quality of the training/evaluation loop.
Opaque coded headers look like tax_fam, ddlon, or nbe_stem.

2,525 labeled columns (1,929 gold + 596 synthetic) from 94+ sources across diverse biomes — from boreal forests to tropical mangroves.
| Source | Type | Country / Region | Continent | Language | Columns |
|---|---|---|---|---|---|
62 fine-grained concepts grouped into 61 coarse concepts, organized into 10 semantic roles. Grouping improves ML performance by merging rare concepts (< 5 examples).
| Concept | Role | Examples | Example column names |
|---|---|---|---|
Latest eval suite snapshot after the 2026-03-21 acquisition retrain: 90.4% role, 84.7% concept on 9 datasets / 478 columns. The holdout charts below remain useful as historical diagnostics from the previous benchmark-hygiene phase.
Weighted composite score across 6 strategically chosen data families. Each family represents a distinct usage scenario for Niamoto.
Language holdouts: Train without any column from a language, then test only on that language. Simulates a botanist importing a CSV in a language the model has never seen.
Family holdouts: Train without an entire family of datasets, then test on that family. Reveals domains where the model generalizes poorly.
Structural diagnostics: Performance by column structural profile — standard English, field English, coded headers, GBIF core/extended.
Forest inventory sub-split: Detailed breakdown within the weakest family.
Latest real-world evaluation on actual Niamoto instances and GBIF exports after the acquisition retrain. 418 columns evaluated — 90.9% role accuracy, 85.4% concept accuracy. This excludes the two frozen acceptance datasets.
| Method | Role correct | Role % | Concept correct | Concept % |
|---|---|---|---|---|
| Alias only | 11/29 | 38% | 11/29 | 38% |
| ML pipeline | 16/29 | 55% | 14/29 | 48% |
This table is kept as a historical baseline from the pre-acquisition phase. It illustrates why aliases alone were insufficient on opaque niamoto-nc headers before the 2026-03-21 retrain.
Error patterns updated after the acquisition wave retrain and full eval suite run on 2026-03-21.
measurement.diameter

Fixed: diversity indices (shannon, pielou, simpson) and holdridge are now matched by alias. The basal_area → diameter merge bug in concept_taxonomy.py was corrected: basal_area now maps to biomass.
Remaining: booleans (flower, fruit, in_um) and taxonomic counts (gymnospermae, monocotyledonae, dicotyledonae) still predicted as diameter. The values branch still has insufficient signal to distinguish boolean/count distributions from continuous measurements.
measurement.trait unrecognized → now 67%

Alias registry covers shannon, pielou, simpson, bark_thickness, leaf_ldmc, leaf_thickness. Gold set enriched with 8 trait examples from niamoto-nc. Detection: 0/12 → 8/12 (67%).
category.* poorly detected

Fixed: holdridge → category.vegetation via alias. strata → category.vegetation via alias. phenology, pheno → category.ecology via alias. Gold set enriched with bioclimate, phenology, stratum examples.
Remaining: booleans (in_forest, in_um, flower, fruit) still misclassified. The values branch treats True/False and 0/1 as numeric, not categorical.
taxonomy.name remains weak on non-binomial GBIF columns

Fixed: columns that actually contained binomial species names were annotation errors; they have been realigned to taxonomy.species.
Remaining: truly non-binomial fields such as genericName, infraspecificEpithet, and scientificNameAuthorship are still systematically missed on GBIF exports. In the latest eval suite, taxonomy.name is still 0/3.
Fixed by alias: catalogNumber → identifier.collection, basisOfRecord → category.basis, occurrenceStatus → category.status, taxonomicStatus → category.status. GBIF concept accuracy: 76% → 88-90%.
Remaining (5): acceptedTaxonKey, speciesKey (numeric keys → need identifier.taxon alias for compound names), genericName, infraspecificEpithet, scientificNameAuthorship.
Current state: acceptance-fia-or reaches only 63.6% concept accuracy after retrain, while acceptance-niamoto-gb is at 100%.
Main misses: SPCD, CR, VOLCFNET, VOLBFNET, plus repeated confusions on measurement.biomass, identifier.taxon, and missing category.habitat. This is still the clearest gap between product-focused learning and out-of-train coded inventories.
Autoresearch-driven optimization, cross-rank reciprocity features, surrogate evaluation loop, and batch optimization.
Inspired by Karpathy (2026): an agent modifies hyperparameters, evaluates, keeps improvements, rejects regressions, and loops.
Result: 24 header iterations (+41 cumulative points), 8 values iterations (+6.5 cumulative points). Each gain tracked by a git commit.
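The accept/reject loop can be sketched as a greedy hill-climb; `evaluate` and `propose` are placeholders for the scoring pipeline and the hyperparameter mutation step, which are not specified in this document.

```python
import random

def autoresearch(evaluate, propose, baseline_params, iterations=24, seed=0):
    # Greedy loop: mutate hyperparameters, evaluate, keep only improvements.
    rng = random.Random(seed)
    best_params = dict(baseline_params)
    best_score = evaluate(best_params)
    for _ in range(iterations):
        candidate = propose(best_params, rng)
        score = evaluate(candidate)
        if score > best_score:          # keep the improvement
            best_params, best_score = candidate, score
        # otherwise reject: the next proposal starts from best_params again
    return best_params, best_score
```

In practice each accepted step corresponds to a git commit, which is what makes the cumulative gains auditable.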
| Feature | Description |
|---|---|
| header_top1_value_rank | Rank of header's top-1 prediction in values' probability vector |
| value_top1_header_rank | Rank of values' top-1 prediction in header's probability vector |
| top2_cross_match | Whether either branch's top-2 contains the other's top-1 |
| both_weak | Both branches have low max probability |
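The four features in the table above can be computed as below; the 0.30 weakness threshold is an assumed value for illustration.

```python
import numpy as np

def cross_rank_features(p_header, p_values, weak_threshold=0.30):
    # Rank of each branch's top-1 concept inside the OTHER branch's
    # probability vector (rank 0 means the other branch agrees).
    order_h = np.argsort(-p_header)     # concept indices, best first
    order_v = np.argsort(-p_values)
    h_top1, v_top1 = order_h[0], order_v[0]
    return {
        "header_top1_value_rank": int(np.where(order_v == h_top1)[0][0]),
        "value_top1_header_rank": int(np.where(order_h == v_top1)[0][0]),
        "top2_cross_match": int(h_top1 in order_v[:2] or v_top1 in order_h[:2]),
        "both_weak": int(p_header.max() < weak_threshold
                         and p_values.max() < weak_threshold),
    }
```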
| Stage | Time | Purpose |
|---|---|---|
| Cache build | ~8 min | One-shot feature extraction for all columns |
| surrogate-fast | ~1.7s/eval | Quick filter for unpromising changes |
| surrogate-mid | ~15s/eval | GroupKFold with fewer folds |
| product-score | ~2 min | Full weighted family evaluation |
| niamoto-score | ~5 min | Complete offline score with all holdouts |
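The staged funnel amounts to a cheap-to-expensive gate chain; the callables and floor values here are illustrative assumptions, not the project's actual thresholds.

```python
def staged_evaluate(candidate, fast_eval, mid_eval, full_eval,
                    fast_floor, mid_floor):
    # Only candidates that clear the surrogate floors pay for the full score.
    fast = fast_eval(candidate)          # ~1.7 s surrogate-fast
    if fast < fast_floor:
        return None                      # rejected by the quick filter
    mid = mid_eval(candidate)            # ~15 s surrogate-mid
    if mid < mid_floor:
        return None
    return full_eval(candidate)          # minutes-long full evaluation
```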
Batch fusion feature extraction replaced per-column computation. All feature vectors are computed in a single pass over pre-cached header and values probabilities. Training loop and cross-validation run on pre-computed matrices.
Result: identical model outputs, 20x faster iteration. Enables more autoresearch cycles per session.
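The speedup comes from replacing a Python-level loop with one vectorized pass. A minimal sketch, with a two-feature meta block standing in for the much wider real vector:

```python
import numpy as np

def fuse_per_column(header_rows, values_rows):
    # Old shape: one Python call per column.
    return [np.concatenate([h, v, [h.max(), v.max()]])
            for h, v in zip(header_rows, values_rows)]

def fuse_batched(P_header, P_values):
    # New shape: one vectorized pass over pre-cached probability matrices.
    meta = np.column_stack([P_header.max(axis=1), P_values.max(axis=1)])
    return np.hstack([P_header, P_values, meta])
```

Both produce identical matrices, which is the invariant the refactor had to preserve.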
End-to-end pipeline from raw data to evaluation. Each box is an independent step with its own script. Full guide: docs/05-ml-detection/training-guide.md
| Step | Action | Concept% | Delta |
|---|---|---|---|
| V1 | Initial annotations (418 cols, 7 datasets) | 66.5% | — |
| V2 | Fix annotations: taxonomy.name → species, plot_name context | 71.1% | +4.6 |
| V3 | Verify annotations against actual values | 71.3% | +0.2 |
| V4 | Expand alias registry (+13 concepts) | 76.6% | +5.3 |
| V5 | Enrich gold set (niamoto-nc) + retrain + fix basal_area merge | 77.5% | +0.9 |
| V6 | Acquire TAXREF v18 + ETS + sPlotOpen, rebuild gold set, full retrain | 85.4% | +7.9 |
Each technical choice is justified by academic literature. Bracket numbers refer to the References section.
Ecological column names are short strings (1-3 words) sharing Latin and Greek roots across languages. "diametre" (FR) and "diametro" (ES) generate the same trigrams: dia, iam, ame. Character n-grams capture these shared morphemes without a bilingual dictionary.
For strings under 50 characters (the typical length of a column name), character n-grams outperform word-level methods because there are too few tokens for stable statistics [11, 14].
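The shared-morpheme effect is easy to verify directly; this toy extractor pads with spaces in the spirit of char_wb (the scikit-learn implementation differs in detail):

```python
def char_ngrams(s, n=3):
    # char_wb-style: pad with spaces so n-grams respect word boundaries.
    s = f" {s} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

# French and Spanish spellings share most of their trigrams.
shared = char_ngrams("diametre") & char_ngrams("diametro")
```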
With 2,525 labeled columns, a BERT model (110M parameters) risks severe overfitting. TF-IDF + L1 LogReg is an "embarrassingly strong baseline" that produces interpretable coefficients — one can inspect exactly which n-grams drive each prediction.
The L1 penalty automatically selects useful features: out of 5,000 n-grams, only a few hundred receive non-zero weights. It is a built-in filter, not a black box.
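The sparsity claim can be checked directly on a trained model. Here with toy data, so the surviving count is only indicative; the point is that L1 zeros out part of the vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), max_features=5000)
X = vec.fit_transform(["diameter", "diametre", "latitude", "ddlat",
                       "species", "genus"])
y = ["diameter", "diameter", "latitude", "latitude", "taxon", "taxon"]
clf = LogisticRegression(penalty="l1", solver="saga", C=130.0,
                         max_iter=5000).fit(X, y)

nonzero = int(np.count_nonzero(clf.coef_))   # n-grams that survived L1
total = clf.coef_.size
```

Inspecting which n-grams carry non-zero weight per class is what makes the model's decisions auditable.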
| Criterion | TF-IDF + LR (our choice) | Fine-tuned BERT | Sentence Transformers |
|---|---|---|---|
| Data required | 1k+ | 10k-100k | 1k+ |
| Training time | ~3 seconds | 10-60 min (GPU) | 5-30 min |
| Interpretability | L1 coefficients | Black box | Black box |
| Offline / no GPU | Yes | No (GPU) | Yes (22 MB) |
| Model size | ~3 MB | ~440 MB | ~90 MB |
The reference paper on tabular data (Grinsztajn et al., NeurIPS 2022) shows that tree-based models dominate neural networks on small-to-medium datasets (< 10k rows), especially when some features are uninformative — which is exactly our case (e.g., in_lat_range is useless for a non-geographic concept).
Our 37 features are a streamlined, interpretable version of Sherlock's 1,588 features [1], adapted to the size of our gold set.
Columns from the same dataset share naming conventions, data quality, and similar sampling protocols. A naive KFold would leak this shared information between train and test, artificially inflating scores.
This is the exact analogue of spatial block cross-validation used in species distribution modeling — observations from the same geographic block are kept together to avoid spatial autocorrelation.
Roberts et al. (2017) is the reference for structured ecological data: "standard random cross-validation on structured data leads to serious underestimation of predictive error."
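Grouped splitting is what scikit-learn's `GroupKFold` provides; a toy sketch where each group id stands for one source dataset:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 8 columns drawn from 4 datasets; the group id keeps each dataset together.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])

for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    # No dataset ever appears on both sides of a split.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```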
Macro-F1 computes the F1 for each concept separately, then takes the unweighted average. A rare concept (5 examples: wood_density) weighs as much as a frequent one (191 examples: species).
In ecology, rare concepts are often the most scientifically valuable (soil chemistry, phenological dates, conservation status). Accuracy or micro-F1 would mask catastrophic performance on these minority concepts.
| Metric | Treatment of rare classes | Our case (61 concepts, imbalanced distribution) |
|---|---|---|
| Accuracy | Dominated by frequent classes | Can show ~70% even if all rare concepts are misclassified |
| Micro-F1 | Equivalent to accuracy | Same problem |
| Weighted-F1 | Proportional to class size | Masks poor performance on rare concepts |
| Macro-F1 | Equal weight for each class | Forces the model to also detect rare concepts well |
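The failure mode in the table above is easy to reproduce on toy data: a model that ignores the rare class keeps high accuracy but loses half its macro-F1.

```python
from sklearn.metrics import accuracy_score, f1_score

# 9 columns of a frequent concept, 1 of a rare one; the model never
# predicts the rare concept at all.
y_true = ["species"] * 9 + ["wood_density"]
y_pred = ["species"] * 10
acc = accuracy_score(y_true, y_pred)              # 0.9: looks fine
macro = f1_score(y_true, y_pred, average="macro",
                 zero_division=0)                 # rare class F1 = 0
```

Accuracy reports 90% while macro-F1 drops below 0.5, exposing exactly the rare-concept blindness described above.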
Comparison with published approaches in the literature. Our pipeline occupies a justified trade-off between the complexity of deep learning methods and the brittleness of manual rules.
| Approach | Data required | Offline | Reported F1 | Complexity | Niamoto status | Reference |
|---|---|---|---|---|---|---|
| TF-IDF + LR + HistGBT (Niamoto) | 2.54k columns | Yes (scikit-learn) | 0.639 CV / 84.7% eval | Low | Implemented | This work |
| Sherlock (DNN, 1588 features) | 686k columns | Yes (Torch) | 0.89 (weighted) | High | Insufficient data | [1] |
| Sato (CRF + table context) | 686k columns | Yes (Torch) | 0.93 (weighted) | High | Insufficient data | [2] |
| DoDuo (fine-tuned BERT) | 10k+ columns | Yes (GPU) | 0.90+ | High | Insufficient data | [4] |
| LLM zero-shot (GPT-4 / Claude) | 0 | No (API) | 0.85+ (EN) | Low | Gold set enrichment | [5] |
| Sentence Transformers | 1k+ columns | Yes (22-90 MB) | ~0.80 (estimated) | Medium | Not planned | [10] |
| Magneto (SLM + LLM) | Variable | Yes (inference) | SotA | High | To evaluate | [7] |
| Regex / manual rules | 0 | Yes | 0.20-0.40 | Very low | Integrated as features | -- |
Gold set size: With 2,525 columns from 94+ sources, we are 2 to 3 orders of magnitude below the typical needs of Sherlock (686k) or BERT (10k minimum). Fine-tuning a 110M parameter model on 2.5k examples would produce overfitting.
Source heterogeneity: Our data spans 8 languages, 5 continents, and highly varied naming conventions. Academic benchmarks (VizNet, WikiTables) are primarily English.
Offline constraint: Niamoto is designed to work without internet (fieldwork, oceanographic vessels). No embedded GPU.
Estimated break-even: Per Grinsztajn et al. (2022), deep learning surpasses boosted trees beyond ~10,000-50,000 examples. With ~5,000+ labeled columns, it would become relevant to explore Sentence Transformers or a lightweight fine-tuned BERT.
Branch: feat/ml-detection-improvement