ADR 0004 — Generic Import System: Configuration-Driven Architecture¶
Status: Adopted (2025-10-13)
Context¶
Prior to this refactoring, Niamoto’s import system was based on four specialized importers:
TaxonomyImporter— Hardcoded to import totaxon_reftable with nested set calculationsPlotImporter— Hardcoded to import toplot_reftableShapeImporter— Hardcoded to import toshape_reftableOccurrenceImporter— Hardcoded to import tooccurrencestable
Problems with the Legacy System¶
Inflexibility: Could only import pre-defined entity types to fixed tables
Code Duplication: Each importer repeated CSV validation, table creation, and geometry handling logic
Configuration Coupling: Transform/Export plugins directly accessed
taxon_ref,plot_ref,shape_ref— couldn’t work with other entity typesNested Set Overhead: Taxonomy hierarchies required expensive lft/rght recalculation on every import
Limited Extensibility: Adding new entity types (habitats, sites, custom references) required new specialized importers
The system couldn’t support common use cases like:
Importing a third-party taxonomy as a reference
Creating hierarchies from derived data (e.g., extracting taxonomy from occurrence records)
Defining custom reference entities for project-specific needs
Decision¶
We implemented a Generic Import System driven by declarative YAML configuration (import.yml), eliminating all specialized importers in favor of a unified engine.
Core Architecture¶
Entity Registry (
core/imports/registry.py)Central metadata service describing all entities in the system
Stores entity type, physical table name, schema, links, and aliases
Replaces hardcoded assumptions about
taxon_ref,plot_ref,shape_refProvides introspection API for Transform/Export/GUI services
Configuration Models (
core/imports/config_models.py)Pydantic models define entity types:
Dataset,Reference(hierarchical/spatial)Validates connector types, schema fields, hierarchy strategies, enrichment configs
Supports derived references extracted from datasets
Generic Import Engine (
core/imports/engine.py)Orchestrates import execution plan: Datasets → Derived References → Direct References
Uses DuckDB connectors (
read_csv_auto, spatial extension) for efficient ingestionValidates data against schema, builds hierarchies, applies enrichments
Hierarchy Builder (
core/imports/hierarchy_builder.py)Supports multiple strategies: adjacency list, nested set (legacy compatibility)
For derived references: extracts hierarchy from source data columns
Uses DuckDB recursive CTEs for adjacency list construction
Hash-based ID generation ensures stable IDs across reimports
Three Entity Types¶
Datasets — Source data tables (e.g., occurrences, observations)
entities:
datasets:
occurrences:
connector:
type: file
format: csv
path: imports/occurrences.csv
Derived References — Hierarchies extracted from datasets
entities:
references:
taxonomy:
kind: hierarchical
connector:
type: derived
source: occurrences
extraction:
levels:
- name: family
column: family
- name: genus
column: genus
Direct References — Hierarchies loaded from files
entities:
references:
plots:
kind: hierarchical
connector:
type: file
format: csv
path: imports/plots.csv
Plugin Genericization (Phase 8)¶
All 19 plugins were refactored to:
Accept
EntityRegistryinstead ofConfig/DatabaseobjectsResolve table names dynamically via registry instead of hardcoding
Support arbitrary entity types (not just
taxon_ref,plot_ref,shape_ref)
Example transformation:
# Before (coupled to taxon_ref)
def load(self, config: Config):
query = "SELECT * FROM taxon_ref WHERE..."
# After (generic via registry)
def load(self, registry: EntityRegistry, entity_name: str):
table = registry.get_table_name(entity_name)
query = f"SELECT * FROM {table} WHERE..."
Refactored plugins include:
Loaders:
direct_reference,join_table,nested_set,adjacency_list,stats_loaderTransformers:
database_aggregator,field_aggregator,top_ranking,direct_attribute,geospatial_extractor,multi_column_extractor,niamoto_to_dwc_occurrence,shape_processorExporters:
dwc_archive_exporter,html_page_exporter,index_generator,json_api_exporter
Implementation Phases (Completed)¶
Phase 0-3: Foundation¶
✅ Pydantic configuration models
✅ DuckDB integration (see ADR 0001)
✅ Entity Registry implementation
✅ Generic import engine with execution plan
Phase 4-5: Hierarchy Systems¶
✅ Adjacency list builder with hash-based IDs
✅ Derived reference extraction from datasets (see ADR 0003)
✅ Multi-source spatial references
Phase 6-7: Service Integration¶
✅ CLI migration to use EntityRegistry
✅ Transform/Export services consume registry
✅ GUI API endpoints adapted for dynamic entities
Phase 8: Plugin Genericization¶
✅ All 19 plugins refactored to accept EntityRegistry
✅ Dynamic table resolution removes hardcoded assumptions
✅ Plugins now work with any entity type
Consequences¶
Positive¶
Flexibility: Can define any entity type in
import.ymlwithout code changesMaintainability: Single import engine eliminates code duplication
Performance: DuckDB direct ingestion and recursive CTEs are faster than SQLite+pandas
Extensibility: New entity types, connectors, or hierarchy strategies are configuration changes
Decoupling: Plugins resolve tables via registry — no hardcoded dependencies
Stability: Hash-based IDs ensure referential integrity across reimports
Challenges & Migration Requirements¶
Breaking Change: Existing projects must migrate from SQLite schema to DuckDB
Configuration Migration: Old CLI workflows must be converted to
import.ymlformatPlugin Updates: Any custom plugins need EntityRegistry integration
Documentation: Extensive examples needed for new configuration syntax
GUI Adaptation: Interface must support dynamic entity definition (in progress)
Technical Debt Retired¶
❌ Removed:
core/components/imports/(TaxonomyImporter, PlotImporter, ShapeImporter, OccurrenceImporter)❌ Removed:
core/models/models.py(rigid SQLAlchemy models for taxon_ref, plot_ref, shape_ref)❌ Removed:
core/repositories/niamoto_repository.py(tightly coupled data access)✅ Replaced: Hardcoded table names in 19 plugins with dynamic registry resolution
Validation¶
The system has been validated through:
✅ 89 unit tests covering config models, registry, hierarchy builder, engine
✅ Integration tests for full import workflows (datasets → derived → direct)
✅ Plugin tests demonstrating generic entity support
✅ CLI tests for import/transform/export workflows
✅ Real-world usage with New Caledonia biodiversity data (test-instance)
Next Steps¶
GUI Adaptation: Update import wizard to support dynamic entity definition via
import.ymlMigration Guide: Document transition path for existing Niamoto projects
Performance Benchmarking: Compare DuckDB generic engine vs. legacy SQLite importers
Advanced Features: Implement validation rules, conditional enrichments, incremental imports
Status¶
COMPLETE — All 8 implementation phases finished as of 2025-10-13. The generic import system is operational and all legacy importers have been removed. GUI adaptation is the remaining work item.