ADR 0004 — Generic Import System: Configuration-Driven Architecture¶

Status: Adopted (2025-10-13)

Context¶

Prior to this refactoring, Niamoto’s import system was based on four specialized importers:

TaxonomyImporter — Hardcoded to import to taxon_ref table with nested set calculations
PlotImporter — Hardcoded to import to plot_ref table
ShapeImporter — Hardcoded to import to shape_ref table
OccurrenceImporter — Hardcoded to import to occurrences table

Problems with the Legacy System¶

Inflexibility: Could only import pre-defined entity types to fixed tables
Code Duplication: Each importer repeated CSV validation, table creation, and geometry handling logic
Configuration Coupling: Transform/Export plugins directly accessed taxon_ref, plot_ref, shape_ref — couldn’t work with other entity types
Nested Set Overhead: Taxonomy hierarchies required expensive lft/rght recalculation on every import
Limited Extensibility: Adding new entity types (habitats, sites, custom references) required new specialized importers

The system couldn’t support common use cases like:

Importing a third-party taxonomy as a reference
Creating hierarchies from derived data (e.g., extracting taxonomy from occurrence records)
Defining custom reference entities for project-specific needs

Decision¶

We implemented a Generic Import System driven by declarative YAML configuration (import.yml), eliminating all specialized importers in favor of a unified engine.

Core Architecture¶

Entity Registry (core/imports/registry.py)
- Central metadata service describing all entities in the system
- Stores entity type, physical table name, schema, links, and aliases
- Replaces hardcoded assumptions about taxon_ref, plot_ref, shape_ref
- Provides introspection API for Transform/Export/GUI services
Configuration Models (core/imports/config_models.py)
- Pydantic models define entity types: Dataset, Reference (hierarchical/spatial)
- Validates connector types, schema fields, hierarchy strategies, enrichment configs
- Supports derived references extracted from datasets
Generic Import Engine (core/imports/engine.py)
- Orchestrates import execution plan: Datasets → Derived References → Direct References
- Uses DuckDB connectors (read_csv_auto, spatial extension) for efficient ingestion
- Validates data against schema, builds hierarchies, applies enrichments
Hierarchy Builder (core/imports/hierarchy_builder.py)
- Supports multiple strategies: adjacency list, nested set (legacy compatibility)
- For derived references: extracts hierarchy from source data columns
- Uses DuckDB recursive CTEs for adjacency list construction
- Hash-based ID generation ensures stable IDs across reimports

Three Entity Types¶

Datasets — Source data tables (e.g., occurrences, observations)

entities:
  datasets:
    occurrences:
      connector:
        type: file
        format: csv
        path: imports/occurrences.csv

Derived References — Hierarchies extracted from datasets

entities:
  references:
    taxonomy:
      kind: hierarchical
      connector:
        type: derived
        source: occurrences
        extraction:
          levels:
            - name: family
              column: family
            - name: genus
              column: genus

Direct References — Hierarchies loaded from files

entities:
  references:
    plots:
      kind: hierarchical
      connector:
        type: file
        format: csv
        path: imports/plots.csv

Plugin Genericization (Phase 8)¶

All 19 plugins were refactored to:

Accept EntityRegistry instead of Config/Database objects
Resolve table names dynamically via registry instead of hardcoding
Support arbitrary entity types (not just taxon_ref, plot_ref, shape_ref)

Example transformation:

# Before (coupled to taxon_ref)
def load(self, config: Config):
    query = "SELECT * FROM taxon_ref WHERE..."

# After (generic via registry)
def load(self, registry: EntityRegistry, entity_name: str):
    table = registry.get_table_name(entity_name)
    query = f"SELECT * FROM {table} WHERE..."

Refactored plugins include:

Loaders: direct_reference, join_table, nested_set, adjacency_list, stats_loader
Transformers: database_aggregator, field_aggregator, top_ranking, direct_attribute, geospatial_extractor, multi_column_extractor, niamoto_to_dwc_occurrence, shape_processor
Exporters: dwc_archive_exporter, html_page_exporter, index_generator, json_api_exporter

Implementation Phases (Completed)¶

Phase 0-3: Foundation¶

✅ Pydantic configuration models
✅ DuckDB integration (see ADR 0001)
✅ Entity Registry implementation
✅ Generic import engine with execution plan

Phase 4-5: Hierarchy Systems¶

✅ Adjacency list builder with hash-based IDs
✅ Derived reference extraction from datasets (see ADR 0003)
✅ Multi-source spatial references

Phase 6-7: Service Integration¶

✅ CLI migration to use EntityRegistry
✅ Transform/Export services consume registry
✅ GUI API endpoints adapted for dynamic entities

Phase 8: Plugin Genericization¶

✅ All 19 plugins refactored to accept EntityRegistry
✅ Dynamic table resolution removes hardcoded assumptions
✅ Plugins now work with any entity type

Consequences¶

Positive¶

Flexibility: Can define any entity type in import.yml without code changes
Maintainability: Single import engine eliminates code duplication
Performance: DuckDB direct ingestion and recursive CTEs are faster than SQLite+pandas
Extensibility: New entity types, connectors, or hierarchy strategies are configuration changes
Decoupling: Plugins resolve tables via registry — no hardcoded dependencies
Stability: Hash-based IDs ensure referential integrity across reimports

Challenges & Migration Requirements¶

Breaking Change: Existing projects must migrate from SQLite schema to DuckDB
Configuration Migration: Old CLI workflows must be converted to import.yml format
Plugin Updates: Any custom plugins need EntityRegistry integration
Documentation: Extensive examples needed for new configuration syntax
GUI Adaptation: Interface must support dynamic entity definition (in progress)

Technical Debt Retired¶

❌ Removed: core/components/imports/ (TaxonomyImporter, PlotImporter, ShapeImporter, OccurrenceImporter)
❌ Removed: core/models/models.py (rigid SQLAlchemy models for taxon_ref, plot_ref, shape_ref)
❌ Removed: core/repositories/niamoto_repository.py (tightly coupled data access)
✅ Replaced: Hardcoded table names in 19 plugins with dynamic registry resolution

Validation¶

The system has been validated through:

✅ 89 unit tests covering config models, registry, hierarchy builder, engine
✅ Integration tests for full import workflows (datasets → derived → direct)
✅ Plugin tests demonstrating generic entity support
✅ CLI tests for import/transform/export workflows
✅ Real-world usage with New Caledonia biodiversity data (test-instance)

Next Steps¶

GUI Adaptation: Update import wizard to support dynamic entity definition via import.yml
Migration Guide: Document transition path for existing Niamoto projects
Performance Benchmarking: Compare DuckDB generic engine vs. legacy SQLite importers
Advanced Features: Implement validation rules, conditional enrichments, incremental imports

Status¶

COMPLETE — All 8 implementation phases finished as of 2025-10-13. The generic import system is operational and all legacy importers have been removed. GUI adaptation is the remaining work item.