ADR 0004 — Generic Import System: Configuration-Driven Architecture

Status: Adopted (2025-10-13)

Context

Prior to this refactoring, Niamoto’s import system was based on four specialized importers:

  • TaxonomyImporter — Hardcoded to import to taxon_ref table with nested set calculations

  • PlotImporter — Hardcoded to import to plot_ref table

  • ShapeImporter — Hardcoded to import to shape_ref table

  • OccurrenceImporter — Hardcoded to import to occurrences table

Problems with the Legacy System

  1. Inflexibility: Could only import pre-defined entity types to fixed tables

  2. Code Duplication: Each importer repeated CSV validation, table creation, and geometry handling logic

  3. Configuration Coupling: Transform/Export plugins directly accessed taxon_ref, plot_ref, shape_ref — couldn’t work with other entity types

  4. Nested Set Overhead: Taxonomy hierarchies required expensive lft/rght recalculation on every import

  5. Limited Extensibility: Adding new entity types (habitats, sites, custom references) required new specialized importers

The system couldn’t support common use cases like:

  • Importing a third-party taxonomy as a reference

  • Creating hierarchies from derived data (e.g., extracting taxonomy from occurrence records)

  • Defining custom reference entities for project-specific needs

Decision

We implemented a Generic Import System driven by declarative YAML configuration (import.yml), eliminating all specialized importers in favor of a unified engine.

Core Architecture

  1. Entity Registry (core/imports/registry.py)

    • Central metadata service describing all entities in the system

    • Stores entity type, physical table name, schema, links, and aliases

    • Replaces hardcoded assumptions about taxon_ref, plot_ref, shape_ref

    • Provides introspection API for Transform/Export/GUI services

  2. Configuration Models (core/imports/config_models.py)

    • Pydantic models define entity types: Dataset, Reference (hierarchical/spatial)

    • Validates connector types, schema fields, hierarchy strategies, enrichment configs

    • Supports derived references extracted from datasets

  3. Generic Import Engine (core/imports/engine.py)

    • Orchestrates import execution plan: Datasets → Derived References → Direct References

    • Uses DuckDB connectors (read_csv_auto, spatial extension) for efficient ingestion

    • Validates data against schema, builds hierarchies, applies enrichments

  4. Hierarchy Builder (core/imports/hierarchy_builder.py)

    • Supports multiple strategies: adjacency list, nested set (legacy compatibility)

    • For derived references: extracts hierarchy from source data columns

    • Uses DuckDB recursive CTEs for adjacency list construction

    • Hash-based ID generation ensures stable IDs across reimports

Three Entity Types

Datasets — Source data tables (e.g., occurrences, observations)

entities:
  datasets:
    occurrences:
      connector:
        type: file
        format: csv
        path: imports/occurrences.csv

Derived References — Hierarchies extracted from datasets

entities:
  references:
    taxonomy:
      kind: hierarchical
      connector:
        type: derived
        source: occurrences
        extraction:
          levels:
            - name: family
              column: family
            - name: genus
              column: genus

Direct References — Hierarchies loaded from files

entities:
  references:
    plots:
      kind: hierarchical
      connector:
        type: file
        format: csv
        path: imports/plots.csv

Plugin Genericization (Phase 8)

All 19 plugins were refactored to:

  • Accept EntityRegistry instead of Config/Database objects

  • Resolve table names dynamically via registry instead of hardcoding

  • Support arbitrary entity types (not just taxon_ref, plot_ref, shape_ref)

Example transformation:

# Before (coupled to taxon_ref)
def load(self, config: Config):
    query = "SELECT * FROM taxon_ref WHERE..."

# After (generic via registry)
def load(self, registry: EntityRegistry, entity_name: str):
    table = registry.get_table_name(entity_name)
    query = f"SELECT * FROM {table} WHERE..."

Refactored plugins include:

  • Loaders: direct_reference, join_table, nested_set, adjacency_list, stats_loader

  • Transformers: database_aggregator, field_aggregator, top_ranking, direct_attribute, geospatial_extractor, multi_column_extractor, niamoto_to_dwc_occurrence, shape_processor

  • Exporters: dwc_archive_exporter, html_page_exporter, index_generator, json_api_exporter

Implementation Phases (Completed)

Phase 0-3: Foundation

  • ✅ Pydantic configuration models

  • ✅ DuckDB integration (see ADR 0001)

  • ✅ Entity Registry implementation

  • ✅ Generic import engine with execution plan

Phase 4-5: Hierarchy Systems

  • ✅ Adjacency list builder with hash-based IDs

  • ✅ Derived reference extraction from datasets (see ADR 0003)

  • ✅ Multi-source spatial references

Phase 6-7: Service Integration

  • ✅ CLI migration to use EntityRegistry

  • ✅ Transform/Export services consume registry

  • ✅ GUI API endpoints adapted for dynamic entities

Phase 8: Plugin Genericization

  • ✅ All 19 plugins refactored to accept EntityRegistry

  • ✅ Dynamic table resolution removes hardcoded assumptions

  • ✅ Plugins now work with any entity type

Consequences

Positive

  1. Flexibility: Can define any entity type in import.yml without code changes

  2. Maintainability: Single import engine eliminates code duplication

  3. Performance: DuckDB direct ingestion and recursive CTEs are faster than SQLite+pandas

  4. Extensibility: New entity types, connectors, or hierarchy strategies are configuration changes

  5. Decoupling: Plugins resolve tables via registry — no hardcoded dependencies

  6. Stability: Hash-based IDs ensure referential integrity across reimports

Challenges & Migration Requirements

  1. Breaking Change: Existing projects must migrate from SQLite schema to DuckDB

  2. Configuration Migration: Old CLI workflows must be converted to import.yml format

  3. Plugin Updates: Any custom plugins need EntityRegistry integration

  4. Documentation: Extensive examples needed for new configuration syntax

  5. GUI Adaptation: Interface must support dynamic entity definition (in progress)

Technical Debt Retired

  • ❌ Removed: core/components/imports/ (TaxonomyImporter, PlotImporter, ShapeImporter, OccurrenceImporter)

  • ❌ Removed: core/models/models.py (rigid SQLAlchemy models for taxon_ref, plot_ref, shape_ref)

  • ❌ Removed: core/repositories/niamoto_repository.py (tightly coupled data access)

  • ✅ Replaced: Hardcoded table names in 19 plugins with dynamic registry resolution

Validation

The system has been validated through:

  • ✅ 89 unit tests covering config models, registry, hierarchy builder, engine

  • ✅ Integration tests for full import workflows (datasets → derived → direct)

  • ✅ Plugin tests demonstrating generic entity support

  • ✅ CLI tests for import/transform/export workflows

  • ✅ Real-world usage with New Caledonia biodiversity data (test-instance)

Next Steps

  1. GUI Adaptation: Update import wizard to support dynamic entity definition via import.yml

  2. Migration Guide: Document transition path for existing Niamoto projects

  3. Performance Benchmarking: Compare DuckDB generic engine vs. legacy SQLite importers

  4. Advanced Features: Implement validation rules, conditional enrichments, incremental imports

Status

COMPLETE — All 8 implementation phases finished as of 2025-10-13. The generic import system is operational and all legacy importers have been removed. GUI adaptation is the remaining work item.