# ADR 0004 — Generic Import System: Configuration-Driven Architecture

*Status: Adopted (2025-10-13)*

## Context

Prior to this refactoring, Niamoto's import system was based on four specialized importers:

- `TaxonomyImporter` — Hardcoded to import to `taxon_ref` table with nested set calculations
- `PlotImporter` — Hardcoded to import to `plot_ref` table
- `ShapeImporter` — Hardcoded to import to `shape_ref` table
- `OccurrenceImporter` — Hardcoded to import to `occurrences` table

### Problems with the Legacy System

1. **Inflexibility**: Could only import pre-defined entity types to fixed tables
2. **Code Duplication**: Each importer repeated CSV validation, table creation, and geometry handling logic
3. **Configuration Coupling**: Transform/Export plugins accessed `taxon_ref`, `plot_ref`, and `shape_ref` directly, so they could not work with other entity types
4. **Nested Set Overhead**: Taxonomy hierarchies required expensive `lft`/`rght` recalculation on every import
5. **Limited Extensibility**: Adding new entity types (habitats, sites, custom references) required new specialized importers

As a result, the system could not support common use cases such as:
- Importing a third-party taxonomy as a reference
- Creating hierarchies from derived data (e.g., extracting taxonomy from occurrence records)
- Defining custom reference entities for project-specific needs

## Decision

We implemented a **Generic Import System** driven by declarative YAML configuration (`import.yml`), eliminating all specialized importers in favor of a unified engine.

### Core Architecture

1. **Entity Registry** (`core/imports/registry.py`)
   - Central metadata service describing all entities in the system
   - Stores entity type, physical table name, schema, links, and aliases
   - Replaces hardcoded assumptions about `taxon_ref`, `plot_ref`, `shape_ref`
   - Provides introspection API for Transform/Export/GUI services

2. **Configuration Models** (`core/imports/config_models.py`)
   - Pydantic models define entity types: `Dataset`, `Reference` (hierarchical/spatial)
   - Validates connector types, schema fields, hierarchy strategies, enrichment configs
   - Supports derived references extracted from datasets

3. **Generic Import Engine** (`core/imports/engine.py`)
   - Orchestrates import execution plan: Datasets → Derived References → Direct References
   - Uses DuckDB connectors (`read_csv_auto`, spatial extension) for efficient ingestion
   - Validates data against schema, builds hierarchies, applies enrichments

4. **Hierarchy Builder** (`core/imports/hierarchy_builder.py`)
   - Supports multiple strategies: adjacency list, nested set (legacy compatibility)
   - For derived references: extracts hierarchy from source data columns
   - Uses DuckDB recursive CTEs for adjacency list construction
   - Hash-based ID generation ensures stable IDs across reimports
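
To make the registry's role concrete, here is a minimal in-memory sketch of the lookup it provides. Only the `EntityRegistry` name and the `get_table_name` method appear in this ADR; the `EntityMetadata` class, the `register` and `list_entities` methods, the physical table names, and the `taxon_ref` alias are assumptions made for illustration and may not match `core/imports/registry.py`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EntityMetadata:
    """Hypothetical record describing one registered entity."""

    name: str                      # logical name, as written in import.yml
    kind: str                      # "dataset" or "reference"
    table_name: str                # physical table backing the entity
    aliases: tuple[str, ...] = ()  # alternative names that resolve to it


class EntityRegistry:
    """Minimal in-memory stand-in for the registry described above."""

    def __init__(self) -> None:
        self._entities: dict[str, EntityMetadata] = {}
        self._aliases: dict[str, str] = {}

    def register(self, meta: EntityMetadata) -> None:
        self._entities[meta.name] = meta
        for alias in meta.aliases:
            self._aliases[alias] = meta.name

    def get(self, name: str) -> EntityMetadata:
        # Resolve an alias to its canonical name before the lookup.
        canonical = self._aliases.get(name, name)
        if canonical not in self._entities:
            raise KeyError(f"unknown entity: {name!r}")
        return self._entities[canonical]

    def get_table_name(self, name: str) -> str:
        return self.get(name).table_name

    def list_entities(self, kind: str | None = None) -> list[str]:
        # Introspection helper: all entity names, optionally filtered by kind.
        return sorted(
            n for n, m in self._entities.items() if kind is None or m.kind == kind
        )


registry = EntityRegistry()
registry.register(EntityMetadata("occurrences", "dataset", "dataset_occurrences"))
registry.register(
    EntityMetadata("taxonomy", "reference", "entity_taxonomy", aliases=("taxon_ref",))
)
```

With this shape, a caller that still refers to the legacy name `taxon_ref` and one that uses the logical name `taxonomy` both resolve to the same physical table, which is one way aliases could ease migration.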

### Three Entity Types

**Datasets** — Source data tables (e.g., occurrences, observations)
```yaml
entities:
  datasets:
    occurrences:
      connector:
        type: file
        format: csv
        path: imports/occurrences.csv
```

**Derived References** — Hierarchies extracted from datasets
```yaml
entities:
  references:
    taxonomy:
      kind: hierarchical
      connector:
        type: derived
        source: occurrences
        extraction:
          levels:
            - name: family
              column: family
            - name: genus
              column: genus
```

**Direct References** — Hierarchies loaded from files
```yaml
entities:
  references:
    plots:
      kind: hierarchical
      connector:
        type: file
        format: csv
        path: imports/plots.csv
```
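
The three snippets above can be combined into one configuration, and the engine's ordering (Datasets → Derived References → Direct References) follows from the connector type. The sketch below works on the already-parsed dictionary form of such a file. The function name `build_execution_plan` and the phase labels are hypothetical, and the real engine validates the configuration through Pydantic models rather than raw dictionaries.

```python
from __future__ import annotations

# Parsed form of an import.yml that merges the three examples above.
config = {
    "entities": {
        "datasets": {
            "occurrences": {
                "connector": {"type": "file", "format": "csv", "path": "imports/occurrences.csv"},
            },
        },
        "references": {
            "plots": {
                "kind": "hierarchical",
                "connector": {"type": "file", "format": "csv", "path": "imports/plots.csv"},
            },
            "taxonomy": {
                "kind": "hierarchical",
                "connector": {
                    "type": "derived",
                    "source": "occurrences",
                    "extraction": {
                        "levels": [
                            {"name": "family", "column": "family"},
                            {"name": "genus", "column": "genus"},
                        ]
                    },
                },
            },
        },
    },
}


def build_execution_plan(config: dict) -> list[tuple[str, str]]:
    """Return (phase, entity_name) pairs in import order."""
    entities = config.get("entities", {})
    datasets = entities.get("datasets", {})
    references = entities.get("references", {})

    derived, direct = [], []
    for name, spec in references.items():
        connector = spec["connector"]
        if connector["type"] == "derived":
            # A derived reference can only be built once its source dataset exists.
            if connector["source"] not in datasets:
                raise ValueError(f"{name}: unknown source dataset {connector['source']!r}")
            derived.append(name)
        else:
            direct.append(name)

    plan = [("dataset", n) for n in datasets]
    plan += [("derived_reference", n) for n in derived]
    plan += [("direct_reference", n) for n in direct]
    return plan


plan = build_execution_plan(config)
```

Even though `plots` is declared before `taxonomy` in the file, the plan places `taxonomy` right after `occurrences`, since the derived reference depends on that dataset being loaded first.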

### Plugin Genericization (Phase 8)

All 19 plugins were refactored to:
- Accept `EntityRegistry` instead of `Config`/`Database` objects
- Resolve table names dynamically via registry instead of hardcoding
- Support arbitrary entity types (not just `taxon_ref`, `plot_ref`, `shape_ref`)

**Example transformation**:
```python
# Before (coupled to taxon_ref)
def load(self, config: Config):
    query = "SELECT * FROM taxon_ref WHERE..."

# After (generic via registry)
def load(self, registry: EntityRegistry, entity_name: str):
    table = registry.get_table_name(entity_name)
    query = f"SELECT * FROM {table} WHERE..."
```

Refactored plugins include:
- **Loaders**: `direct_reference`, `join_table`, `nested_set`, `adjacency_list`, `stats_loader`
- **Transformers**: `database_aggregator`, `field_aggregator`, `top_ranking`, `direct_attribute`, `geospatial_extractor`, `multi_column_extractor`, `niamoto_to_dwc_occurrence`, `shape_processor`
- **Exporters**: `dwc_archive_exporter`, `html_page_exporter`, `index_generator`, `json_api_exporter`

## Implementation Phases (Completed)

### Phase 0-3: Foundation
- ✅ Pydantic configuration models
- ✅ DuckDB integration (see ADR 0001)
- ✅ Entity Registry implementation
- ✅ Generic import engine with execution plan

### Phase 4-5: Hierarchy Systems
- ✅ Adjacency list builder with hash-based IDs
- ✅ Derived reference extraction from datasets (see ADR 0003)
- ✅ Multi-source spatial references
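
As an illustration of how a recursive CTE works over an adjacency list, the sketch below walks a small `(id, parent_id, name)` table and computes each node's depth and path. It uses the standard-library `sqlite3` module purely so that it runs without extra dependencies; DuckDB also supports `WITH RECURSIVE`, but the table layout, the sample taxa, and the computed columns are assumptions, not the actual hierarchy builder's schema.

```python
import sqlite3

# sqlite3 is a stand-in here; the real system runs its CTEs in DuckDB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxonomy (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO taxonomy VALUES (?, ?, ?)",
    [
        (1, None, "Myrtaceae"),
        (2, 1, "Syzygium"),
        (3, 1, "Eugenia"),
        (4, None, "Araucariaceae"),
        (5, 4, "Araucaria"),
    ],
)

rows = conn.execute(
    """
    WITH RECURSIVE tree(id, name, depth, path) AS (
        -- Anchor: root nodes have no parent.
        SELECT id, name, 0, name
        FROM taxonomy
        WHERE parent_id IS NULL
        UNION ALL
        -- Recursive step: attach each child to its already-visited parent.
        SELECT t.id, t.name, tree.depth + 1, tree.path || '/' || t.name
        FROM taxonomy AS t
        JOIN tree ON t.parent_id = tree.id
    )
    SELECT name, depth, path FROM tree ORDER BY path
    """
).fetchall()

for name, depth, path in rows:
    print("  " * depth + name, f"({path})")
```

Unlike a nested set, inserting a node here touches only one row; depth and ancestry are computed at query time instead of being stored as `lft`/`rght` bounds.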

### Phase 6-7: Service Integration
- ✅ CLI migration to use EntityRegistry
- ✅ Transform/Export services consume registry
- ✅ GUI API endpoints adapted for dynamic entities

### Phase 8: Plugin Genericization
- ✅ All 19 plugins refactored to accept EntityRegistry
- ✅ Dynamic table resolution removes hardcoded assumptions
- ✅ Plugins now work with any entity type

## Consequences

### Positive

1. **Flexibility**: Can define any entity type in `import.yml` without code changes
2. **Maintainability**: Single import engine eliminates code duplication
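3. **Performance**: DuckDB direct ingestion and recursive CTEs are expected to be faster than the SQLite+pandas path; a benchmark against the legacy importers is still pending (see Next Steps)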
4. **Extensibility**: New entity types, connectors, or hierarchy strategies are configuration changes
5. **Decoupling**: Plugins resolve tables via registry — no hardcoded dependencies
6. **Stability**: Hash-based IDs ensure referential integrity across reimports
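
The stability claim in item 6 can be illustrated with a small sketch: if a node's ID is derived from a hash of its full path in the hierarchy, reimporting the same data in a different order, or with duplicate rows, yields the same IDs and parent links. The hashing scheme (SHA-256 over a `/`-joined path, truncated to 64 bits) and the function names `stable_id` and `extract_hierarchy` are assumptions; this ADR does not specify the scheme the hierarchy builder actually uses.

```python
import hashlib


def stable_id(path: tuple[str, ...]) -> int:
    """Derive a deterministic integer ID from a node's full path."""
    key = "/".join(path).encode("utf-8")
    # First 16 hex digits of SHA-256, i.e. a 64-bit integer (assumed width).
    return int(hashlib.sha256(key).hexdigest()[:16], 16)


def extract_hierarchy(rows, levels):
    """Build {node_id: (parent_id, level_name, value)} from flat rows."""
    nodes = {}
    for row in rows:
        parent_id = None
        path = ()
        for level in levels:
            value = row.get(level["column"])
            if not value:
                break  # stop at the first missing level
            path += (value,)
            node_id = stable_id(path)
            nodes[node_id] = (parent_id, level["name"], value)
            parent_id = node_id
    return nodes


# Same level definitions as the derived-reference YAML example.
levels = [
    {"name": "family", "column": "family"},
    {"name": "genus", "column": "genus"},
]
occurrences = [
    {"family": "Myrtaceae", "genus": "Syzygium"},
    {"family": "Myrtaceae", "genus": "Eugenia"},
    {"family": "Araucariaceae", "genus": "Araucaria"},
]

first_import = extract_hierarchy(occurrences, levels)
# Reimport with the rows reversed and one duplicate record appended.
second_import = extract_hierarchy(
    list(reversed(occurrences)) + [{"family": "Myrtaceae", "genus": "Syzygium"}],
    levels,
)
print(first_import == second_import)
```

Because each ID depends only on the node's path, rows in other tables that reference a taxon by ID keep pointing at the same node after a reimport, which sequential auto-increment IDs would not guarantee.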

### Challenges & Migration Requirements

1. **Breaking Change**: Existing projects must migrate from SQLite schema to DuckDB
2. **Configuration Migration**: Old CLI workflows must be converted to `import.yml` format
3. **Plugin Updates**: Any custom plugins need EntityRegistry integration
4. **Documentation**: Extensive examples needed for new configuration syntax
5. **GUI Adaptation**: Interface must support dynamic entity definition (in progress)

### Technical Debt Retired

- ❌ Removed: `core/components/imports/` (TaxonomyImporter, PlotImporter, ShapeImporter, OccurrenceImporter)
- ❌ Removed: `core/models/models.py` (rigid SQLAlchemy models for taxon_ref, plot_ref, shape_ref)
- ❌ Removed: `core/repositories/niamoto_repository.py` (tightly coupled data access)
- ✅ Replaced: Hardcoded table names in 19 plugins with dynamic registry resolution

## Related ADRs

- **ADR 0001** — DuckDB Adoption: Enables efficient generic imports with `read_csv_auto`, recursive CTEs
- **ADR 0002** — Retirement of Specialized Importers: Documents transition strategy from legacy system
- **ADR 0003** — Derived References with DuckDB CTEs: Explains hierarchy extraction architecture

## Validation

The system has been validated through:
- ✅ 89 unit tests covering config models, registry, hierarchy builder, engine
- ✅ Integration tests for full import workflows (datasets → derived → direct)
- ✅ Plugin tests demonstrating generic entity support
- ✅ CLI tests for import/transform/export workflows
- ✅ Real-world usage with New Caledonia biodiversity data (test-instance)

## Next Steps

1. **GUI Adaptation**: Update import wizard to support dynamic entity definition via `import.yml`
2. **Migration Guide**: Document transition path for existing Niamoto projects
3. **Performance Benchmarking**: Compare DuckDB generic engine vs. legacy SQLite importers
4. **Advanced Features**: Implement validation rules, conditional enrichments, incremental imports

## Status

**COMPLETE**: All implementation phases (0 through 8) were finished as of 2025-10-13. The generic import system is operational and all legacy importers have been removed. GUI adaptation is the main remaining work item.