# Darwin Core Export Guide

This guide covers exporting biodiversity data in Darwin Core format, the global standard for sharing biological occurrence information. Niamoto's Darwin Core export produces occurrence records that use the term names expected by GBIF, iDigBio, and other biodiversity data networks.

## Overview

Darwin Core (DwC) is a body of standards developed by the Biodiversity Information Standards organization (TDWG) for sharing information about biological diversity. The standard provides a stable, straightforward, and flexible framework for occurrence records.

### Key Principles

1. **Occurrence-Centric**: Each record represents a single observation or specimen
2. **Flat Structure**: Simple key-value pairs rather than nested hierarchies
3. **Standardized Terms**: Predefined vocabulary for field names and values
4. **Event-Based**: Links occurrences to collection/observation events
5. **Taxon-Linked**: References to taxonomic classification systems

## Darwin Core Export Configuration

### Basic Setup

```yaml
exports:
  - name: dwc_occurrence_json
    enabled: true
    exporter: json_api_exporter    # Reuses JSON API infrastructure

    params:
      output_dir: "exports/dwc/occurrence_json"

      # File naming patterns
      detail_output_pattern: "taxon/{id}_occurrences_dwc.json"
      index_output_pattern: "taxon_index.json"

      # JSON formatting
      json_options:
        indent: 2
        ensure_ascii: false

    groups:
      - group_by: taxon
        # Use Darwin Core transformer
        transformer_plugin: niamoto_to_dwc_occurrence

        # Index file configuration
        index:
          fields:
            - id: taxon_id
            - scientificName: general_info.name.value
            - taxonRank: general_info.rank.value
            - occurrences_count: general_info.occurrences_count.value
            - file_path:
                generator: endpoint_url
                params:
                  base_path: ""
                  pattern: "taxon/{id}_occurrences_dwc.json"

        # Darwin Core transformation parameters
        transformer_params:
          occurrence_list_source: "occurrences"
          mapping:
            # ... (detailed mapping configuration below)
```

### Generated File Structure

This configuration creates:
- `exports/dwc/occurrence_json/taxon/1_occurrences_dwc.json` - DwC occurrences for taxon 1
- `exports/dwc/occurrence_json/taxon/2_occurrences_dwc.json` - DwC occurrences for taxon 2
- `exports/dwc/occurrence_json/taxon_index.json` - Index of available taxon files

**Important**: Only taxa with actual occurrences generate files. Empty taxa are automatically skipped.

## Darwin Core Field Mapping

### Core Record Structure

Darwin Core groups fields into logical categories. Here's the complete mapping configuration:

```yaml
transformer_params:
  occurrence_list_source: "occurrences"
  mapping:
    # ============================================================
    # Record-Level Terms
    # ============================================================
    type: "Occurrence"
    language: "fr"
    license: "CC-BY-4.0"
    rightsHolder: "Niamoto - Nouvelle-Calédonie"
    datasetName: "Niamoto Export - Nouvelle-Calédonie Flore"
    basisOfRecord: "HumanObservation"

    # ============================================================
    # Occurrence Terms
    # ============================================================
    occurrenceID:
      generator: unique_occurrence_id
      params:
        prefix: "niaocc_"
        source_field: "@source.id"

    individualCount: "1"
    occurrenceStatus: "present"

    # ============================================================
    # Event Terms
    # ============================================================
    eventID:
      generator: unique_event_id
      params:
        prefix: "niaevt_"
        source_field: "@source.id"

    eventDate:
      generator: format_event_date
      params:
        source_field: "@source.month_obs"

    year:
      generator: extract_year
      params:
        source_field: "@source.month_obs"

    month:
      generator: extract_month
      params:
        source_field: "@source.month_obs"

    # ============================================================
    # Location Terms
    # ============================================================
    country: "New Caledonia"
    countryCode: "NC"
    stateProvince: "@source.province"

    minimumElevationInMeters: "@source.elevation"
    maximumElevationInMeters: "@source.elevation"

    decimalLatitude:
      generator: format_coordinates
      params:
        source_field: "@source.geo_pt"
        type: "latitude"

    decimalLongitude:
      generator: format_coordinates
      params:
        source_field: "@source.geo_pt"
        type: "longitude"

    geodeticDatum: "WGS84"

    # ============================================================
    # Identification Terms
    # ============================================================
    identificationID:
      generator: unique_identification_id
      params:
        prefix: "niaid_"
        source_field: "@source.id"

    # ============================================================
    # Taxon Terms
    # ============================================================
    taxonID: "@taxon.taxon_id"
    scientificName: "@source.taxonref"
    kingdom: "Plantae"
    family: "@source.family"
    genus: "@source.genus"

    specificEpithet:
      generator: extract_specific_epithet
      params:
        source_field: "@source.taxonref"

    infraspecificEpithet:
      generator: extract_infraspecific_epithet
      params:
        source_field: "@source.taxonref"

    taxonRank: "@taxon.general_info.rank.value"
    scientificNameAuthorship: "@source.taxonref"

    # ============================================================
    # Measurement Extensions
    # ============================================================
    dynamicProperties:
      generator: format_measurements
      params:
        measurements:
          - field: "@source.dbh"
            name: "diameterAtBreastHeight"
            unit: "cm"
          - field: "@source.height"
            name: "height"
            unit: "m"
          - field: "@source.wood_density"
            name: "woodDensity"
            unit: "g/cm³"
          - field: "@source.bark_thickness"
            name: "barkThickness"
            unit: "mm"
          - field: "@source.leaf_area"
            name: "leafArea"
            unit: "cm²"
          - field: "@source.leaf_thickness"
            name: "leafThickness"
            unit: "µm"
          - field: "@source.leaf_sla"
            name: "specificLeafArea"
            unit: "m²/kg"

    # ============================================================
    # Phenology
    # ============================================================
    reproductiveCondition:
      generator: format_phenology
      params:
        flower_field: "@source.flower"
        fruit_field: "@source.fruit"

    # ============================================================
    # Habitat
    # ============================================================
    habitat:
      generator: format_habitat
      params:
        holdridge_field: "@source.holdridge"
        rainfall_field: "@source.rainfall"
        substrate_field: "@source.in_um"
        forest_field: "@source.in_forest"

    # ============================================================
    # Establishment Means
    # ============================================================
    establishmentMeans:
      generator: map_establishment_means
      params:
        endemic_field: "@taxon.general_info.endemic.value"
```

## Reference System Explained

### @source vs @taxon References

The mapping uses a dual reference system:

**@source References**: Point to individual occurrence record data
```yaml
# These come from the 'occurrences' table
decimalLatitude: "@source.geo_pt"      # Individual occurrence location
dbh: "@source.dbh"                     # Tree diameter for this occurrence
month_obs: "@source.month_obs"         # When this occurrence was observed
```

**@taxon References**: Point to taxon-level data
```yaml
# These come from the 'taxon' table (transformed data)
taxonID: "@taxon.taxon_id"             # Taxon identifier
taxonRank: "@taxon.general_info.rank.value"  # Species, genus, etc.
endemic: "@taxon.general_info.endemic.value" # Species endemism status
```

This enables each occurrence record to include both observation-specific data and taxonomic metadata.
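
The lookup behind these references can be pictured as a dotted-path walk over two dictionaries. The sketch below is illustrative only: `resolve_reference` is a hypothetical helper, not Niamoto's actual resolver, and it assumes both the occurrence row and the taxon data are available as plain nested dicts.

```python
from typing import Any


def resolve_reference(ref: Any, source: dict, taxon: dict) -> Any:
    """Resolve "@source.x" / "@taxon.a.b" references; pass other values through.

    Hypothetical helper for illustration, not part of Niamoto's API.
    """
    if not isinstance(ref, str) or not ref.startswith("@"):
        return ref  # static value such as "Plantae"

    root_name, _, path = ref[1:].partition(".")
    current: Any = {"source": source, "taxon": taxon}.get(root_name)

    # Walk the dotted path; any missing step yields None.
    for key in path.split("."):
        if not isinstance(current, dict):
            return None
        current = current.get(key)
    return current


occurrence = {"geo_pt": "POINT (165.7683 -21.6461)", "dbh": 45.2}
taxon = {"taxon_id": 1, "general_info": {"rank": {"value": "species"}}}

dbh = resolve_reference("@source.dbh", occurrence, taxon)
rank = resolve_reference("@taxon.general_info.rank.value", occurrence, taxon)
kingdom = resolve_reference("Plantae", occurrence, taxon)
```

Returning `None` for a missing path, rather than raising, matches the guide's statement that missing Darwin Core terms are exported as null.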

## Generator Functions

### Coordinate Processing

```yaml
decimalLatitude:
  generator: format_coordinates
  params:
    source_field: "@source.geo_pt"
    type: "latitude"
```

**Input**: `"POINT (165.7683 -21.6461)"` (WKT, as returned by PostGIS `ST_AsText`; longitude comes first)
**Output**: `-21.6461` (decimal degrees)

The generator:
- Parses POINT geometry strings
- Validates coordinate ranges (lat: -90 to 90, lng: -180 to 180)
- Returns null for invalid coordinates
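
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the generator's real code: `extract_coordinate` is a hypothetical name, and the regular expression only accepts plain decimal numbers (no exponent notation, no Z/M dimensions).

```python
import re
from typing import Optional

# WKT points are written "POINT (x y)", i.e. longitude first, then latitude.
_POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def extract_coordinate(wkt: Optional[str], kind: str) -> Optional[float]:
    """Return the latitude or longitude of a WKT point, or None if invalid."""
    if kind not in ("latitude", "longitude"):
        raise ValueError(f"unknown coordinate type: {kind!r}")
    if not wkt:
        return None

    match = _POINT_RE.match(wkt)
    if match is None:
        return None  # not a parsable POINT string

    lon, lat = float(match.group(1)), float(match.group(2))
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        return None  # out-of-range values are treated as missing

    return lat if kind == "latitude" else lon


latitude = extract_coordinate("POINT (165.7683 -21.6461)", "latitude")
longitude = extract_coordinate("POINT (165.7683 -21.6461)", "longitude")
```

Here the whole point is rejected when either axis is out of range, so a record never ends up with only one of the two coordinates.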

### Date Formatting

```yaml
eventDate:
  generator: format_event_date
  params:
    source_field: "@source.month_obs"
```

**Input**: `3` (March)
**Output**: `"2023-03"` (ISO 8601 year-month format, assuming the observation year is 2023)

**Input**: `null` or missing
**Output**: `null`
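
The formatting step itself is small. The sketch below is a hedged illustration: `format_year_month` is a hypothetical function, and because this guide does not state where the year comes from, it is passed in explicitly as an assumption.

```python
from typing import Optional


def format_year_month(year: Optional[int], month: Optional[int]) -> Optional[str]:
    """Build an ISO 8601 year-month string, or None when data is missing.

    The explicit `year` argument is an assumption made for this example.
    """
    if year is None or month is None:
        return None
    if not 1 <= month <= 12:
        return None  # invalid month numbers are treated as missing
    return f"{year:04d}-{month:02d}"


event_date = format_year_month(2023, 3)
missing = format_year_month(2023, None)
```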

### Scientific Name Parsing

```yaml
specificEpithet:
  generator: extract_specific_epithet
  params:
    source_field: "@source.taxonref"
```

**Input**: `"Araucaria columnaris (G.Forst.) Hook."`
**Output**: `"columnaris"`

The generator handles:
- Binomial nomenclature parsing
- Infraspecific epithets (subspecies, varieties)
- Author string removal
- Hybrid notation (×)
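
As a rough illustration of the parsing involved, the sketch below extracts the specific epithet only. It is a simplified stand-in, not the plugin's implementation: it assumes the genus is the first token, skips hybrid markers, and accepts the second token only if it looks like a lowercase epithet. Infraspecific ranks are not handled here.

```python
from typing import Optional


def extract_specific_epithet(name: Optional[str]) -> Optional[str]:
    """Return the specific epithet of a scientific name, or None.

    Simplified illustration: real-world names have many more edge cases.
    """
    if not name:
        return None

    # Drop hybrid markers, whether standalone ("×", "x") or attached ("×piperita").
    tokens = [token.lstrip("×") for token in name.split()]
    tokens = [token for token in tokens if token and token != "x"]
    if len(tokens) < 2:
        return None  # genus only

    candidate = tokens[1]
    # Epithets are lowercase words (hyphens allowed); author strings
    # start with a capital letter or "(" and are rejected here.
    if candidate.replace("-", "").isalpha() and candidate.islower():
        return candidate
    return None


epithet = extract_specific_epithet("Araucaria columnaris (G.Forst.) Hook.")
```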

### Measurements Formatting

```yaml
dynamicProperties:
  generator: format_measurements
  params:
    measurements:
      - field: "@source.dbh"
        name: "diameterAtBreastHeight"
        unit: "cm"
      - field: "@source.height"
        name: "height"
        unit: "m"
```

**Output**:
```json
{
  "diameterAtBreastHeight": {"value": 45.2, "unit": "cm"},
  "height": {"value": 12.5, "unit": "m"}
}
```

Only measurements with non-null values are included.
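
The null-skipping behaviour can be sketched as follows. This is an illustrative reimplementation, not the plugin's code, and it assumes the `@source.` prefix has already been resolved so that `field` is a plain key of the occurrence row.

```python
from typing import Dict, List


def format_measurements(record: dict, specs: List[Dict[str, str]]) -> dict:
    """Collect non-null measurements as {name: {"value": ..., "unit": ...}}."""
    result = {}
    for spec in specs:
        value = record.get(spec["field"])
        if value is None:
            continue  # missing measurements are omitted, not exported as null
        result[spec["name"]] = {"value": value, "unit": spec["unit"]}
    return result


specs = [
    {"field": "dbh", "name": "diameterAtBreastHeight", "unit": "cm"},
    {"field": "height", "name": "height", "unit": "m"},
]

properties = format_measurements({"dbh": 45.2, "height": None}, specs)
```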

### Phenology Formatting

```yaml
reproductiveCondition:
  generator: format_phenology
  params:
    flower_field: "@source.flower"
    fruit_field: "@source.fruit"
```

**Input**: `flower: 1, fruit: 0`
**Output**: `"flowering"`

**Input**: `flower: 1, fruit: 1`
**Output**: `"flowering, fruiting"`

**Input**: `flower: 0, fruit: 0`
**Output**: `null`
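
The mapping in these examples amounts to joining the active states. The sketch below is an illustration rather than the generator's source; the fruit-only case returning `"fruiting"` is inferred from the pattern above, not stated in this guide.

```python
from typing import Optional


def format_phenology(flower: Optional[int], fruit: Optional[int]) -> Optional[str]:
    """Turn 0/1 flower and fruit flags into a reproductiveCondition string."""
    states = []
    if flower == 1:
        states.append("flowering")
    if fruit == 1:
        states.append("fruiting")
    # An empty list means neither state was observed, which maps to null.
    return ", ".join(states) or None


condition = format_phenology(1, 1)
```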

### Habitat Description

```yaml
habitat:
  generator: format_habitat
  params:
    holdridge_field: "@source.holdridge"
    rainfall_field: "@source.rainfall"
    substrate_field: "@source.in_um"
    forest_field: "@source.in_forest"
```

**Output**: `"Humid forest on ultramafic substrate, 2500mm annual rainfall"`

Combines multiple environmental variables into a human-readable habitat description.

## Generated Output Format

### Individual Occurrence Record

```json
{
  "type": "Occurrence",
  "language": "fr",
  "license": "CC-BY-4.0",
  "basisOfRecord": "HumanObservation",

  "occurrenceID": "niaocc_12345",
  "individualCount": "1",
  "occurrenceStatus": "present",

  "eventID": "niaevt_12345",
  "eventDate": "2023-03",
  "year": 2023,
  "month": 3,

  "country": "New Caledonia",
  "countryCode": "NC",
  "stateProvince": "Province Sud",
  "decimalLatitude": -21.6461,
  "decimalLongitude": 165.7683,
  "minimumElevationInMeters": 450,
  "maximumElevationInMeters": 450,
  "geodeticDatum": "WGS84",

  "taxonID": "1",
  "scientificName": "Araucaria columnaris (G.Forst.) Hook.",
  "kingdom": "Plantae",
  "family": "Araucariaceae",
  "genus": "Araucaria",
  "specificEpithet": "columnaris",
  "taxonRank": "species",
  "scientificNameAuthorship": "(G.Forst.) Hook.",

  "dynamicProperties": {
    "diameterAtBreastHeight": {"value": 45.2, "unit": "cm"},
    "height": {"value": 12.5, "unit": "m"},
    "woodDensity": {"value": 0.65, "unit": "g/cm³"}
  },

  "reproductiveCondition": "flowering",
  "habitat": "Humid forest on non-ultramafic substrate, 2500mm annual rainfall",
  "establishmentMeans": "native"
}
```

### Taxon Index File

```json
{
  "total": 850,
  "taxon": [
    {
      "id": 1,
      "scientificName": "Araucaria columnaris",
      "taxonRank": "species",
      "occurrences_count": 145,
      "file_path": "taxon/1_occurrences_dwc.json"
    },
    {
      "id": 2,
      "scientificName": "Agathis lanceolata",
      "taxonRank": "species",
      "occurrences_count": 78,
      "file_path": "taxon/2_occurrences_dwc.json"
    }
  ]
}
```
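
A consumer can use the index to locate the per-taxon files. The sketch below assumes the index has the shape shown above and that `file_path` is relative to the export's `output_dir`; `list_taxon_files` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List, Tuple


def list_taxon_files(index: dict, export_root: str) -> List[Tuple[int, Path]]:
    """Pair each taxon id in the index with the path of its occurrence file."""
    root = Path(export_root)
    return [
        (entry["id"], root / entry["file_path"])
        for entry in index.get("taxon", [])
    ]


index = {
    "total": 850,
    "taxon": [
        {"id": 1, "file_path": "taxon/1_occurrences_dwc.json"},
        {"id": 2, "file_path": "taxon/2_occurrences_dwc.json"},
    ],
}

files = list_taxon_files(index, "exports/dwc/occurrence_json")
```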

## Data Quality & Validation

### Automatic Data Cleaning

The Darwin Core transformer automatically:

1. **Validates coordinates**: Ensures lat/lng are within valid ranges
2. **Filters empty taxa**: Skips taxa with no occurrence records
3. **Handles missing data**: Uses null values for missing Darwin Core terms
4. **Standardizes formats**: Converts dates, coordinates, and measurements to standard formats

### Quality Checks

Before export, verify your data quality:

```sql
-- Count occurrences with missing coordinates
SELECT COUNT(*) FROM occurrences
WHERE geo_pt IS NULL OR geo_pt = '';

-- Check taxonomic coverage
SELECT COUNT(DISTINCT taxon_ref_id) FROM occurrences;

-- Check temporal coverage
SELECT MIN(month_obs), MAX(month_obs) FROM occurrences
WHERE month_obs IS NOT NULL;
```

### Common Issues

**Missing Coordinates**:
```json
{
  "decimalLatitude": null,
  "decimalLongitude": null
}
```
Solution: Ensure the `geo_pt` field contains a valid POINT geometry.

**Invalid Dates**:
```json
{
  "eventDate": null,
  "year": null,
  "month": null
}
```
Solution: Check that the `month_obs` field contains valid month numbers (1-12).

**Missing Taxonomic Information**:
```json
{
  "family": null,
  "genus": null
}
```
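Solution: Verify that the `family` and `genus` fields are populated in the occurrence data, since the mapping reads them from `@source.family` and `@source.genus`. Epithets are parsed from `taxonref`, so that field should contain complete scientific names.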

## Darwin Core Standards Compliance

### Required vs Optional Terms

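**Core Terms** (always included in this export; Darwin Core itself mandates no terms, while GBIF requires `occurrenceID`, `basisOfRecord`, `scientificName`, and `eventDate` for occurrence datasets):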
- `type`, `basisOfRecord`, `occurrenceID`, `occurrenceStatus`
- `eventID`, `country`, `countryCode`
- `scientificName`, `kingdom`

**Recommended Terms** (included when available):
- `decimalLatitude`, `decimalLongitude`, `geodeticDatum`
- `eventDate`, `year`, `month`
- `family`, `genus`, `specificEpithet`, `taxonRank`

**Additional Terms** (standard Darwin Core terms carrying domain-specific content):
- `dynamicProperties` - Measurements and morphological data
- `reproductiveCondition` - Phenology information
- `habitat` - Environmental context
- `establishmentMeans` - Native/introduced status
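
A quick check for the always-included terms can be run against exported records. This is a minimal sketch: `missing_core_terms` is a hypothetical helper, and it assumes each record is a flat dict like the sample occurrence shown earlier.

```python
from typing import List

# Terms this guide lists as always included in the export.
CORE_TERMS = (
    "type", "basisOfRecord", "occurrenceID", "occurrenceStatus",
    "eventID", "country", "countryCode", "scientificName", "kingdom",
)


def missing_core_terms(record: dict) -> List[str]:
    """Return the core terms that are absent, null, or empty in a record."""
    return [term for term in CORE_TERMS if record.get(term) in (None, "")]


record = {
    "type": "Occurrence",
    "basisOfRecord": "HumanObservation",
    "occurrenceID": "niaocc_12345",
    "occurrenceStatus": "present",
    "eventID": "niaevt_12345",
    "country": "New Caledonia",
    "countryCode": "NC",
    "scientificName": "Araucaria columnaris (G.Forst.) Hook.",
}

problems = missing_core_terms(record)
```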

### GBIF Compatibility

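The export uses Darwin Core term names that GBIF recognizes. GBIF ingests datasets as Darwin Core Archives (delimited text files plus metadata), so the JSON files need converting before publication:

1. **Occurrence Core**: Field names follow the Darwin Core Occurrence terms used by GBIF
2. **Measurements**: Carried in `dynamicProperties` with explicit names and units
3. **Identification**: Includes an `identificationID` for each record
4. **Data Quality**: Validates coordinates and standardizes formats during export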

### Publishing to GBIF

To publish your Darwin Core data to GBIF:

1. **Register as data publisher**: Create account at gbif.org
2. **Create dataset**: Register your dataset with metadata
3. **Upload data**: Use GBIF IPT or direct API upload
4. **Validate**: GBIF will validate your Darwin Core compliance
5. **Publish**: Data becomes available through GBIF network

## Advanced Configuration

### Custom Field Mapping

Add institution-specific fields:

```yaml
mapping:
  # Standard Darwin Core terms
  scientificName: "@source.taxonref"

  # Custom extensions
  institutionCode: "NC-NIAMOTO"
  collectionCode: "FOREST-PLOTS"

  # Additional measurements
  dynamicProperties:
    generator: format_measurements
    params:
      measurements:
        - field: "@source.custom_trait"
          name: "customMeasurement"
          unit: "unit"
```

### Multiple Export Formats

Export the same data in different Darwin Core formats:

```yaml
exports:
  # JSON format (current)
  - name: dwc_occurrence_json
    exporter: json_api_exporter
    # ... configuration above

  # CSV format (future)
  - name: dwc_occurrence_csv
    exporter: csv_exporter
    transformer_plugin: niamoto_to_dwc_occurrence
    # ... CSV-specific configuration
```

### Filtering Exports

Export specific subsets:

```yaml
transformer_params:
  # Only export certain taxonomic ranks
  filters:
    rank: ["species", "subspecies"]

  # Only export occurrences with coordinates
  required_fields: ["geo_pt"]

  # Date range filtering
  date_range:
    start_year: 2010
    end_year: 2023
```

## Performance Optimization

### Large Dataset Handling

For large occurrence datasets:

```yaml
params:
  # Enable parallel processing
  performance:
    parallel: true
    max_workers: 6
    batch_size: 100

  # Optimize JSON output
  json_options:
    minify: true
    exclude_null: true
    compress: true  # Generate .gz files
```
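
To make the three JSON options concrete, the sketch below shows what `minify`, `exclude_null`, and `compress` plausibly correspond to using only the Python standard library. It illustrates the effect, not Niamoto's implementation, and the helper names are hypothetical.

```python
import gzip
import json
from typing import Any


def drop_nulls(value: Any) -> Any:
    """Recursively remove dict entries whose value is None (exclude_null)."""
    if isinstance(value, dict):
        return {key: drop_nulls(item) for key, item in value.items() if item is not None}
    if isinstance(value, list):
        return [drop_nulls(item) for item in value]
    return value


def encode_compact(records: Any) -> bytes:
    """Serialize without whitespace (minify) and gzip the result (compress)."""
    text = json.dumps(drop_nulls(records), separators=(",", ":"), ensure_ascii=False)
    return gzip.compress(text.encode("utf-8"))


payload = encode_compact([{"occurrenceID": "niaocc_1", "eventDate": None}])
restored = json.loads(gzip.decompress(payload).decode("utf-8"))
```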

### Memory Management

Monitor memory usage during export:

```bash
# Run the export with verbose logging (watch memory with an OS tool such as top)
niamoto export --target dwc_occurrence_json --verbose

# Check generated file sizes
ls -lh exports/dwc/occurrence_json/
```

For very large exports, consider chunking by taxonomic groups or geographic regions.

## Integration Examples

### Research Applications

**Ecological Modeling**:
```python
import requests
import pandas as pd

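# Load occurrence data (replace the placeholder URL with your own deployment)
response = requests.get('http://your-site.com/api/taxon/1_occurrences_dwc.json', timeout=30)
response.raise_for_status()  # stop early on HTTP errors instead of parsing an error page
occurrences = response.json()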

# Convert to pandas DataFrame
df = pd.DataFrame(occurrences)

# Extract coordinates for species distribution modeling
coordinates = df[['decimalLatitude', 'decimalLongitude']].dropna()
```

**Biodiversity Analysis**:
```r
library(jsonlite)
library(dplyr)

# Load all taxon data
index <- fromJSON("http://your-site.com/api/taxon_index.json")

# Get species with most occurrences
top_species <- index$taxon %>%
  filter(taxonRank == "species") %>%
  arrange(desc(occurrences_count)) %>%
  head(10)
```

### Data Aggregation Platforms

Darwin Core is the exchange format used by the following networks, so the exported terms map directly onto their data models:

- **GBIF**: Global biodiversity data network
- **iDigBio**: Integrated Digitized Biocollections
- **VertNet**: Vertebrate specimen networks
- **Regional nodes**: National and regional biodiversity portals

## Troubleshooting

### Common Export Issues

**No files generated**:
```
0 files generated for dwc_occurrence_json
```
- Check that taxa have associated occurrences
- Verify database relationships (taxon_id → taxon_ref_id → occurrences)
- Check transformer configuration syntax

**Malformed JSON output**:
```
JSON decode error in generated files
```
- Validate generator function outputs
- Check for circular references in data
- Verify field mapping syntax

**Performance issues**:
```
Export taking too long
```
- Enable parallel processing
- Add database indexes on join columns
- Consider chunking large exports

### Validation Tools

Validate generated Darwin Core data:

```bash
# Check JSON syntax
python -m json.tool taxon/1_occurrences_dwc.json

# Validate Darwin Core compliance (external tool)
dwca-validator validate-json taxon/1_occurrences_dwc.json
```
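
To check every generated file rather than one at a time, a short script can walk the export directory. The sketch below uses only the standard library; `find_invalid_json` is a hypothetical helper, and it checks JSON syntax only, not Darwin Core compliance.

```python
import json
from pathlib import Path
from typing import List, Tuple


def find_invalid_json(directory: str) -> List[Tuple[Path, str]]:
    """Return (path, error message) for every *.json file that fails to parse."""
    invalid = []
    for path in sorted(Path(directory).rglob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                json.load(handle)
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            invalid.append((path, str(exc)))
    return invalid


if __name__ == "__main__":
    # Path taken from the output_dir used in this guide's configuration.
    for path, error in find_invalid_json("exports/dwc/occurrence_json"):
        print(f"{path}: {error}")
```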

## Related Documentation

- [Reference overview](../../06-reference/README.md) - General JSON API exports and YAML configuration syntax
- [Plugin Reference](../README.md) - Transformer plugin development
- [Publish](../../02-user-guide/publish.md) - Using exported data

For GBIF-specific publishing guidance, see the [GBIF IPT User Manual](https://ipt.gbif.org/).