Darwin Core Export Guide¶

This guide covers exporting biodiversity data in Darwin Core format, the global standard for sharing biological occurrence information. Niamoto’s Darwin Core export enables seamless integration with GBIF, iDigBio, and other biodiversity data networks.

Overview¶

Darwin Core (DwC) is a body of standards developed by the Biodiversity Information Standards organization (TDWG) for sharing information about biological diversity. The standard provides a stable, straightforward, and flexible framework for occurrence records.

Key Principles¶

Occurrence-Centric: Each record represents a single observation or specimen
Flat Structure: Simple key-value pairs rather than nested hierarchies
Standardized Terms: Predefined vocabulary for field names and values
Event-Based: Links occurrences to collection/observation events
Taxon-Linked: References to taxonomic classification systems

Darwin Core Export Configuration¶

Basic Setup¶

exports:
  - name: dwc_occurrence_json
    enabled: true
    exporter: json_api_exporter    # Reuses JSON API infrastructure

    params:
      output_dir: "exports/dwc/occurrence_json"

      # File naming patterns
      detail_output_pattern: "taxon/{id}_occurrences_dwc.json"
      index_output_pattern: "taxon_index.json"

      # JSON formatting
      json_options:
        indent: 2
        ensure_ascii: false

    groups:
      - group_by: taxon
        # Use Darwin Core transformer
        transformer_plugin: niamoto_to_dwc_occurrence

        # Index file configuration
        index:
          fields:
            - id: taxon_id
            - scientificName: general_info.name.value
            - taxonRank: general_info.rank.value
            - occurrences_count: general_info.occurrences_count.value
            - file_path:
                generator: endpoint_url
                params:
                  base_path: ""
                  pattern: "taxon/{id}_occurrences_dwc.json"

        # Darwin Core transformation parameters
        transformer_params:
          occurrence_list_source: "occurrences"
          mapping:
            # ... (detailed mapping configuration below)

Generated File Structure¶

This configuration creates:

exports/dwc/occurrence_json/taxon/1_occurrences_dwc.json - DwC occurrences for taxon 1
exports/dwc/occurrence_json/taxon/2_occurrences_dwc.json - DwC occurrences for taxon 2
exports/dwc/occurrence_json/taxon_index.json - Index of available taxon files

Important: Only taxa with actual occurrences generate files. Empty taxa are automatically skipped.

Darwin Core Field Mapping¶

Core Record Structure¶

Darwin Core groups fields into logical categories. Here’s the complete mapping configuration:

transformer_params:
  occurrence_list_source: "occurrences"
  mapping:
    # ============================================================
    # Record-Level Terms
    # ============================================================
    type: "Occurrence"
    language: "fr"
    license: "CC-BY-4.0"
    rightsHolder: "Niamoto - Nouvelle-Calédonie"
    datasetName: "Niamoto Export - Nouvelle-Calédonie Flore"
    basisOfRecord: "HumanObservation"

    # ============================================================
    # Occurrence Terms
    # ============================================================
    occurrenceID:
      generator: unique_occurrence_id
      params:
        prefix: "niaocc_"
        source_field: "@source.id"

    individualCount: "1"
    occurrenceStatus: "present"

    # ============================================================
    # Event Terms
    # ============================================================
    eventID:
      generator: unique_event_id
      params:
        prefix: "niaevt_"
        source_field: "@source.id"

    eventDate:
      generator: format_event_date
      params:
        source_field: "@source.month_obs"

    year:
      generator: extract_year
      params:
        source_field: "@source.month_obs"

    month:
      generator: extract_month
      params:
        source_field: "@source.month_obs"

    # ============================================================
    # Location Terms
    # ============================================================
    country: "New Caledonia"
    countryCode: "NC"
    stateProvince: "@source.province"

    minimumElevationInMeters: "@source.elevation"
    maximumElevationInMeters: "@source.elevation"

    decimalLatitude:
      generator: format_coordinates
      params:
        source_field: "@source.geo_pt"
        type: "latitude"

    decimalLongitude:
      generator: format_coordinates
      params:
        source_field: "@source.geo_pt"
        type: "longitude"

    geodeticDatum: "WGS84"

    # ============================================================
    # Identification Terms
    # ============================================================
    identificationID:
      generator: unique_identification_id
      params:
        prefix: "niaid_"
        source_field: "@source.id"

    # ============================================================
    # Taxon Terms
    # ============================================================
    taxonID: "@taxon.taxon_id"
    scientificName: "@source.taxonref"
    kingdom: "Plantae"
    family: "@source.family"
    genus: "@source.genus"

    specificEpithet:
      generator: extract_specific_epithet
      params:
        source_field: "@source.taxonref"

    infraspecificEpithet:
      generator: extract_infraspecific_epithet
      params:
        source_field: "@source.taxonref"

    taxonRank: "@taxon.general_info.rank.value"
    scientificNameAuthorship: "@source.taxonref"

    # ============================================================
    # Measurement Extensions
    # ============================================================
    dynamicProperties:
      generator: format_measurements
      params:
        measurements:
          - field: "@source.dbh"
            name: "diameterAtBreastHeight"
            unit: "cm"
          - field: "@source.height"
            name: "height"
            unit: "m"
          - field: "@source.wood_density"
            name: "woodDensity"
            unit: "g/cm³"
          - field: "@source.bark_thickness"
            name: "barkThickness"
            unit: "mm"
          - field: "@source.leaf_area"
            name: "leafArea"
            unit: "cm²"
          - field: "@source.leaf_thickness"
            name: "leafThickness"
            unit: "µm"
          - field: "@source.leaf_sla"
            name: "specificLeafArea"
            unit: "m²/kg"

    # ============================================================
    # Phenology
    # ============================================================
    reproductiveCondition:
      generator: format_phenology
      params:
        flower_field: "@source.flower"
        fruit_field: "@source.fruit"

    # ============================================================
    # Habitat
    # ============================================================
    habitat:
      generator: format_habitat
      params:
        holdridge_field: "@source.holdridge"
        rainfall_field: "@source.rainfall"
        substrate_field: "@source.in_um"
        forest_field: "@source.in_forest"

    # ============================================================
    # Establishment Means
    # ============================================================
    establishmentMeans:
      generator: map_establishment_means
      params:
        endemic_field: "@taxon.general_info.endemic.value"

Reference System Explained¶

@source vs @taxon References¶

The mapping uses a dual reference system:

@source References: Point to individual occurrence record data

# These come from the 'occurrences' table
decimalLatitude: "@source.geo_pt"      # Individual occurrence location
dbh: "@source.dbh"                     # Tree diameter for this occurrence
month_obs: "@source.month_obs"         # When this occurrence was observed

@taxon References: Point to taxon-level data

# These come from the 'taxon' table (transformed data)
taxonID: "@taxon.taxon_id"             # Taxon identifier
taxonRank: "@taxon.general_info.rank.value"  # Species, genus, etc.
endemic: "@taxon.general_info.endemic.value" # Species endemism status

This enables each occurrence record to include both observation-specific data and taxonomic metadata.

Generator Functions¶

Coordinate Processing¶

decimalLatitude:
  generator: format_coordinates
  params:
    source_field: "@source.geo_pt"
    type: "latitude"

Input: "POINT (165.7683 -21.6461)" (PostGIS format) Output: -21.6461 (decimal degrees)

The generator:

Parses POINT geometry strings
Validates coordinate ranges (lat: -90 to 90, lng: -180 to 180)
Returns null for invalid coordinates

Date Formatting¶

eventDate:
  generator: format_event_date
  params:
    source_field: "@source.month_obs"

Input: 3 (March) Output: "2023-03" (ISO 8601 year-month format)

Input: null or missing Output: null

Scientific Name Parsing¶

specificEpithet:
  generator: extract_specific_epithet
  params:
    source_field: "@source.taxonref"

Input: "Araucaria columnaris (G.Forst.) Hook." Output: "columnaris"

The generator handles:

Binomial nomenclature parsing
Infraspecific epithets (subspecies, varieties)
Author string removal
Hybrid notation (×)

Measurements Formatting¶

dynamicProperties:
  generator: format_measurements
  params:
    measurements:
      - field: "@source.dbh"
        name: "diameterAtBreastHeight"
        unit: "cm"
      - field: "@source.height"
        name: "height"
        unit: "m"

Output:

{
  "diameterAtBreastHeight": {"value": 45.2, "unit": "cm"},
  "height": {"value": 12.5, "unit": "m"}
}

Only includes measurements with non-null values.

Phenology Formatting¶

reproductiveCondition:
  generator: format_phenology
  params:
    flower_field: "@source.flower"
    fruit_field: "@source.fruit"

Input: flower: 1, fruit: 0 Output: "flowering"

Input: flower: 1, fruit: 1 Output: "flowering, fruiting"

Input: flower: 0, fruit: 0 Output: null

Habitat Description¶

habitat:
  generator: format_habitat
  params:
    holdridge_field: "@source.holdridge"
    rainfall_field: "@source.rainfall"
    substrate_field: "@source.in_um"
    forest_field: "@source.in_forest"

Output: "Humid forest on ultramafic substrate, 2500mm annual rainfall"

Combines multiple environmental variables into a human-readable habitat description.

Generated Output Format¶

Individual Occurrence Record¶

{
  "type": "Occurrence",
  "language": "fr",
  "license": "CC-BY-4.0",
  "basisOfRecord": "HumanObservation",

  "occurrenceID": "niaocc_12345",
  "individualCount": "1",
  "occurrenceStatus": "present",

  "eventID": "niaevt_12345",
  "eventDate": "2023-03",
  "year": 2023,
  "month": 3,

  "country": "New Caledonia",
  "countryCode": "NC",
  "stateProvince": "Province Sud",
  "decimalLatitude": -21.6461,
  "decimalLongitude": 165.7683,
  "minimumElevationInMeters": 450,
  "maximumElevationInMeters": 450,
  "geodeticDatum": "WGS84",

  "taxonID": "1",
  "scientificName": "Araucaria columnaris (G.Forst.) Hook.",
  "kingdom": "Plantae",
  "family": "Araucariaceae",
  "genus": "Araucaria",
  "specificEpithet": "columnaris",
  "taxonRank": "species",
  "scientificNameAuthorship": "(G.Forst.) Hook.",

  "dynamicProperties": {
    "diameterAtBreastHeight": {"value": 45.2, "unit": "cm"},
    "height": {"value": 12.5, "unit": "m"},
    "woodDensity": {"value": 0.65, "unit": "g/cm³"}
  },

  "reproductiveCondition": "flowering",
  "habitat": "Humid forest on non-ultramafic substrate, 2500mm annual rainfall",
  "establishmentMeans": "native"
}

Taxon Index File¶

{
  "total": 850,
  "taxon": [
    {
      "id": 1,
      "scientificName": "Araucaria columnaris",
      "taxonRank": "species",
      "occurrences_count": 145,
      "file_path": "taxon/1_occurrences_dwc.json"
    },
    {
      "id": 2,
      "scientificName": "Agathis lanceolata",
      "taxonRank": "species",
      "occurrences_count": 78,
      "file_path": "taxon/2_occurrences_dwc.json"
    }
  ]
}

Data Quality & Validation¶

Automatic Data Cleaning¶

The Darwin Core transformer automatically:

Validates coordinates: Ensures lat/lng are within valid ranges
Filters empty occurrences: Skips taxa with no occurrence records
Handles missing data: Uses null values for missing Darwin Core terms
Standardizes formats: Converts dates, coordinates, and measurements to standard formats

Quality Checks¶

Before export, verify your data quality:

-- Check for valid coordinates
SELECT COUNT(*) FROM occurrences
WHERE geo_pt IS NULL OR geo_pt = '';

-- Check taxonomic coverage
SELECT COUNT(DISTINCT taxon_ref_id) FROM occurrences;

-- Check temporal coverage
SELECT MIN(month_obs), MAX(month_obs) FROM occurrences
WHERE month_obs IS NOT NULL;

Common Issues¶

Missing Coordinates:

{
  "decimalLatitude": null,
  "decimalLongitude": null
}

Solution: Ensure geo_pt field contains valid POINT geometry.

Invalid Dates:

{
  "eventDate": null,
  "year": null,
  "month": null
}

Solution: Check month_obs field contains valid month numbers (1-12).

Missing Taxonomic Information:

{
  "family": null,
  "genus": null
}

Solution: Verify taxonref field contains complete scientific names.

Darwin Core Standards Compliance¶

Required vs Optional Terms¶

Core Required Terms (always included):

type, basisOfRecord, occurrenceID, occurrenceStatus
eventID, country, countryCode
scientificName, kingdom

Recommended Terms (included when available):

decimalLatitude, decimalLongitude, geodeticDatum
eventDate, year, month
family, genus, specificEpithet, taxonRank

Extension Terms (domain-specific):

dynamicProperties - Measurements and morphological data
reproductiveCondition - Phenology information
habitat - Environmental context
establishmentMeans - Native/introduced status

GBIF Compatibility¶

The export format is fully compatible with GBIF ingestion:

Occurrence Core: Follows GBIF occurrence schema
Measurement Extensions: Uses GBIF measurement vocabulary
Identification: Includes identification metadata
Data Quality: Validates required fields and formats

Publishing to GBIF¶

To publish your Darwin Core data to GBIF:

Register as data publisher: Create account at gbif.org
Create dataset: Register your dataset with metadata
Upload data: Use GBIF IPT or direct API upload
Validate: GBIF will validate your Darwin Core compliance
Publish: Data becomes available through GBIF network

Advanced Configuration¶

Custom Field Mapping¶

Add institution-specific fields:

mapping:
  # Standard Darwin Core terms
  scientificName: "@source.taxonref"

  # Custom extensions
  institutionCode: "NC-NIAMOTO"
  collectionCode: "FOREST-PLOTS"

  # Additional measurements
  dynamicProperties:
    generator: format_measurements
    params:
      measurements:
        - field: "@source.custom_trait"
          name: "customMeasurement"
          unit: "unit"

Multiple Export Formats¶

Export the same data in different Darwin Core formats:

exports:
  # JSON format (current)
  - name: dwc_occurrence_json
    exporter: json_api_exporter
    # ... configuration above

  # CSV format (future)
  - name: dwc_occurrence_csv
    exporter: csv_exporter
    transformer_plugin: niamoto_to_dwc_occurrence
    # ... CSV-specific configuration

Filtering Exports¶

Export specific subsets:

transformer_params:
  # Only export certain taxonomic ranks
  filters:
    rank: ["species", "subspecies"]

  # Only export occurrences with coordinates
  required_fields: ["geo_pt"]

  # Date range filtering
  date_range:
    start_year: 2010
    end_year: 2023

Performance Optimization¶

Large Dataset Handling¶

For large occurrence datasets:

params:
  # Enable parallel processing
  performance:
    parallel: true
    max_workers: 6
    batch_size: 100

  # Optimize JSON output
  json_options:
    minify: true
    exclude_null: true
    compress: true  # Generate .gz files

Memory Management¶

Monitor memory usage during export:

# Run export with memory monitoring
niamoto export --target dwc_occurrence_json --verbose

# Check generated file sizes
ls -lh exports/dwc/occurrence_json/

For very large exports, consider chunking by taxonomic groups or geographic regions.

Integration Examples¶

Research Applications¶

Ecological Modeling:

import requests
import pandas as pd

# Load occurrence data
response = requests.get('http://your-site.com/api/taxon/1_occurrences_dwc.json')
occurrences = response.json()

# Convert to pandas DataFrame
df = pd.DataFrame(occurrences)

# Extract coordinates for species distribution modeling
coordinates = df[['decimalLatitude', 'decimalLongitude']].dropna()

Biodiversity Analysis:

library(jsonlite)
library(dplyr)

# Load all taxon data
index <- fromJSON("http://your-site.com/api/taxon_index.json")

# Get species with most occurrences
top_species <- index$taxon %>%
  filter(taxonRank == "species") %>%
  arrange(desc(occurrences_count)) %>%
  head(10)

Data Aggregation Platforms¶

The Darwin Core export integrates seamlessly with:

GBIF: Global biodiversity data network
iDigBio: Integrated Digitized Biocollections
VertNet: Vertebrate specimen networks
Regional nodes: National and regional biodiversity portals

Troubleshooting¶

Common Export Issues¶

No files generated:

0 files generated for dwc_occurrence_json

Check that taxa have associated occurrences
Verify database relationships (taxon_id → taxon_ref_id → occurrences)
Check transformer configuration syntax

Malformed JSON output:

JSON decode error in generated files

Validate generator function outputs
Check for circular references in data
Verify field mapping syntax

Performance issues:

Export taking too long

Enable parallel processing
Add database indexes on join columns
Consider chunking large exports

Validation Tools¶

Validate generated Darwin Core data:

# Check JSON syntax
python -m json.tool taxon/1_occurrences_dwc.json

# Validate Darwin Core compliance (external tool)
dwca-validator validate-json taxon/1_occurrences_dwc.json