Darwin Core Export Guide¶
This guide covers exporting biodiversity data in Darwin Core format, the global standard for sharing biological occurrence information. Niamoto’s Darwin Core export produces occurrence records structured for sharing with GBIF, iDigBio, and other biodiversity data networks.
Overview¶
Darwin Core (DwC) is a body of standards developed by the Biodiversity Information Standards organization (TDWG) for sharing information about biological diversity. The standard provides a stable, straightforward, and flexible framework for occurrence records.
Key Principles¶
Occurrence-Centric: Each record represents a single observation or specimen
Flat Structure: Simple key-value pairs rather than nested hierarchies
Standardized Terms: Predefined vocabulary for field names and values
Event-Based: Links occurrences to collection/observation events
Taxon-Linked: References to taxonomic classification systems
Darwin Core Export Configuration¶
Basic Setup¶
exports:
  - name: dwc_occurrence_json
    enabled: true
    exporter: json_api_exporter  # Reuses JSON API infrastructure
    params:
      output_dir: "exports/dwc/occurrence_json"
      # File naming patterns
      detail_output_pattern: "taxon/{id}_occurrences_dwc.json"
      index_output_pattern: "taxon_index.json"
      # JSON formatting
      json_options:
        indent: 2
        ensure_ascii: false
    groups:
      - group_by: taxon
        # Use Darwin Core transformer
        transformer_plugin: niamoto_to_dwc_occurrence
        # Index file configuration
        index:
          fields:
            - id: taxon_id
            - scientificName: general_info.name.value
            - taxonRank: general_info.rank.value
            - occurrences_count: general_info.occurrences_count.value
            - file_path:
                generator: endpoint_url
                params:
                  base_path: ""
                  pattern: "taxon/{id}_occurrences_dwc.json"
        # Darwin Core transformation parameters
        transformer_params:
          occurrence_list_source: "occurrences"
          mapping:
            # ... (detailed mapping configuration below)
Generated File Structure¶
This configuration creates:
exports/dwc/occurrence_json/taxon/1_occurrences_dwc.json - DwC occurrences for taxon 1
exports/dwc/occurrence_json/taxon/2_occurrences_dwc.json - DwC occurrences for taxon 2
exports/dwc/occurrence_json/taxon_index.json - Index of available taxon files
Important: Only taxa with actual occurrences generate files. Empty taxa are automatically skipped.
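The naming rule above can be sketched in Python. This is a minimal illustration, not Niamoto's implementation: the function `planned_files` and the shape of the `taxa` list are hypothetical, and only the output directory and the file pattern are taken from the configuration.

```python
from pathlib import PurePosixPath


def planned_files(taxa, output_dir, detail_pattern):
    """Return the detail files an export would write, skipping empty taxa."""
    files = []
    for taxon in taxa:
        if not taxon["occurrences"]:
            continue  # taxa without occurrences produce no file
        relative = detail_pattern.format(id=taxon["id"])
        files.append(str(PurePosixPath(output_dir) / relative))
    return files


# Hypothetical input: taxon 2 has no occurrences and is skipped.
taxa = [
    {"id": 1, "occurrences": [{"id": 12345}]},
    {"id": 2, "occurrences": []},
    {"id": 3, "occurrences": [{"id": 67890}]},
]
paths = planned_files(
    taxa, "exports/dwc/occurrence_json", "taxon/{id}_occurrences_dwc.json"
)
print(paths)
```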
Darwin Core Field Mapping¶
Core Record Structure¶
Darwin Core groups fields into logical categories. Here’s the complete mapping configuration:
transformer_params:
  occurrence_list_source: "occurrences"
  mapping:
    # ============================================================
    # Record-Level Terms
    # ============================================================
    type: "Occurrence"
    language: "fr"
    license: "CC-BY-4.0"
    rightsHolder: "Niamoto - Nouvelle-Calédonie"
    datasetName: "Niamoto Export - Nouvelle-Calédonie Flore"
    basisOfRecord: "HumanObservation"
    # ============================================================
    # Occurrence Terms
    # ============================================================
    occurrenceID:
      generator: unique_occurrence_id
      params:
        prefix: "niaocc_"
        source_field: "@source.id"
    individualCount: "1"
    occurrenceStatus: "present"
    # ============================================================
    # Event Terms
    # ============================================================
    eventID:
      generator: unique_event_id
      params:
        prefix: "niaevt_"
        source_field: "@source.id"
    eventDate:
      generator: format_event_date
      params:
        source_field: "@source.month_obs"
    year:
      generator: extract_year
      params:
        source_field: "@source.month_obs"
    month:
      generator: extract_month
      params:
        source_field: "@source.month_obs"
    # ============================================================
    # Location Terms
    # ============================================================
    country: "New Caledonia"
    countryCode: "NC"
    stateProvince: "@source.province"
    minimumElevationInMeters: "@source.elevation"
    maximumElevationInMeters: "@source.elevation"
    decimalLatitude:
      generator: format_coordinates
      params:
        source_field: "@source.geo_pt"
        type: "latitude"
    decimalLongitude:
      generator: format_coordinates
      params:
        source_field: "@source.geo_pt"
        type: "longitude"
    geodeticDatum: "WGS84"
    # ============================================================
    # Identification Terms
    # ============================================================
    identificationID:
      generator: unique_identification_id
      params:
        prefix: "niaid_"
        source_field: "@source.id"
    # ============================================================
    # Taxon Terms
    # ============================================================
    taxonID: "@taxon.taxon_id"
    scientificName: "@source.taxonref"
    kingdom: "Plantae"
    family: "@source.family"
    genus: "@source.genus"
    specificEpithet:
      generator: extract_specific_epithet
      params:
        source_field: "@source.taxonref"
    infraspecificEpithet:
      generator: extract_infraspecific_epithet
      params:
        source_field: "@source.taxonref"
    taxonRank: "@taxon.general_info.rank.value"
    scientificNameAuthorship: "@source.taxonref"
    # ============================================================
    # Measurement Extensions
    # ============================================================
    dynamicProperties:
      generator: format_measurements
      params:
        measurements:
          - field: "@source.dbh"
            name: "diameterAtBreastHeight"
            unit: "cm"
          - field: "@source.height"
            name: "height"
            unit: "m"
          - field: "@source.wood_density"
            name: "woodDensity"
            unit: "g/cm³"
          - field: "@source.bark_thickness"
            name: "barkThickness"
            unit: "mm"
          - field: "@source.leaf_area"
            name: "leafArea"
            unit: "cm²"
          - field: "@source.leaf_thickness"
            name: "leafThickness"
            unit: "µm"
          - field: "@source.leaf_sla"
            name: "specificLeafArea"
            unit: "m²/kg"
    # ============================================================
    # Phenology
    # ============================================================
    reproductiveCondition:
      generator: format_phenology
      params:
        flower_field: "@source.flower"
        fruit_field: "@source.fruit"
    # ============================================================
    # Habitat
    # ============================================================
    habitat:
      generator: format_habitat
      params:
        holdridge_field: "@source.holdridge"
        rainfall_field: "@source.rainfall"
        substrate_field: "@source.in_um"
        forest_field: "@source.in_forest"
    # ============================================================
    # Establishment Means
    # ============================================================
    establishmentMeans:
      generator: map_establishment_means
      params:
        endemic_field: "@taxon.general_info.endemic.value"
Reference System Explained¶
@source vs @taxon References¶
The mapping uses a dual reference system:
@source References: Point to individual occurrence record data
# These come from the 'occurrences' table
decimalLatitude: "@source.geo_pt" # Individual occurrence location
dbh: "@source.dbh" # Tree diameter for this occurrence
month_obs: "@source.month_obs" # When this occurrence was observed
@taxon References: Point to taxon-level data
# These come from the 'taxon' table (transformed data)
taxonID: "@taxon.taxon_id" # Taxon identifier
taxonRank: "@taxon.general_info.rank.value" # Species, genus, etc.
endemic: "@taxon.general_info.endemic.value" # Species endemism status
This enables each occurrence record to include both observation-specific data and taxonomic metadata.
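To make the dual reference system concrete, here is a minimal Python sketch of how a mapping value could be resolved. It is an assumption about the mechanism, not the transformer's actual code: `resolve_reference` is a hypothetical helper, and the two dictionaries stand in for one occurrence row and its taxon.

```python
def resolve_reference(value, source, taxon):
    """Resolve '@source.x' or '@taxon.a.b' strings; return other values unchanged."""
    if not isinstance(value, str) or not value.startswith("@"):
        return value  # constants such as "Plantae" pass through as-is
    root_name, _, path = value[1:].partition(".")
    current = {"source": source, "taxon": taxon}.get(root_name)
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None  # missing data becomes null in the output
        current = current[key]
    return current


# Hypothetical occurrence row and taxon record.
source = {"family": "Araucariaceae", "dbh": 45.2}
taxon = {"taxon_id": 1, "general_info": {"rank": {"value": "species"}}}

family = resolve_reference("@source.family", source, taxon)
rank = resolve_reference("@taxon.general_info.rank.value", source, taxon)
kingdom = resolve_reference("Plantae", source, taxon)
missing = resolve_reference("@source.elevation", source, taxon)
```

Dotted paths are walked one key at a time, so a missing intermediate key yields `None` rather than an exception.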
Generator Functions¶
Coordinate Processing¶
decimalLatitude:
  generator: format_coordinates
  params:
    source_field: "@source.geo_pt"
    type: "latitude"
Input: "POINT (165.7683 -21.6461)" (WKT text with longitude first, as returned by PostGIS)
Output: -21.6461 (decimal degrees)
The generator:
Parses POINT geometry strings
Validates coordinate ranges (lat: -90 to 90, lng: -180 to 180)
Returns null for invalid coordinates
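The three steps above can be sketched as follows. This is an illustrative reimplementation under the assumption that the input is a WKT `POINT` string with longitude first; the function name `format_coordinate` and the regular expression are not taken from Niamoto.

```python
import re

# Matches "POINT (x y)" with optional sign and decimals; x is longitude, y is latitude.
_POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def format_coordinate(wkt, kind):
    """Return the latitude or longitude from a WKT point, or None if invalid."""
    if not isinstance(wkt, str):
        return None
    match = _POINT_RE.match(wkt)
    if match is None:
        return None
    longitude, latitude = float(match.group(1)), float(match.group(2))
    if not (-180 <= longitude <= 180 and -90 <= latitude <= 90):
        return None  # out-of-range coordinates are rejected
    return latitude if kind == "latitude" else longitude


lat = format_coordinate("POINT (165.7683 -21.6461)", "latitude")
lon = format_coordinate("POINT (165.7683 -21.6461)", "longitude")
```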
Date Formatting¶
eventDate:
  generator: format_event_date
  params:
    source_field: "@source.month_obs"
Input: 3 (March)
Output: "2023-03" (ISO 8601 year-month format)
Input: null or missing
Output: null
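The example shows a month number becoming a year-month string, but this guide does not say where the year comes from. The sketch below therefore takes the year as an explicit, hypothetical argument; it illustrates the formatting rule only and is not the generator's real signature.

```python
def format_event_date(month, year):
    """Format a month number and a year as an ISO 8601 year-month string."""
    if month is None or year is None:
        return None
    try:
        month_number = int(month)
    except (TypeError, ValueError):
        return None
    if not 1 <= month_number <= 12:
        return None  # reject values that are not calendar months
    return f"{int(year):04d}-{month_number:02d}"


march = format_event_date(3, 2023)
unknown = format_event_date(None, 2023)
```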
Scientific Name Parsing¶
specificEpithet:
  generator: extract_specific_epithet
  params:
    source_field: "@source.taxonref"
Input: "Araucaria columnaris (G.Forst.) Hook."
Output: "columnaris"
The generator handles:
Binomial nomenclature parsing
Infraspecific epithets (subspecies, varieties)
Author string removal
Hybrid notation (×)
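A simplified Python sketch of this kind of parsing is shown below. It is an approximation written for illustration, not the plugin's parser: it assumes the genus is the first token, treats the next lowercase word as the specific epithet, and recognizes only the rank markers listed in `_RANK_MARKERS`. Names whose author string starts with a lowercase particle would need extra handling.

```python
import re

_EPITHET_RE = re.compile(r"^[a-z][a-z-]*$")  # lowercase word, hyphens allowed
_RANK_MARKERS = {"subsp.", "ssp.", "var."}   # assumed set of infraspecific markers


def _tokens(name):
    # Drop the hybrid sign so "Genus × epithet" parses like "Genus epithet".
    return (name or "").replace("×", " ").split()


def extract_specific_epithet(name):
    """Return the token after the genus if it looks like an epithet."""
    tokens = _tokens(name)
    if len(tokens) < 2:
        return None
    return tokens[1] if _EPITHET_RE.match(tokens[1]) else None


def extract_infraspecific_epithet(name):
    """Return the epithet following a rank marker such as 'subsp.' or 'var.'."""
    tokens = _tokens(name)
    for index, token in enumerate(tokens[:-1]):
        if token in _RANK_MARKERS and _EPITHET_RE.match(tokens[index + 1]):
            return tokens[index + 1]
    return None


epithet = extract_specific_epithet("Araucaria columnaris (G.Forst.) Hook.")
```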
Measurements Formatting¶
dynamicProperties:
  generator: format_measurements
  params:
    measurements:
      - field: "@source.dbh"
        name: "diameterAtBreastHeight"
        unit: "cm"
      - field: "@source.height"
        name: "height"
        unit: "m"
Output:
{
  "diameterAtBreastHeight": {"value": 45.2, "unit": "cm"},
  "height": {"value": 12.5, "unit": "m"}
}
Only measurements with non-null values are included.
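The behaviour can be approximated in a few lines of Python. This is a sketch under the assumption that each `field` reference points at a key of the occurrence row; the function below is a stand-in written for this guide and may differ from the real generator, for example in what it returns when every value is null.

```python
def format_measurements(source, measurements):
    """Build the dynamicProperties object, skipping null measurements."""
    result = {}
    for spec in measurements:
        field = spec["field"].split(".", 1)[-1]  # "@source.dbh" -> "dbh"
        value = source.get(field)
        if value is None:
            continue  # null measurements are left out entirely
        result[spec["name"]] = {"value": value, "unit": spec["unit"]}
    return result


specs = [
    {"field": "@source.dbh", "name": "diameterAtBreastHeight", "unit": "cm"},
    {"field": "@source.height", "name": "height", "unit": "m"},
]
properties = format_measurements({"dbh": 45.2, "height": None}, specs)
```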
Phenology Formatting¶
reproductiveCondition:
  generator: format_phenology
  params:
    flower_field: "@source.flower"
    fruit_field: "@source.fruit"
Input: flower: 1, fruit: 0
Output: "flowering"
Input: flower: 1, fruit: 1
Output: "flowering, fruiting"
Input: flower: 0, fruit: 0
Output: null
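The three input/output pairs above are consistent with the following Python sketch. It assumes the flags are stored as 0 or 1 and that a fruit-only record yields "fruiting", which the examples imply but do not show.

```python
def format_phenology(flower, fruit):
    """Combine flower and fruit flags into a reproductiveCondition string."""
    states = []
    if flower == 1:
        states.append("flowering")
    if fruit == 1:
        states.append("fruiting")
    # An empty list means neither flag is set, which maps to null.
    return ", ".join(states) or None


both = format_phenology(1, 1)
```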
Habitat Description¶
habitat:
  generator: format_habitat
  params:
    holdridge_field: "@source.holdridge"
    rainfall_field: "@source.rainfall"
    substrate_field: "@source.in_um"
    forest_field: "@source.in_forest"
Output: "Humid forest on ultramafic substrate, 2500mm annual rainfall"
Combines multiple environmental variables into a human-readable habitat description.
Generated Output Format¶
Individual Occurrence Record¶
{
  "type": "Occurrence",
  "language": "fr",
  "license": "CC-BY-4.0",
  "basisOfRecord": "HumanObservation",
  "occurrenceID": "niaocc_12345",
  "individualCount": "1",
  "occurrenceStatus": "present",
  "eventID": "niaevt_12345",
  "eventDate": "2023-03",
  "year": 2023,
  "month": 3,
  "country": "New Caledonia",
  "countryCode": "NC",
  "stateProvince": "Province Sud",
  "decimalLatitude": -21.6461,
  "decimalLongitude": 165.7683,
  "minimumElevationInMeters": 450,
  "maximumElevationInMeters": 450,
  "geodeticDatum": "WGS84",
  "taxonID": "1",
  "scientificName": "Araucaria columnaris (G.Forst.) Hook.",
  "kingdom": "Plantae",
  "family": "Araucariaceae",
  "genus": "Araucaria",
  "specificEpithet": "columnaris",
  "taxonRank": "species",
  "scientificNameAuthorship": "(G.Forst.) Hook.",
  "dynamicProperties": {
    "diameterAtBreastHeight": {"value": 45.2, "unit": "cm"},
    "height": {"value": 12.5, "unit": "m"},
    "woodDensity": {"value": 0.65, "unit": "g/cm³"}
  },
  "reproductiveCondition": "flowering",
  "habitat": "Humid forest on non-ultramafic substrate, 2500mm annual rainfall",
  "establishmentMeans": "native"
}
Taxon Index File¶
{
  "total": 850,
  "taxon": [
    {
      "id": 1,
      "scientificName": "Araucaria columnaris",
      "taxonRank": "species",
      "occurrences_count": 145,
      "file_path": "taxon/1_occurrences_dwc.json"
    },
    {
      "id": 2,
      "scientificName": "Agathis lanceolata",
      "taxonRank": "species",
      "occurrences_count": 78,
      "file_path": "taxon/2_occurrences_dwc.json"
    }
  ]
}
Data Quality & Validation¶
Automatic Data Cleaning¶
The Darwin Core transformer automatically:
Validates coordinates: Ensures lat/lng are within valid ranges
Filters empty occurrences: Skips taxa with no occurrence records
Handles missing data: Uses null values for missing Darwin Core terms
Standardizes formats: Converts dates, coordinates, and measurements to standard formats
Quality Checks¶
Before export, verify your data quality:
-- Count occurrences with missing coordinates
SELECT COUNT(*) FROM occurrences
WHERE geo_pt IS NULL OR geo_pt = '';
-- Check taxonomic coverage
SELECT COUNT(DISTINCT taxon_ref_id) FROM occurrences;
-- Check temporal coverage
SELECT MIN(month_obs), MAX(month_obs) FROM occurrences
WHERE month_obs IS NOT NULL;
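To try these checks without touching a real database, the queries can be run against a throwaway table. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in; the three sample rows and the reduced column set are invented for the demonstration, and your actual database engine and schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE occurrences "
    "(id INTEGER, taxon_ref_id INTEGER, geo_pt TEXT, month_obs INTEGER)"
)
# Invented sample rows: one complete, one without geometry, one with an empty geometry.
conn.executemany(
    "INSERT INTO occurrences VALUES (?, ?, ?, ?)",
    [
        (1, 1, "POINT (165.7683 -21.6461)", 3),
        (2, 1, None, 7),
        (3, 2, "", None),
    ],
)

missing_coordinates = conn.execute(
    "SELECT COUNT(*) FROM occurrences WHERE geo_pt IS NULL OR geo_pt = ''"
).fetchone()[0]
taxa_covered = conn.execute(
    "SELECT COUNT(DISTINCT taxon_ref_id) FROM occurrences"
).fetchone()[0]
first_month, last_month = conn.execute(
    "SELECT MIN(month_obs), MAX(month_obs) FROM occurrences "
    "WHERE month_obs IS NOT NULL"
).fetchone()

print(missing_coordinates, taxa_covered, first_month, last_month)
```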
Common Issues¶
Missing Coordinates:
{
  "decimalLatitude": null,
  "decimalLongitude": null
}
Solution: Ensure that the geo_pt field contains valid POINT geometry.
Invalid Dates:
{
  "eventDate": null,
  "year": null,
  "month": null
}
Solution: Check that the month_obs field contains valid month numbers (1-12).
Missing Taxonomic Information:
{
  "family": null,
  "genus": null
}
Solution: Verify that the taxonref field contains complete scientific names.
Darwin Core Standards Compliance¶
Required vs Optional Terms¶
Core Required Terms (always included):
type, basisOfRecord, occurrenceID, occurrenceStatus
eventID, country, countryCode
scientificName, kingdom
Recommended Terms (included when available):
decimalLatitude, decimalLongitude, geodeticDatum
eventDate, year, month
family, genus, specificEpithet, taxonRank
Extension Terms (domain-specific):
dynamicProperties - Measurements and morphological data
reproductiveCondition - Phenology information
habitat - Environmental context
establishmentMeans - Native/introduced status
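A quick way to confirm that exported records carry the terms this guide lists as always included is a small Python check. The helper `missing_required_terms` is hypothetical and the term list is copied from this section; it is a convenience sketch, not an official Darwin Core or GBIF validator.

```python
# Terms this guide lists as always included in the export.
REQUIRED_TERMS = [
    "type", "basisOfRecord", "occurrenceID", "occurrenceStatus",
    "eventID", "country", "countryCode",
    "scientificName", "kingdom",
]


def missing_required_terms(record):
    """Return the required terms that are absent, null, or empty in a record."""
    return [term for term in REQUIRED_TERMS if record.get(term) in (None, "")]


record = {
    "type": "Occurrence",
    "basisOfRecord": "HumanObservation",
    "occurrenceID": "niaocc_12345",
    "occurrenceStatus": "present",
    "eventID": "niaevt_12345",
    "country": "New Caledonia",
    "countryCode": "NC",
    "scientificName": "Araucaria columnaris (G.Forst.) Hook.",
    "kingdom": None,  # deliberately left null to show a failure
}
problems = missing_required_terms(record)
```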
GBIF Compatibility¶
The export is designed to be compatible with GBIF ingestion:
Occurrence Core: Follows GBIF occurrence schema
Measurement Extensions: Uses GBIF measurement vocabulary
Identification: Includes identification metadata
Data Quality: Validates required fields and formats
Publishing to GBIF¶
To publish your Darwin Core data to GBIF:
Register as data publisher: Create account at gbif.org
Create dataset: Register your dataset with metadata
Upload data: Use GBIF IPT or direct API upload
Validate: GBIF will validate your Darwin Core compliance
Publish: Data becomes available through GBIF network
Advanced Configuration¶
Custom Field Mapping¶
Add institution-specific fields:
mapping:
  # Standard Darwin Core terms
  scientificName: "@source.taxonref"
  # Custom extensions
  institutionCode: "NC-NIAMOTO"
  collectionCode: "FOREST-PLOTS"
  # Additional measurements
  dynamicProperties:
    generator: format_measurements
    params:
      measurements:
        - field: "@source.custom_trait"
          name: "customMeasurement"
          unit: "unit"
Multiple Export Formats¶
Export the same data in different Darwin Core formats:
exports:
  # JSON format (current)
  - name: dwc_occurrence_json
    exporter: json_api_exporter
    # ... configuration above
  # CSV format (future)
  - name: dwc_occurrence_csv
    exporter: csv_exporter
    transformer_plugin: niamoto_to_dwc_occurrence
    # ... CSV-specific configuration
Filtering Exports¶
Export specific subsets:
transformer_params:
  # Only export certain taxonomic ranks
  filters:
    rank: ["species", "subspecies"]
  # Only export occurrences with coordinates
  required_fields: ["geo_pt"]
  # Date range filtering
  date_range:
    start_year: 2010
    end_year: 2023
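The intent of these three settings can be expressed as a single predicate. The Python sketch below is an interpretation, not the transformer's code: `keep_occurrence` is a hypothetical function, and it assumes each record exposes `rank`, `year`, and the raw source fields as plain dictionary keys.

```python
def keep_occurrence(record, ranks=None, required_fields=(), date_range=None):
    """Return True if a record passes the rank, required-field, and year filters."""
    if ranks and record.get("rank") not in ranks:
        return False
    if any(record.get(field) in (None, "") for field in required_fields):
        return False
    if date_range is not None:
        year = record.get("year")
        if year is None:
            return False  # undated records cannot satisfy a date range
        if not date_range["start_year"] <= year <= date_range["end_year"]:
            return False
    return True


settings = {
    "ranks": ["species", "subspecies"],
    "required_fields": ["geo_pt"],
    "date_range": {"start_year": 2010, "end_year": 2023},
}
kept = keep_occurrence(
    {"rank": "species", "geo_pt": "POINT (165.7683 -21.6461)", "year": 2015},
    **settings,
)
```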
Performance Optimization¶
Large Dataset Handling¶
For large occurrence datasets:
params:
  # Enable parallel processing
  performance:
    parallel: true
    max_workers: 6
    batch_size: 100
  # Optimize JSON output
  json_options:
    minify: true
    exclude_null: true
    compress: true  # Generate .gz files
Memory Management¶
Monitor memory usage during export:
# Run export with memory monitoring
niamoto export --target dwc_occurrence_json --verbose
# Check generated file sizes
ls -lh exports/dwc/occurrence_json/
For very large exports, consider chunking by taxonomic groups or geographic regions.
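One way to plan such chunks is to group records by a shared attribute before exporting each group separately. The Python sketch below shows only the grouping step; `chunk_by` is a hypothetical helper, and the choice of `family` or `stateProvince` as the key is an example rather than a Niamoto setting.

```python
from collections import defaultdict


def chunk_by(records, key):
    """Group records into chunks that share the same value for `key`."""
    chunks = defaultdict(list)
    for record in records:
        # Records without the key are collected under "unknown".
        chunks[record.get(key) or "unknown"].append(record)
    return dict(chunks)


records = [
    {"occurrenceID": "niaocc_1", "family": "Araucariaceae", "stateProvince": "Province Sud"},
    {"occurrenceID": "niaocc_2", "family": "Araucariaceae", "stateProvince": "Province Nord"},
    {"occurrenceID": "niaocc_3", "family": None, "stateProvince": "Province Sud"},
]
by_family = chunk_by(records, "family")
by_region = chunk_by(records, "stateProvince")
```

Each chunk can then be exported or post-processed on its own, which keeps the number of records held in memory at once bounded by the largest group.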
Integration Examples¶
Research Applications¶
Ecological Modeling:
import requests
import pandas as pd
# Load occurrence data; fail early on HTTP errors or a hung connection
response = requests.get('http://your-site.com/api/taxon/1_occurrences_dwc.json', timeout=30)
response.raise_for_status()
occurrences = response.json()
# Convert to pandas DataFrame
df = pd.DataFrame(occurrences)
# Extract coordinates for species distribution modeling
coordinates = df[['decimalLatitude', 'decimalLongitude']].dropna()
Biodiversity Analysis:
library(jsonlite)
library(dplyr)
# Load all taxon data
index <- fromJSON("http://your-site.com/api/taxon_index.json")
# Get species with most occurrences
top_species <- index$taxon %>%
filter(taxonRank == "species") %>%
arrange(desc(occurrences_count)) %>%
head(10)
Data Aggregation Platforms¶
The Darwin Core export integrates seamlessly with:
GBIF: Global biodiversity data network
iDigBio: Integrated Digitized Biocollections
VertNet: Vertebrate specimen networks
Regional nodes: National and regional biodiversity portals
Troubleshooting¶
Common Export Issues¶
No files generated:
0 files generated for dwc_occurrence_json
Check that taxa have associated occurrences
Verify database relationships (taxon_id → taxon_ref_id → occurrences)
Check transformer configuration syntax
Malformed JSON output:
JSON decode error in generated files
Validate generator function outputs
Check for circular references in data
Verify field mapping syntax
Performance issues:
Export taking too long
Enable parallel processing
Add database indexes on join columns
Consider chunking large exports
Validation Tools¶
Validate generated Darwin Core data:
# Check JSON syntax
python -m json.tool taxon/1_occurrences_dwc.json
# Validate Darwin Core compliance (external tool)
dwca-validator validate-json taxon/1_occurrences_dwc.json