# Current Training Sources > Status: Active > Audience: Team, AI agents > Purpose: Reference list of the datasets currently wired into > `ml/scripts/data/build_gold_set.py` This document lists the sources that are **actually used by the code** when the gold set is rebuilt. Source of truth: - `ml/scripts/data/build_gold_set.py` If this document and the code disagree, the code wins. ## How to read this page Each source below is currently registered in `SOURCES` in `build_gold_set.py`. Some sources are active training sources, while a small subset is explicitly excluded from training by name. ## Explicitly excluded from training These sources are still known to the codebase but currently excluded from training: - `fia_or_tree` - `fia_or_plot` ## Current source families ### Product and instance-close datasets - `guyadiv_trees` - `guyadiv_plots` - `forestscan_paracou_census` - `afrique_occ` - `afrique_plots` - `nc_occ` - `nc_plots` - `nc_full_occ` - `nc_full_plots` ### Test and fixture datasets used by the build pipeline - `gbif_marine` - `gbif_terrestrial` - `custom_forest` - `checklist` - `minimal` - `adversarial` ### IFN France - `ifn_arbre` - `ifn_placette` - `ifn_ecologie` - `ifn_flore` - `ifn_couvert` - `ifn_bois_mort` - `ifn_habitat` ### Standards-based acquisitions - `taxref_v18` - `ets_occurrence_ext` - `ets_taxon_ext` - `ets_measurement_ext` - `splot_header` - `splot_dt` - `splot_cwm` - `splot_metadata` ### FIA and inventory-style sources - `fia_tree` - `fia_plot` - `fia_fl_tree` - `fia_fl_plot` - `fia_or_tree` (excluded from training) - `fia_or_plot` (excluded from training) - `finland_trees` - `finland_plots` - `iefc_catalonia` - `berenty_madagascar` - `afliber_species` ### Pasoh and research-field datasets - `pasoh_crown` - `pasoh_leaf` - `pasoh_wood` ### Broad GBIF corpus - `gbif_spain_ifn3` - `gbif_france_ifn` - `gbif_sweden_nfi` - `gbif_norway_nfi` - `gbif_benin_lama` - `gbif_benin_wari_maro` - `gbif_benin_socioeco` - `gbif_tanzania_miombo` - `gbif_madagascar_grasses` - `gbif_uganda_savanna` - `gbif_norway_veg` - `gbif_wales_woodland` - `gbif_poland_botanical` - `gbif_berlin_botanical` - `gbif_us_desert_herb` - `gbif_canada_herbarium` - `gbif_japan_plants` - `gbif_fr_traits` - `gbif_ethiopia_kafa` - `gbif_colombia_wetland` - `gbif_brazil_forest` - `gbif_argentina_protected` - `gbif_mexico_flora` - `gbif_paramo_colombia` - `gbif_china_herbarium` - `gbif_china_south` - `gbif_philippines_samar` - `gbif_india_sundarbans` - `gbif_thailand_atlas` - `gbif_australia_carnarvon` - `gbif_nz_pdd` - `gbif_austria_herbarium` - `gbif_bulgaria_herbolario` - `gbif_kenya_mangrove` ### Targeted regional and institutional GBIF batches - `gbif_targeted_new_caledonia` - `gbif_targeted_guyane` - `gbif_targeted_gabon` - `gbif_targeted_cameroon` - `gbif_targeted_institutional_gabon` - `gbif_targeted_institutional_cameroon` ### Zenodo and research-oriented datasets - `zenodo_bci_allometry` - `zenodo_bci_traits` - `zenodo_california_ferp` - `zenodo_china_census` - `zenodo_china_soil` - `zenodo_forest_inventory_pub` - `zenodo_leaf_traits` - `zenodo_savanna_roots` ## Notes - This page is intentionally grouped by source family, not by exact file path. - The exact paths, labels, and sampling rules remain in `build_gold_set.py`. - If a new source is integrated in code, update this page in the same change.