Current Training Sources¶
Status: Active
Audience: Team, AI agents
Purpose: Reference list of the datasets currently wired into ml/scripts/data/build_gold_set.py
This document lists the sources that are actually used by the code when the gold set is rebuilt.
Source of truth:
ml/scripts/data/build_gold_set.py
If this document and the code disagree, the code wins.
How to read this page¶
Each source below is currently registered in SOURCES in
build_gold_set.py. Some sources are active training sources, while a small
subset is explicitly excluded from training by name.
Explicitly excluded from training¶
These sources are still known to the codebase but currently excluded from training:
fia_or_tree
fia_or_plot
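As an illustration of exclusion by name, the sketch below filters a registry against a set of excluded names. It is a minimal sketch, not the code in build_gold_set.py: the shape of SOURCES, the "family" field, and the names EXCLUDED_FROM_TRAINING and training_sources are all assumptions made for this example.

```python
# Hypothetical stand-in for the SOURCES registry. The real entries and
# their fields live in build_gold_set.py and may look different.
SOURCES = {
    "fia_tree": {"family": "fia"},
    "fia_plot": {"family": "fia"},
    "fia_or_tree": {"family": "fia"},
    "fia_or_plot": {"family": "fia"},
}

# Names excluded from training, as listed on this page.
# The constant name itself is an assumption.
EXCLUDED_FROM_TRAINING = frozenset({"fia_or_tree", "fia_or_plot"})


def training_sources(sources, excluded=EXCLUDED_FROM_TRAINING):
    """Return the registered sources that are not excluded by name."""
    return {name: cfg for name, cfg in sources.items() if name not in excluded}


active = training_sources(SOURCES)
print(sorted(active))
```

With this approach an excluded source stays registered, so it remains visible to the rest of the pipeline and can be re-enabled by removing its name from the exclusion set.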
Current source families¶
Product and instance-close datasets¶
guyadiv_trees
guyadiv_plots
forestscan_paracou_census
afrique_occ
afrique_plots
nc_occ
nc_plots
nc_full_occ
nc_full_plots
Test and fixture datasets used by the build pipeline¶
gbif_marine
gbif_terrestrial
custom_forest
checklist
minimal
adversarial
IFN France¶
ifn_arbre
ifn_placette
ifn_ecologie
ifn_flore
ifn_couvert
ifn_bois_mort
ifn_habitat
Standards-based acquisitions¶
taxref_v18
ets_occurrence_ext
ets_taxon_ext
ets_measurement_ext
splot_header
splot_dt
splot_cwm
splot_metadata
FIA and inventory-style sources¶
fia_tree
fia_plot
fia_fl_tree
fia_fl_plot
fia_or_tree (excluded from training)
fia_or_plot (excluded from training)
finland_trees
finland_plots
iefc_catalonia
berenty_madagascar
afliber_species
Pasoh and research-field datasets¶
pasoh_crown
pasoh_leaf
pasoh_wood
Broad GBIF corpus¶
gbif_spain_ifn3
gbif_france_ifn
gbif_sweden_nfi
gbif_norway_nfi
gbif_benin_lama
gbif_benin_wari_maro
gbif_benin_socioeco
gbif_tanzania_miombo
gbif_madagascar_grasses
gbif_uganda_savanna
gbif_norway_veg
gbif_wales_woodland
gbif_poland_botanical
gbif_berlin_botanical
gbif_us_desert_herb
gbif_canada_herbarium
gbif_japan_plants
gbif_fr_traits
gbif_ethiopia_kafa
gbif_colombia_wetland
gbif_brazil_forest
gbif_argentina_protected
gbif_mexico_flora
gbif_paramo_colombia
gbif_china_herbarium
gbif_china_south
gbif_philippines_samar
gbif_india_sundarbans
gbif_thailand_atlas
gbif_australia_carnarvon
gbif_nz_pdd
gbif_austria_herbarium
gbif_bulgaria_herbolario
gbif_kenya_mangrove
Targeted regional and institutional GBIF batches¶
gbif_targeted_new_caledonia
gbif_targeted_guyane
gbif_targeted_gabon
gbif_targeted_cameroon
gbif_targeted_institutional_gabon
gbif_targeted_institutional_cameroon
Zenodo and research-oriented datasets¶
zenodo_bci_allometry
zenodo_bci_traits
zenodo_california_ferp
zenodo_china_census
zenodo_china_soil
zenodo_forest_inventory_pub
zenodo_leaf_traits
zenodo_savanna_roots
Notes¶
This page is intentionally grouped by source family, not by exact file path.
The exact paths, labels, and sampling rules remain in build_gold_set.py.
If a new source is integrated in code, update this page in the same change.
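One way to keep this page and the code in step is to compare the names listed here with the names registered in the code. The sketch below shows that comparison on sample data. The function name diff_sources and the way the two name lists are obtained are assumptions for illustration; nothing like it is confirmed to exist in the repository.

```python
def diff_sources(documented, registered):
    """Compare the names listed on this page with the names registered in code.

    Both arguments are iterables of source names. How they are collected
    (parsing this page, importing SOURCES) is left out of this sketch.
    """
    documented, registered = set(documented), set(registered)
    return {
        # Registered in code but not listed on the page: the page is stale.
        "missing_from_page": sorted(registered - documented),
        # Listed on the page but not registered: the code wins, so fix the page.
        "missing_from_code": sorted(documented - registered),
    }


# Sample data: the page lists three Pasoh sources, the code registers two
# of them plus a made-up name ("pasoh_root") that stands in for a new source.
page_names = ["pasoh_crown", "pasoh_leaf", "pasoh_wood"]
code_names = ["pasoh_crown", "pasoh_leaf", "pasoh_root"]

report = diff_sources(page_names, code_names)
print(report)
```

Both lists in the report point at this page as the thing to update, since the code is the source of truth.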