# Acquisition History and Source Strategy

> Status: Archive
> Audience: Team, AI agents
> Purpose: Preserve the earlier acquisition strategy and candidate-source
> reasoning that informed the current ML dataset expansion work
> Canonical current plans: `docs/plans/`

## Why this document is archived

This document merges two earlier files:

- the historical acquisition plan
- the earlier candidate-source shortlist

It is kept for context because it explains the original reasoning behind:

- targeted regional GBIF batches
- tropical field priorities
- the distinction between product-close data and broader robustness data
- the desired long-term shape of `ml/data/silver/`

For current planning decisions, use `docs/plans/`.

For the source list actually used by the code today, use [../current-training-sources.md](../current-training-sources.md).

## Historical acquisition principles

The original acquisition logic was:

- do **not** optimize for raw volume
- optimize for **benchmark ROI**
- prioritize data that resembles Niamoto’s real product targets

At that stage, the preferred order was:

1. real tested instance data
2. tropical field datasets
3. targeted regional GBIF
4. broader neighboring datasets for robustness

## Historical target profile

The main target families were:

- New Caledonia
- Gabon / Cameroon
- French Guiana / tropical field datasets
- datasets from actually tested instances
- useful GBIF corpora, but not GBIF volume for its own sake

Consequence:

- `forest_inventory` should remain a guardrail
- but it should not drive the acquisition roadmap by itself

## Historical storage direction

The desired direction for `ml/data/silver/` was to move away from a flat directory and toward provenance-oriented grouping such as:

```text
ml/data/silver/
  instances/
  guyane/
  africa_tropical/
  gbif_targeted/
```

That principle still makes sense conceptually, even if the actual storage evolved incrementally.

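As an illustration only, the provenance-oriented grouping described above could be expressed as a small path helper. This is a minimal sketch, not code from the repository: the `silver_path` function and the `PROVENANCE_GROUPS` constant are hypothetical names, and the group list simply mirrors the example layout in this document rather than the storage actually in use.

```python
from pathlib import PurePosixPath

# Root directory named in this document.
SILVER_ROOT = PurePosixPath("ml/data/silver")

# Assumption: these groups mirror the example layout above,
# not necessarily the directories that exist today.
PROVENANCE_GROUPS = ("instances", "guyane", "africa_tropical", "gbif_targeted")


def silver_path(group: str, filename: str) -> PurePosixPath:
    """Return the provenance-grouped location for a dataset file (hypothetical helper)."""
    if group not in PROVENANCE_GROUPS:
        # Reject anything that would recreate a flat or ad hoc layout.
        raise ValueError(f"unknown provenance group: {group!r}")
    return SILVER_ROOT / group / filename
```

For example, `silver_path("guyane", "example.csv")` resolves to `ml/data/silver/guyane/example.csv` (the file name is a placeholder), while an unlisted group raises `ValueError` instead of silently creating a new top-level directory.
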
## Historical source prioritization

### Priority A — very close to the product

- real datasets from tested instances
- tropical forest datasets from Guyane, Gabon, Cameroon, New Caledonia
- targeted GBIF exports by region and style

### Priority B — useful neighboring datasets

- large tropical forest networks
- vegetation plot networks
- African and pan-tropical occurrence or plot databases

### Priority C — controlled expansion

- plant trait datasets
- ecologically more distant but still compatible datasets
- broader robustness sources

## Examples from the original shortlist

The original shortlist explicitly highlighted:

- tested instance datasets
- Paracou / ForestScan
- Guyafor network datasets such as Trinité and Trésor
- ForestPlots.net / Lopé
- RAINBIO
- targeted GBIF regional downloads
- targeted institutional GBIF subsets

Some of these became concrete acquisition work; others remained strategic options.

## What this archive is still useful for

Keep this document when you need:

- the historical why behind acquisition choices
- the original source-selection logic
- the reasoning that separated product-close datasets from broader robustness data

Do not use this document as the source of truth for:

- current plans
- the exact source list used by training
- the exact current benchmark setup