Acquisition History and Source Strategy
Status: Archive
Audience: Team, AI agents
Purpose: Preserve the earlier acquisition strategy and candidate-source reasoning that informed the current ML dataset expansion work
Canonical current plans: docs/plans/
Why this document is archived
This document merges two earlier files:
the historical acquisition plan
the earlier candidate-source shortlist
It is kept for context because it explains the original reasoning behind:
targeted regional GBIF batches
tropical field priorities
the distinction between product-close data and broader robustness data
the desired long-term shape of ml/data/silver/
For current planning decisions, use docs/plans/. For the source list actually
used by the code today, use ../current-training-sources.md.
Historical acquisition principles
The original acquisition logic was:
do not optimize for raw volume
optimize for benchmark ROI
prioritize data that resembles Niamoto’s real product targets
At that stage, the preferred order was:
real tested instance data
tropical field datasets
targeted regional GBIF
broader neighboring datasets for robustness
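As a rough illustration of that ordering, the sketch below ranks candidate sources by category alone and deliberately ignores raw volume. The category keys, the `Candidate` class, and the row counts are all hypothetical names and made-up values chosen for this example; none of them come from the project code.

```python
from dataclasses import dataclass

# Source categories in the historical order of preference (most preferred first).
# The key names are illustrative assumptions, not identifiers from the codebase.
PREFERRED_ORDER = [
    "tested_instance",
    "tropical_field",
    "targeted_regional_gbif",
    "neighboring_robustness",
]


@dataclass(frozen=True)
class Candidate:
    name: str
    category: str
    n_rows: int  # made-up value; recorded but deliberately not used for ranking


def rank_candidates(candidates):
    """Sort by category preference only; volume never influences the order."""
    return sorted(candidates, key=lambda c: PREFERRED_ORDER.index(c.category))


candidates = [
    Candidate("regional GBIF export", "targeted_regional_gbif", 500_000),
    Candidate("tested instance export", "tested_instance", 2_000),
    Candidate("field plot dataset", "tropical_field", 40_000),
]

for candidate in rank_candidates(candidates):
    print(candidate.category, "-", candidate.name)
```

The smallest dataset ranks first here because it belongs to the most product-close category, which is the point of not optimizing for raw volume.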
Historical target profile
The main target families were:
New Caledonia
Gabon / Cameroon
French Guiana / tropical field datasets
datasets from actually tested instances
useful GBIF corpora, but not GBIF volume for its own sake
Consequence:
forest_inventory should remain a guardrail, but it should not drive the acquisition roadmap by itself
Historical storage direction
The desired direction for ml/data/silver/ was to move away from a flat
directory and toward provenance-oriented grouping such as:
ml/data/silver/
    instances/
    guyane/
    africa_tropical/
    gbif_targeted/
That principle still makes sense conceptually, even if the actual storage evolved incrementally.
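A minimal sketch of what provenance-oriented grouping could look like in code, assuming the four group names from the directory layout above. The helper `silver_path` and the file name `paracou_plots.csv` are hypothetical and only illustrate the idea; the actual storage code may be organized differently.

```python
from pathlib import PurePosixPath

# Root and group names taken from the directory layout described above.
SILVER_ROOT = PurePosixPath("ml/data/silver")
PROVENANCE_GROUPS = {"instances", "guyane", "africa_tropical", "gbif_targeted"}


def silver_path(group: str, dataset_name: str) -> PurePosixPath:
    """Return the provenance-grouped location for a dataset file.

    Hypothetical helper: rejects any group that is not a known provenance
    bucket, so files cannot silently fall back into a flat directory.
    """
    if group not in PROVENANCE_GROUPS:
        raise ValueError(f"unknown provenance group: {group!r}")
    return SILVER_ROOT / group / dataset_name


# Illustrative file name, not a real project file.
print(silver_path("guyane", "paracou_plots.csv"))
```

Validating the group name up front keeps the provenance of every file readable from its path.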
Historical source prioritization
Priority A: very close to the product
real datasets from tested instances
tropical forest datasets from French Guiana (Guyane), Gabon, Cameroon, and New Caledonia
targeted GBIF exports by region and style
Priority B: useful neighboring datasets
large tropical forest networks
vegetation plot networks
African and pan-tropical occurrence or plot databases
Priority C: controlled expansion
plant trait datasets
ecologically more distant but still compatible datasets
broader robustness sources
Examples from the original shortlist
The original shortlist explicitly highlighted:
tested instance datasets
Paracou / ForestScan
Guyafor network datasets such as Trinité and Trésor
ForestPlots.net / Lopé
RAINBIO
targeted GBIF regional downloads
targeted institutional GBIF subsets
Some of these became concrete acquisition work; others remained strategic options.
What this archive is still useful for
Keep this document when you need:
the historical "why" behind acquisition choices
the original source-selection logic
the reasoning that separated product-close datasets from broader robustness data
Do not use this document as the source of truth for:
current plans
the exact source list used by training
the exact current benchmark setup