# Anonymous Holdout Rework (2026-03-21)

**Status:** Experiment · **Audience:** Team, AI agents · **Purpose:** Replace the old synthetic anonymous holdout with a real, diverse, anonymized benchmark
## Context
The previous anonymous holdout was synthetic, overly easy, and permanently inflated the benchmark. This rework replaced it with a holdout built from real gold-set columns anonymized with generic names.
## Problem

The old anonymous holdout was not testing anything useful:

- 86 fully synthetic entries
- 76% concentrated on only 2 concepts
- only 10 concepts covered out of ~45 coarse concepts at the time
- the values branch could score highly by predicting generic continuous-float behavior
- 10% of `ProductScore` was therefore being inflated by a weak benchmark
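Concentration and coverage of a holdout are easy to measure directly. The sketch below uses a hypothetical helper and toy data (not code from the repo) to show the two diagnostics quoted above:

```python
from collections import Counter

def holdout_diversity(concepts: list[str], total_concepts: int) -> dict:
    """Summarize how concentrated a holdout is across coarse concepts."""
    counts = Counter(concepts)
    top2 = sum(n for _, n in counts.most_common(2))
    return {
        "entries": len(concepts),
        "concepts_covered": len(counts),
        "coverage": len(counts) / total_concepts,
        "top2_share": top2 / len(concepts),
    }

# Toy holdout that, like the old one, piles most entries onto two concepts.
stats = holdout_diversity(
    ["latitude"] * 40 + ["longitude"] * 25 + ["depth"] * 3 + ["notes"] * 2,
    total_concepts=45,
)
```

On the real data this kind of check is what surfaces the 76% top-2 share and 10/45 coverage.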
## Implemented solution

### New `_build_anonymous_holdout()`

In `ml/scripts/data/build_gold_set.py`:

- stratified sampling of real columns by `concept_coarse`
- minimum of 2 entries per concept
- fixed seed `42`
- generic randomized names from a large anonymous name pool
- dedicated quality tag: `gold_anonymous`
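The bullet points above translate into roughly the following shape. This is a minimal sketch: the signature, field names, and toy name pool are assumptions, not the actual `_build_anonymous_holdout()` implementation.

```python
import random

def build_anonymous_holdout(columns, name_pool, per_concept=2, seed=42):
    """Sketch: stratified sample of real columns by coarse concept,
    with a per-concept floor, renamed from a generic name pool."""
    rng = random.Random(seed)  # fixed seed -> reproducible holdout
    by_concept: dict[str, list[dict]] = {}
    for col in columns:
        by_concept.setdefault(col["concept_coarse"], []).append(col)

    n_entries = sum(min(per_concept, len(v)) for v in by_concept.values())
    names = rng.sample(name_pool, n_entries)  # unique generic names

    holdout, i = [], 0
    for concept in sorted(by_concept):
        cols = by_concept[concept]
        for col in rng.sample(cols, min(per_concept, len(cols))):
            holdout.append(dict(col, header=names[i], quality="gold_anonymous"))
            i += 1
    return holdout

# Toy usage: 3 concepts x 5 real columns, sampled down to 2 per concept.
cols = [{"concept_coarse": c, "header": f"{c}_{j}"}
        for c in ("latitude", "longitude", "depth") for j in range(5)]
holdout = build_anonymous_holdout(cols, [f"col_{k}" for k in range(100)])
```

Sampling without replacement from the name pool guarantees that every anonymized header is unique, which is what drives the "unique names" row in the results table.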
### Removal of the old mechanism

- removed `HEADER_VARIANTS["anonymous"]`
- removed the legacy synthetic anonymous generation branches
### Values-only anonymous metric

In `ml/scripts/eval/evaluate.py`:

- `_evaluate_holdout_score()` now supports `return_models=True`
- the anonymous block exposes an informative values-only metric
- the `anonymous` bucket in `ProductScore` still uses the full fusion pipeline
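The values-only metric amounts to scoring the holdout with headers ignored. A sketch of the idea, where the predictor and field names are placeholders for the actual values branch, not the `evaluate.py` code:

```python
def values_only_accuracy(entries, predict_from_values):
    """Sketch: score anonymous entries using only their values,
    since the anonymized headers carry no signal."""
    correct = sum(
        1 for e in entries
        if predict_from_values(e["values"]) == e["concept_coarse"]
    )
    return correct, len(entries), correct / len(entries)

# Toy predictor standing in for the values branch.
entries = [
    {"values": [12.5, 48.1], "concept_coarse": "latitude"},
    {"values": ["a", "b"], "concept_coarse": "notes"},
]

def toy_predictor(values):
    return "latitude" if all(isinstance(v, float) for v in values) else "notes"

result = values_only_accuracy(entries, toy_predictor)  # (2, 2, 1.0)
```

Reporting the raw `(correct, total)` pair alongside the ratio is what makes the "94.3% (115/122)" line in the results table readable at a glance.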
## Results

### New holdout distribution

| Metric | Before | After |
|---|---|---|
| Anonymous entries | 86 | 122 |
| Concepts covered | 10 | 61 |
| Quality | fully synthetic | real, anonymized |
| Distribution | 76% on 2 concepts | 2 per concept (uniform floor) |
| Unique names | 31 | 122 |
### Score changes

| Metric | Before | After | Delta |
|---|---|---|---|
| ProductScore | 81.82 | 81.32 | -0.50 |
| GlobalScore | 80.79 | 81.72 | +0.93 |
| Anonymous holdout | 100.0 | 91.02 | -8.98 |
| Anonymous values-only | n/a | 94.3% (115/122) | — |
Detailed ProductScore buckets:

| Bucket | Before | After | Delta |
|---|---|---|---|
| tropical_field (30%) | 69.01 | 68.71 | -0.30 |
| research_traits (15%) | 75.49 | 77.08 | +1.59 |
| gbif_core_standard (20%) | 96.02 | 97.36 | +1.34 |
| gbif_extended (10%) | 88.18 | 88.22 | +0.04 |
| en_field (15%) | 78.46 | 78.30 | -0.16 |
| anonymous (10%) | 100.0 | 91.02 | -8.98 |
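The headline numbers are consistent with this table, assuming ProductScore is simply the weighted sum of the bucket scores with the listed weights (a sanity check, not the evaluation code):

```python
# (weight, before, after) per bucket, copied from the table above.
buckets = {
    "tropical_field":     (0.30, 69.01, 68.71),
    "research_traits":    (0.15, 75.49, 77.08),
    "gbif_core_standard": (0.20, 96.02, 97.36),
    "gbif_extended":      (0.10, 88.18, 88.22),
    "en_field":           (0.15, 78.46, 78.30),
    "anonymous":          (0.10, 100.0, 91.02),
}

before = sum(w * b for w, b, _ in buckets.values())
after = sum(w * a for w, _, a in buckets.values())
# before ~= 81.82, after ~= 81.32, delta ~= -0.50, matching the score table
```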
## Interpretation

- The anonymous holdout became informative instead of artificially perfect.
- The values branch alone was still strong on anonymized real columns.
- The small `ProductScore` drop came entirely from replacing an inflated 100% bucket with a more realistic score.
- The other buckets remained stable within normal run-to-run variance.
## Bug fixed during the session

`gold_anonymous` entries were initially treated as `real_records` because they were not synthetic. That polluted other holdouts and the primary protocol. The fix explicitly excluded `is_anonymous` entries from both `real_records` and `synthetic_records` inside the protocol evaluation.
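The fix reduces to making the anonymous flag a hard exclusion when splitting records. A sketch, where the flag and helper names are illustrative rather than the actual `evaluate.py` code:

```python
def split_records(entries):
    """Sketch of the exclusion fix: anonymous entries belong to
    neither the real nor the synthetic pool during protocol evaluation."""
    real, synthetic = [], []
    for e in entries:
        if e.get("is_anonymous"):
            continue  # gold_anonymous entries: excluded from both pools
        (synthetic if e.get("is_synthetic") else real).append(e)
    return real, synthetic

entries = [
    {"id": 1},                        # real record
    {"id": 2, "is_synthetic": True},  # synthetic record
    {"id": 3, "is_anonymous": True},  # anonymous: lands in neither pool
]
real, synthetic = split_records(entries)  # -> 1 real, 1 synthetic
```

Checking the anonymous flag first is what prevents the original bug: "not synthetic" must no longer imply "real".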
## Modified files

| File | Change |
|---|---|
| `ml/scripts/data/build_gold_set.py` | new anonymous holdout builder, old synthetic mechanism removed |
| `ml/scripts/eval/evaluate.py` | values-only anonymous metric, protocol exclusion fix |
| | regenerated |
## Reproduction commands

```bash
uv run python -m ml.scripts.data.build_gold_set
uv run python -m ml.scripts.eval.evaluate --model all --metric product-score
```