Anonymous Holdout Rework (2026-03-21)

Status: Experiment
Audience: Team, AI agents
Purpose: Replace the old synthetic anonymous holdout with a real, diverse, anonymized benchmark

Context

The previous anonymous holdout was synthetic, overly easy, and permanently inflated the benchmark. This rework replaced it with a holdout built from real gold-set columns anonymized with generic names.

Problem

The old anonymous holdout was not testing anything useful:

  • 86 fully synthetic entries

  • 76% concentrated on only 2 concepts

  • only 10 concepts covered out of ~45 coarse concepts at the time

  • the values branch could score highly simply by predicting generic continuous-float behavior

  • 10% of ProductScore was therefore being inflated by a weak benchmark

Implemented solution

New _build_anonymous_holdout()

In ml/scripts/data/build_gold_set.py:

  • stratified sampling of real columns by concept_coarse

  • minimum of 2 entries per concept

  • fixed seed 42

  • generic randomized names from a large anonymous name pool

  • dedicated quality tag: gold_anonymous
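The steps above can be sketched as follows. This is a minimal illustration, not the actual `_build_anonymous_holdout()`: the column schema (`concept_coarse`, `values` keys) and the `ANON_NAME_POOL` constant are hypothetical stand-ins for whatever build_gold_set.py really uses.

```python
import random
from collections import defaultdict

SEED = 42
MIN_PER_CONCEPT = 2
# Hypothetical stand-in for the project's large anonymous name pool.
ANON_NAME_POOL = [f"col_{i:03d}" for i in range(500)]

def build_anonymous_holdout(columns, min_per_concept=MIN_PER_CONCEPT, seed=SEED):
    """Stratified sample of real gold-set columns by concept_coarse.

    Assumes `columns` is a list of dicts with "concept_coarse" and
    "values" keys; the real schema in build_gold_set.py may differ.
    """
    rng = random.Random(seed)  # fixed seed 42 for reproducibility
    by_concept = defaultdict(list)
    for col in columns:
        by_concept[col["concept_coarse"]].append(col)

    names = list(ANON_NAME_POOL)
    rng.shuffle(names)  # generic randomized names, each used once

    holdout = []
    for concept in sorted(by_concept):  # deterministic iteration order
        cols = by_concept[concept]
        # uniform floor: at least min_per_concept entries per concept
        for col in rng.sample(cols, min(min_per_concept, len(cols))):
            holdout.append({
                "name": names.pop(),           # anonymized header
                "values": col["values"],       # real values, untouched
                "concept_coarse": concept,
                "quality": "gold_anonymous",   # dedicated quality tag
            })
    return holdout
```

With 61 concepts and a floor of 2, this yields the 122 entries and 122 unique names reported below.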

Removal of the old mechanism

  • removed HEADER_VARIANTS["anonymous"]

  • removed the legacy synthetic anonymous generation branches

Values-only anonymous metric

In ml/scripts/eval/evaluate.py:

  • _evaluate_holdout_score() now supports return_models=True

  • the anonymous block exposes an informative values-only metric

  • the anonymous bucket in ProductScore still uses the full fusion pipeline
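The values-only metric can be sketched like this. The `values_model` object and its `predict(values) -> concept` interface are assumptions for illustration; the real code obtains the trained models via `return_models=True` and its interfaces may differ.

```python
def values_only_accuracy(entries, values_model):
    """Fraction of anonymous entries whose concept the values branch
    alone recovers. Headers are deliberately ignored: with generic
    anonymized names only the values carry signal, so this isolates
    the values branch from the full fusion pipeline.
    """
    if not entries:
        return 0.0
    correct = sum(
        1 for e in entries
        if values_model.predict(e["values"]) == e["concept_coarse"]
    )
    return correct / len(entries)
```

On the new holdout this metric is reported as 94.3% (115/122); it is informative only and does not feed into ProductScore, which still scores the anonymous bucket through the full fusion pipeline.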

Results

New holdout distribution

| Metric            | Before            | After                         |
|-------------------|-------------------|-------------------------------|
| Anonymous entries | 86                | 122                           |
| Concepts covered  | 10                | 61                            |
| Quality           | synthetic         | gold_anonymous                |
| Distribution      | 76% on 2 concepts | 2 per concept (uniform floor) |
| Unique names      | 31                | 122                           |

Score changes

| Metric                | Before | After           | Delta |
|-----------------------|--------|-----------------|-------|
| ProductScore          | 81.82  | 81.32           | -0.50 |
| GlobalScore           | 80.79  | 81.72           | +0.93 |
| Anonymous holdout     | 100.0  | 91.02           | -8.98 |
| Anonymous values-only | n/a    | 94.3% (115/122) | n/a   |

Detailed ProductScore buckets:

| Bucket                   | Before | After | Delta |
|--------------------------|--------|-------|-------|
| tropical_field (30%)     | 69.01  | 68.71 | -0.30 |
| research_traits (15%)    | 75.49  | 77.08 | +1.59 |
| gbif_core_standard (20%) | 96.02  | 97.36 | +1.34 |
| gbif_extended (10%)      | 88.18  | 88.22 | +0.04 |
| en_field (15%)           | 78.46  | 78.30 | -0.16 |
| anonymous (10%)          | 100.0  | 91.02 | -8.98 |

Interpretation

  • The anonymous holdout became informative instead of artificially perfect.

  • The values branch alone was still strong on anonymized real columns.

  • The small ProductScore drop came entirely from replacing an inflated 100% bucket with a more realistic score.

  • The other buckets remained stable within normal run-to-run variance.

Bug fixed during the session

gold_anonymous entries were initially classified as real_records because they were not flagged as synthetic. That polluted the other holdouts and the primary protocol with anonymized duplicates. The fix explicitly excludes is_anonymous entries from both real_records and synthetic_records inside the protocol evaluation.
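The fix amounts to a three-way split instead of a two-way one. The function name and the `quality` field below are illustrative; the actual protocol code in evaluate.py keys on its own `is_anonymous` flag.

```python
def split_records(entries):
    """Sketch of the protocol-split fix (illustrative schema).

    gold_anonymous entries are neither real nor synthetic for protocol
    purposes: they hold real values under fake headers, so counting
    them as real_records polluted the other holdouts.
    """
    real_records, synthetic_records, anonymous_records = [], [], []
    for entry in entries:
        if entry.get("quality") == "gold_anonymous":   # is_anonymous
            anonymous_records.append(entry)            # excluded from both
        elif entry.get("quality") == "synthetic":
            synthetic_records.append(entry)
        else:
            real_records.append(entry)
    return real_records, synthetic_records, anonymous_records
```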

Modified files

| File                              | Change                                                         |
|-----------------------------------|----------------------------------------------------------------|
| ml/scripts/data/build_gold_set.py | new anonymous holdout builder, old synthetic mechanism removed |
| ml/scripts/eval/evaluate.py       | values-only anonymous metric, protocol exclusion fix           |
| ml/data/gold_set.json             | regenerated                                                    |

Reproduction commands

```shell
uv run python -m ml.scripts.data.build_gold_set
uv run python -m ml.scripts.eval.evaluate --model all --metric product-score
```