Anonymous Holdout Rework (2026-03-21)

Status: Experiment
Audience: Team, AI agents
Purpose: Replace the old synthetic anonymous holdout with a real, diverse, anonymized benchmark

Context

The previous anonymous holdout was synthetic, overly easy, and permanently inflated the benchmark. This rework replaced it with a holdout built from real gold-set columns anonymized with generic names.

Problem

The old anonymous holdout was not testing anything useful:

  • 86 fully synthetic entries

  • 76% concentrated on only 2 concepts

  • only 10 concepts covered out of ~45 coarse concepts at the time

  • the values branch could score highly simply by predicting generic continuous-float behavior

  • 10% of ProductScore was therefore being inflated by a weak benchmark

Implemented solution

New _build_anonymous_holdout()

In ml/scripts/data/build_gold_set.py:

  • stratified sampling of real columns by concept_coarse

  • minimum of 2 entries per concept

  • fixed seed 42

  • generic randomized names from a large anonymous name pool

  • dedicated quality tag: gold_anonymous
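The steps above can be sketched as follows. This is a minimal illustration, not the actual `_build_anonymous_holdout()`: the column schema (`concept_coarse`, `values` keys) and the `ANON_NAME_POOL` constant are hypothetical stand-ins for whatever build_gold_set.py really uses.

```python
import random
from collections import defaultdict

SEED = 42
MIN_PER_CONCEPT = 2
# Hypothetical stand-in for the project's large anonymous name pool.
ANON_NAME_POOL = [f"col_{i:03d}" for i in range(500)]

def build_anonymous_holdout(columns, min_per_concept=MIN_PER_CONCEPT, seed=SEED):
    """Stratified sample of real gold-set columns by concept_coarse.

    Assumes `columns` is a list of dicts with "concept_coarse" and
    "values" keys; the real schema in build_gold_set.py may differ.
    """
    rng = random.Random(seed)  # fixed seed 42 for reproducibility
    by_concept = defaultdict(list)
    for col in columns:
        by_concept[col["concept_coarse"]].append(col)

    names = list(ANON_NAME_POOL)
    rng.shuffle(names)  # generic randomized names, each used once

    holdout = []
    for concept in sorted(by_concept):  # deterministic iteration order
        cols = by_concept[concept]
        # uniform floor: at least min_per_concept entries per concept
        for col in rng.sample(cols, min(min_per_concept, len(cols))):
            holdout.append({
                "name": names.pop(),           # anonymized header
                "values": col["values"],       # real values, untouched
                "concept_coarse": concept,
                "quality": "gold_anonymous",   # dedicated quality tag
            })
    return holdout
```

With 61 concepts and a floor of 2, this yields the 122 entries and 122 unique names reported below.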

Removal of the old mechanism

  • removed HEADER_VARIANTS["anonymous"]

  • removed the legacy synthetic anonymous generation branches

Values-only anonymous metric

In ml/scripts/eval/evaluate.py:

  • _evaluate_holdout_score() now supports return_models=True

  • the anonymous block exposes an informative values-only metric

  • the anonymous bucket in ProductScore still uses the full fusion pipeline
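The values-only metric can be sketched like this. The `values_model` object and its `predict(values) -> concept` interface are assumptions for illustration; the real code obtains the trained models via `return_models=True` and its interfaces may differ.

```python
def values_only_accuracy(entries, values_model):
    """Fraction of anonymous entries whose concept the values branch
    alone recovers. Headers are deliberately ignored: with generic
    anonymized names only the values carry signal, so this isolates
    the values branch from the full fusion pipeline.
    """
    if not entries:
        return 0.0
    correct = sum(
        1 for e in entries
        if values_model.predict(e["values"]) == e["concept_coarse"]
    )
    return correct / len(entries)
```

On the new holdout this metric is reported as 94.3% (115/122); it is informative only and does not feed into ProductScore, which still scores the anonymous bucket through the full fusion pipeline.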

Results

New holdout distribution

| Metric            | Before            | After                         |
|-------------------|-------------------|-------------------------------|
| Anonymous entries | 86                | 122                           |
| Concepts covered  | 10                | 61                            |
| Quality           | synthetic         | gold_anonymous                |
| Distribution      | 76% on 2 concepts | 2 per concept (uniform floor) |
| Unique names      | 31                | 122                           |

Score changes

| Metric                | Before | After           | Delta |
|-----------------------|--------|-----------------|-------|
| ProductScore          | 81.82  | 81.32           | -0.50 |
| GlobalScore           | 80.79  | 81.72           | +0.93 |
| Anonymous holdout     | 100.0  | 91.02           | -8.98 |
| Anonymous values-only | n/a    | 94.3% (115/122) | n/a   |

Detailed ProductScore buckets:

| Bucket                   | Before | After | Delta |
|--------------------------|--------|-------|-------|
| tropical_field (30%)     | 69.01  | 68.71 | -0.30 |
| research_traits (15%)    | 75.49  | 77.08 | +1.59 |
| gbif_core_standard (20%) | 96.02  | 97.36 | +1.34 |
| gbif_extended (10%)      | 88.18  | 88.22 | +0.04 |
| en_field (15%)           | 78.46  | 78.30 | -0.16 |
| anonymous (10%)          | 100.0  | 91.02 | -8.98 |

Interpretation

  • The anonymous holdout became informative instead of artificially perfect.

  • The values branch alone was still strong on anonymized real columns.

  • The small ProductScore drop came entirely from replacing an inflated 100% bucket with a more realistic score.

  • The other buckets remained stable within normal run-to-run variance.

Bug fixed during the session

gold_anonymous entries were initially classified as real_records because they were not flagged as synthetic. That polluted the other holdouts and the primary protocol with anonymized duplicates. The fix explicitly excludes is_anonymous entries from both real_records and synthetic_records inside the protocol evaluation.
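The fix amounts to a three-way split instead of a two-way one. The function name and the `quality` field below are illustrative; the actual protocol code in evaluate.py keys on its own `is_anonymous` flag.

```python
def split_records(entries):
    """Sketch of the protocol-split fix (illustrative schema).

    gold_anonymous entries are neither real nor synthetic for protocol
    purposes: they hold real values under fake headers, so counting
    them as real_records polluted the other holdouts.
    """
    real_records, synthetic_records, anonymous_records = [], [], []
    for entry in entries:
        if entry.get("quality") == "gold_anonymous":   # is_anonymous
            anonymous_records.append(entry)            # excluded from both
        elif entry.get("quality") == "synthetic":
            synthetic_records.append(entry)
        else:
            real_records.append(entry)
    return real_records, synthetic_records, anonymous_records
```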

Modified files

| File                              | Change                                                         |
|-----------------------------------|----------------------------------------------------------------|
| ml/scripts/data/build_gold_set.py | new anonymous holdout builder, old synthetic mechanism removed |
| ml/scripts/eval/evaluate.py       | values-only anonymous metric, protocol exclusion fix           |
| ml/data/gold_set.json             | regenerated                                                    |

Reproduction commands

```shell
uv run python -m ml.scripts.data.build_gold_set
uv run python -m ml.scripts.eval.evaluate --model all --metric product-score
```