# Anonymous Holdout Rework (2026-03-21)

> Status: Experiment
> Audience: Team, AI agents
> Purpose: Replace the old synthetic anonymous holdout with a real, diverse,
> anonymized benchmark

## Context

The previous anonymous holdout was synthetic, overly easy, and permanently
inflated the benchmark. This rework replaced it with a holdout built from real
gold-set columns anonymized with generic names.

## Problem

The old anonymous holdout was not testing anything useful:

- 86 fully synthetic entries
- 76% concentrated on only 2 concepts
- only 10 concepts covered out of ~45 coarse concepts at the time
- the values branch could score highly by predicting generic continuous-float behavior
- 10% of `ProductScore` was therefore being inflated by a weak benchmark

## Implemented solution

### New `_build_anonymous_holdout()`

In `ml/scripts/data/build_gold_set.py`:

- stratified sampling of real columns by `concept_coarse`
- minimum of 2 entries per concept
- fixed seed `42`
- generic randomized names from a large anonymous name pool
- dedicated quality tag: `gold_anonymous`

### Removal of the old mechanism

- removed `HEADER_VARIANTS["anonymous"]`
- removed the legacy synthetic anonymous generation branches

### Values-only anonymous metric

In `ml/scripts/eval/evaluate.py`:

- `_evaluate_holdout_score()` now supports `return_models=True`
- the anonymous block exposes an informative values-only metric
- the `anonymous` bucket in `ProductScore` still uses the full fusion pipeline

## Results

### New holdout distribution

| Metric | Before | After |
|--------|:------:|:-----:|
| Anonymous entries | 86 | **122** |
| Concepts covered | 10 | **61** |
| Quality | `synthetic` | `gold_anonymous` |
| Distribution | 76% on 2 concepts | **2 per concept (uniform floor)** |
| Unique names | 31 | **122** |

### Score changes

| Metric | Before | After | Delta |
|--------|:------:|:-----:|:-----:|
| **ProductScore** | 81.82 | **81.32** | **-0.50** |
| **GlobalScore** | 80.79 | **81.72** | **+0.93** |
| Anonymous holdout | 100.0 | **91.02** | **-8.98** |
| Anonymous values-only | n/a | **94.3%** (115/122) | — |

Detailed `ProductScore` buckets:

| Bucket | Before | After | Delta |
|--------|:------:|:-----:|:-----:|
| tropical_field (30%) | 69.01 | 68.71 | -0.30 |
| research_traits (15%) | 75.49 | 77.08 | +1.59 |
| gbif_core_standard (20%) | 96.02 | 97.36 | +1.34 |
| gbif_extended (10%) | 88.18 | 88.22 | +0.04 |
| en_field (15%) | 78.46 | 78.30 | -0.16 |
| anonymous (10%) | 100.0 | **91.02** | **-8.98** |

## Interpretation

- The anonymous holdout became informative instead of artificially perfect.
- The values branch alone was still strong on anonymized real columns.
- The small `ProductScore` drop came entirely from replacing an inflated 100%
  bucket with a more realistic score.
- The other buckets remained stable within normal run-to-run variance.

### Bug fixed during the session

`gold_anonymous` entries were initially treated as `real_records` because they
were not `synthetic`. That polluted other holdouts and the primary protocol.
The fix explicitly excluded `is_anonymous` entries from `real_records` and
`synthetic_records` inside the protocol evaluation.

## Modified files

| File | Change |
|------|--------|
| `ml/scripts/data/build_gold_set.py` | new anonymous holdout builder, old synthetic mechanism removed |
| `ml/scripts/eval/evaluate.py` | values-only anonymous metric, protocol exclusion fix |
| `ml/data/gold_set.json` | regenerated |

## Reproduction commands

```bash
uv run python -m ml.scripts.data.build_gold_set

uv run python -m ml.scripts.eval.evaluate --model all --metric product-score
```