// THE CONTRACT

Holdout

A group of users deliberately excluded from a treatment so the team can measure long-term incremental impact. The control variant for experiments whose effects only show up over weeks or months.

// what it is

A holdout is a randomly assigned slice of the audience that does not receive a treatment, kept separate so that long-term effects can be measured against a clean baseline. Standard A/B tests run for days or weeks; holdouts run for months or quarters. The point is to answer "did this campaign / this feature / this re-engagement series actually generate incremental value, or would those users have converted anyway?"
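One common way to get a stable random slice is to hash each user ID with a salt, so the same user always lands in the same arm without storing an assignment table. The sketch below assumes string user IDs; the function name, the salt value, and the 10,000-bucket resolution are illustrative choices, not part of any specific tool.

```python
import hashlib

def in_holdout(user_id: str, salt: str, holdout_fraction: float) -> bool:
    """Deterministically decide whether a user belongs to the holdout.

    The salt (hypothetical, one per holdout) keeps this assignment
    independent of any other experiment that hashes the same user IDs.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to one of 10,000 buckets (0.01% steps).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < round(holdout_fraction * 10_000)


# Example: a 5% holdout. The same user always gets the same answer.
is_held_out = in_holdout("user-12345", salt="push-reengagement-holdout", holdout_fraction=0.05)
```

Because the assignment is a pure function of the user ID and the salt, it can be recomputed at send time and again at analysis time, which makes it harder for the two to drift apart.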

Holdouts are how mature growth teams measure things that A/B tests cannot — the cumulative effect of a year of email re-engagement, the lifetime impact of a loyalty program, the decay rate of brand-marketing campaigns. The key discipline is treating the holdout as sacred: do not shrink it because you are leaving "free revenue on the table," do not contaminate it with messaging through other channels, and do not end it early when the directional read goes the way you hoped.

// when this matters

When to use it

Use a holdout when the effect of the treatment unfolds over weeks or months — re-engagement programs, loyalty mechanics, brand campaigns, lifecycle email. Standard A/B tests answer "did this lift the immediate metric"; holdouts answer "did this generate incremental long-term value."

// deeper

What this looks like in practice

Holdout sizing trades off statistical power against the revenue you give up. A 10% holdout on a $1M/year program gives up roughly $100K (if the program works) but produces a confident incrementality read; a 1% holdout costs almost nothing but is usually too small to draw conclusions from. Most teams settle in the 5 to 10% range. Power sets the floor, but the final number is a business decision: what is the cost of being wrong about this program for another year?
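The trade-off can be made concrete with a rough calculation. The sketch below pairs the revenue given up with the smallest retention or conversion difference a holdout of that size could reliably detect, using the standard normal-approximation formula for comparing two proportions. The audience size (200,000) and baseline rate (30%) are assumptions for illustration; only the $1M/year program and the 10% and 1% splits come from the text above.

```python
from math import sqrt
from statistics import NormalDist

def forgone_revenue(program_value_per_year: float, holdout_fraction: float) -> float:
    """Revenue given up per year, assuming the program really delivers its value."""
    return program_value_per_year * holdout_fraction

def minimum_detectable_effect(n_users: int, holdout_fraction: float,
                              baseline_rate: float,
                              alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate smallest absolute difference in rates the holdout can detect.

    Two-sided test on two proportions with unequal arm sizes,
    normal approximation, same baseline variance in both arms.
    """
    n_holdout = n_users * holdout_fraction
    n_treated = n_users - n_holdout
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    se = sqrt(baseline_rate * (1 - baseline_rate) * (1 / n_holdout + 1 / n_treated))
    return z * se


# Hypothetical audience: 200,000 eligible users, 30% baseline retention.
for fraction in (0.10, 0.01):
    cost = forgone_revenue(1_000_000, fraction)
    mde = minimum_detectable_effect(200_000, fraction, baseline_rate=0.30)
    print(f"{fraction:.0%} holdout: gives up ${cost:,.0f}/yr, detects about {mde:.2%} or larger")
```

Under these assumptions the 1% holdout needs an effect roughly three times larger than the 10% holdout before it registers, which is the sense in which it is "too small" for modest programs.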

Universal holdouts vs. program-specific holdouts. A universal holdout is a single slice of users (say 5%) who never receive any growth treatment, which gives the cleanest measure of all programs combined. A program-specific holdout excludes users from one treatment only: it measures that program in isolation, but its users still receive every other program, so the read is incremental over the rest of the portfolio. Universal is more rigorous; program-specific is more practical for portfolios with many overlapping treatments. Pick deliberately and document the choice.

The failure that ruins most holdouts is leakage. Holdout users still receive product updates, see organic posts, get exposed to billboards, and respond to public pricing changes. What the holdout isolates is the additional treatment, not all influence. Specify what is held out and what is not in the contract, and assume any unmeasured channel reaches both arms. The holdout still tells you something useful, but the read is "incremental over baseline channels," not "all attribution from this program."

// example

A worked example

A consumer-app team holds 5% of eligible users out of all push-notification re-engagement for six months. At month six, the held-out cohort's 90-day retention is 8 percentage points lower than the treated cohort's, meaning the push program drove an incremental 8 percentage points of retention, worth quantifiable lifetime revenue. Without the holdout, the team would have argued forever about whether push was working or whether it was just nagging engaged users.
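A read like this is usually reported with an interval, not just a point estimate. The sketch below computes the difference in retention between the two cohorts and a normal-approximation confidence interval. The cohort sizes and retention counts are invented for illustration (100,000 eligible users, 40% vs. 32% retention); the example above only specifies the 5% split and the 8-point gap.

```python
from math import sqrt
from statistics import NormalDist

def incremental_lift(retained_treated: int, n_treated: int,
                     retained_holdout: int, n_holdout: int,
                     confidence: float = 0.95) -> tuple[float, float, float]:
    """Absolute difference in retention (treated minus holdout) with a confidence interval."""
    p_treated = retained_treated / n_treated
    p_holdout = retained_holdout / n_holdout
    diff = p_treated - p_holdout
    # Unpooled standard error of the difference between two proportions.
    se = sqrt(p_treated * (1 - p_treated) / n_treated
              + p_holdout * (1 - p_holdout) / n_holdout)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: 95,000 treated users (38,000 retained),
# 5,000 held-out users (1,600 retained).
lift, low, high = incremental_lift(38_000, 95_000, 1_600, 5_000)
print(f"incremental retention: {lift:.1%} (95% CI {low:.1%} to {high:.1%})")
```

With these made-up counts the interval runs from about 6.7 to 9.3 percentage points, comfortably away from zero. A much smaller holdout would widen it considerably.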

// pitfalls

Common mistakes

  • Shrinking the holdout when revenue gets tight. A holdout that gets cut from 10% to 2% the first quarter the team needs to "hit the number" stops being a holdout. The whole point is the discipline of leaving it alone for the full window.
  • Reading the holdout early and stopping. Long-term effects can be non-monotonic; the read at month 1 may reverse by month 4. Pre-commit the read window in the contract, and do not peek before it closes, just as you would not peek at an A/B test.
  • Ignoring contamination across channels. If your holdout group also sees ads, gets pricing-page updates, and reads public blog content, the holdout measures incrementality over those channels, not absolute lift. Specify the boundary in the contract or the verdict is misleading.
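The first two mistakes can be guarded against mechanically by writing the size and read date down once and refusing to change or bypass them. The sketch below is a hypothetical in-code representation of such a contract; the class name and fields are assumptions for illustration, not the schema of any particular tool.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)  # frozen: the fraction and dates cannot be edited after creation
class HoldoutContract:
    program: str
    holdout_fraction: float
    start: date
    read_on: date  # pre-committed date of the first allowed read

    def check_read_allowed(self, today: date) -> None:
        """Raise if someone tries to read results before the committed window ends."""
        if today < self.read_on:
            days_left = (self.read_on - today).days
            raise RuntimeError(
                f"{self.program}: read window closes on {self.read_on}; "
                f"{days_left} days remain"
            )


contract = HoldoutContract(
    program="push_reengagement",   # hypothetical program name
    holdout_fraction=0.05,
    start=date(2024, 1, 1),
    read_on=date(2024, 7, 1),
)
contract.check_read_allowed(date(2024, 7, 1))  # passes: the window has closed
```

An analysis script that calls `check_read_allowed` before querying results turns "do not peek" from a norm into a default.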