False positive
An experiment result that declares a win when no real effect exists — the variant simply got lucky in this draw of users. Controlled by the significance threshold (α); the cost of being wrong determines how strict to set it.
A false positive — also called a Type I error — is when an experiment crosses the statistical-significance threshold even though the true effect is zero. Random variation in user behavior produces some noise; a chunk of that noise will, by chance, look like signal. The 5% (p < 0.05) convention means: if the true effect were zero, the test would call it a "win" 5% of the time anyway. That 5% is your false-positive rate by construction.
False positives are not rare freak events; they are baked into the math. If a team runs 20 A/B tests on changes that secretly do nothing, on average one will declare a "winning" lift just by luck. Multiply that across a year of experiments and false-positive ships explain a noticeable share of the "we shipped it but the metric never moved" story. Pre-commit the significance threshold in the contract; do not negotiate it after the data arrives.
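A minimal simulation makes the "baked into the math" point concrete. Every number here (sample size, metric distribution, the choice of a t-test) is an illustrative assumption, not from the text; the only claim is that two identical arms cross p < 0.05 about 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_users, alpha = 20, 2_000, 0.05

false_wins = 0
for _ in range(n_tests):
    # Both arms draw from the same distribution: the true effect is zero.
    control = rng.normal(loc=0.10, scale=0.30, size=n_users)
    variant = rng.normal(loc=0.10, scale=0.30, size=n_users)
    result = stats.ttest_ind(control, variant)
    false_wins += result.pvalue < alpha

print(f"{false_wins} of {n_tests} no-effect tests crossed p < {alpha}")
# Expected count is alpha * n_tests: about 1 false "win" per 20 tests.
```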
When to use it
Reason about false positives any time you set the significance threshold for an experiment. Use stricter thresholds (α = 0.01) for high-stakes, irreversible changes and looser ones (α = 0.10) for cheap, reversible UI tweaks. Pick deliberately, in the contract.
What this looks like in practice
The most insidious source of false positives is "peeking" — checking results daily and stopping the test the moment p drops below 0.05. Doing that inflates the true false-positive rate from 5% to roughly 30% over a typical test window, because you took 20+ chances to cross the threshold and only stopped on a hit. Fix: pre-commit the test duration, or use sequential testing methods (mSPRT, group sequential designs) that account for repeated looks.
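A rough simulation of the peeking effect, under assumed traffic numbers (20 days, 200 users per arm per day) that are not from the text. The exact inflation depends on how many looks you take, but the pattern holds: stopping on the first hit multiplies the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_days, users_per_day, alpha = 2_000, 20, 200, 0.05

stopped_on_a_hit = 0
for _ in range(n_sims):
    control = np.empty(0)
    variant = np.empty(0)
    for _ in range(n_days):
        # True effect is zero: both arms share the same distribution.
        control = np.concatenate([control, rng.normal(0, 1, users_per_day)])
        variant = np.concatenate([variant, rng.normal(0, 1, users_per_day)])
        if stats.ttest_ind(control, variant).pvalue < alpha:
            stopped_on_a_hit += 1   # peeked, saw p < 0.05, stopped the test
            break

print(f"false-positive rate with daily peeking: {stopped_on_a_hit / n_sims:.0%}")
# Prints well above the nominal 5% (roughly 20-30%, depending on looks).
```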
Multiple-comparisons inflation is the second silent killer. A test with one variant and one metric is testing one hypothesis; a test with three variants and four metrics is testing twelve hypotheses. The probability that at least one of those twelve clears p < 0.05 by chance is around 46%, not 5%. The Bonferroni correction (divide α by number of tests) is the simplest fix; the false-discovery-rate (FDR) method is the smarter one. Either is better than ignoring it.
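The familywise arithmetic is short enough to check directly; the sketch below also shows the Bonferroni fix. The 12-hypothesis count mirrors the three-variants-by-four-metrics example above.

```python
alpha, m = 0.05, 12   # three variants x four metrics = 12 hypotheses

# Chance that at least one of m independent null tests clears alpha:
familywise_rate = 1 - (1 - alpha) ** m
print(f"P(at least one false positive) = {familywise_rate:.0%}")   # ~46%

# Bonferroni: test each hypothesis at alpha / m instead.
per_test_alpha = alpha / m
corrected_rate = 1 - (1 - per_test_alpha) ** m
print(f"per-test alpha = {per_test_alpha:.4f}, "
      f"familywise rate = {corrected_rate:.1%}")   # back near 5%
```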
False positives interact with success thresholds in your contract. A small lift that scrapes p < 0.05 might be a false positive OR a real-but-tiny effect; either way, if the lift is below the operational success threshold, the verdict is the same: do not ship. Pre-committing both the statistical threshold AND the practical-significance threshold protects against shipping noise dressed up as signal.
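Here is a sketch of what pre-committing both thresholds can look like as a decision rule. The function name and the threshold values (α = 0.05, a 2% minimum lift) are hypothetical contract values, not anything the text prescribes.

```python
def ship_verdict(p_value: float, observed_lift: float,
                 alpha: float = 0.05, min_lift: float = 0.02) -> str:
    """Apply both pre-committed thresholds from the contract.

    alpha and min_lift are hypothetical contract values; they must be
    fixed before the experiment starts, not after the data arrives.
    """
    if p_value >= alpha:
        return "no ship: not statistically significant"
    if observed_lift < min_lift:
        return "no ship: significant but below the practical threshold"
    return "ship"

# A lift that scrapes p < 0.05 but is operationally tiny still fails:
print(ship_verdict(p_value=0.04, observed_lift=0.004))
```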
A worked example
A team runs 30 marketing experiments in a year, each with α = 0.05 (a 5% false-positive rate). Even if half of those experiments are testing changes with no real effect, statistically they will see roughly 0.05 × 15 = 0.75 false "wins" — call it one shipped change per year that "worked" only because the noise happened to fall the right way. The fix is not eliminating false positives (impossible) but pricing them in: a lower α for high-stakes decisions, a higher α for cheap, reversible ones.
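The pricing-in logic in one loop, with illustrative α levels:

```python
n_null_tests = 15   # experiments whose true effect is zero

for alpha in (0.10, 0.05, 0.01):
    expected_false_wins = alpha * n_null_tests
    print(f"alpha = {alpha:.2f} -> ~{expected_false_wins:.2f} false wins/year")
```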
Common mistakes
- Treating p < 0.05 as proof. p < 0.05 means a result this extreme would show up 5% of the time even if nothing were going on. Across many tests, that 5% adds up. Significance is necessary, not sufficient.
- Peeking and stopping early. Continuously monitoring and stopping when p first crosses the threshold dramatically inflates false positives. Either run the full duration or use sequential methods designed for early stopping.
- Ignoring multiple comparisons. Testing many metrics or many variants without correcting α compounds false positives quickly. Bonferroni or FDR — pick one, document it in the contract (a minimal FDR sketch follows this list).
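For the FDR option, here is a minimal Benjamini-Hochberg sketch. It is the standard step-up procedure; the p-values passed in are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up: boolean mask of discoveries,
    controlling the false-discovery rate at level q."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    k = int(np.max(np.nonzero(below)[0]) + 1) if below.any() else 0
    discoveries = np.zeros(m, dtype=bool)
    discoveries[order[:k]] = True   # reject the k smallest p-values
    return discoveries

# Made-up p-values from several metrics in one test:
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.27, 0.6]))
```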
Pick a hypothesis. Vocabulary done.
The fastest way to learn this vocabulary is to commit to one experiment. The contract takes about five minutes to write.