// OUTPUTS

Statistical significance

A measure of how unlikely the difference between your experiment's variant and control would be if it were just random noise. Conventionally measured by a p-value below 0.05, but the threshold is a choice, not a law.

// what it is

Statistical significance asks a precise question: "If the variant and control were actually identical, how often would I see a difference at least this big just by chance?" A p-value of 0.04 means 4% of the time. The 5% threshold (p < 0.05) is convention, not physics. Some teams pick 1% for high-stakes decisions and 10% for low-stakes reversible ones.
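To make that concrete, here is a minimal sketch of the arithmetic, assuming a standard two-proportion z-test and made-up traffic numbers:

```python
from math import sqrt, erf

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates
    (standard two-proportion z-test, normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # "If the two were identical, how often would a gap at least this big appear?"
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Made-up numbers: control converts 500/10,000, variant 560/10,000.
print(round(two_proportion_p_value(500, 10_000, 560, 10_000), 3))  # ~0.058: close, but not under 0.05
```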

Statistical significance does not mean the result is "real" — it means the result is "unlikely under the null hypothesis." Those are different. A statistically significant lift of 0.3% might still be operationally meaningless. A non-significant lift of 12% might still be the best thing you will learn this quarter. A real verdict needs both the math (significance) and the business floor (the success threshold in your contract).

// when this matters

When to use it

Any time you have enough traffic to run a real A/B test with a credible MDE. For founder-scale experiments, significance is often unreachable — favor a pre-committed kill/success threshold and a clear time box instead of waiting for a p-value that will not arrive.
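A rough way to check whether significance is even reachable is to estimate how many users a credible MDE demands. The sketch below uses the usual normal-approximation sample-size formula (alpha 0.05 two-sided, 80% power); the baseline rate and MDE are hypothetical:

```python
from math import sqrt, ceil

def sample_size_per_arm(baseline, relative_mde, z_alpha=1.96, z_power=0.84):
    """Rough users needed per arm to detect a relative lift (the MDE)
    at alpha = 0.05 two-sided with 80% power (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Hypothetical early-stage funnel: 3% baseline conversion, 10% relative MDE.
print(sample_size_per_arm(0.03, 0.10))  # roughly 53,000 users per arm
```

If that number is several multiples of your monthly traffic, the time box and the kill/success threshold are doing the work, not the p-value.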

// deeper

What this looks like in practice

The 5% (p < 0.05) convention exists because Ronald Fisher proposed it in 1925 — not because it is mathematically optimal. For experiments where being wrong costs little (small UI tweaks), 10% is reasonable. For experiments where being wrong costs a lot (price changes affecting all customers), 1% is reasonable. Pick the threshold in the contract based on the cost of a false positive, not by reflex.

Statistical significance and practical significance are not the same. With enough traffic, every tiny difference becomes "significant" — a 0.1% lift on 10 million users will clear p < 0.001. That does not mean ship. The success threshold in your contract is the practical-significance line; significance is the math under it, not a substitute for it.
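A quick sketch of why scale manufactures significance, assuming the 0.1% is a 0.1-percentage-point lift on a 10% baseline, split across 5 million users per arm:

```python
from math import sqrt, erf

# Assumed scenario: 10% baseline, variant at 10.1%, 5 million users in each arm.
n = 5_000_000
p1, p2 = 0.100, 0.101
pooled = (p1 + p2) / 2
se = sqrt(pooled * (1 - pooled) * 2 / n)
z = (p2 - p1) / se
p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
print(f"z = {z:.1f}, p = {p_value:.0e}")  # z ~ 5.3, p far below 0.001
```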

Frequentist significance (p-values) and Bayesian significance (probability the variant beats control) answer different questions. p < 0.05 frequentist is roughly equivalent to about 95% Bayesian probability of being better. The choice matters less than people argue. What matters is that the team committed to one threshold, in one framework, before the data arrived — and stuck with it.
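For comparison, here is one common way to compute the Bayesian version: Monte Carlo draws from Beta posteriors. The flat Beta(1, 1) prior and the traffic numbers are assumptions for illustration, not a prescribed method:

```python
import random

def prob_variant_beats_control(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(variant rate > control rate) under
    independent Beta(1, 1) priors on each conversion rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Same made-up numbers as the frequentist sketch: 500/10,000 vs 560/10,000.
print(round(prob_variant_beats_control(500, 10_000, 560, 10_000), 2))  # ~0.97
```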

// example

A worked example

// EXAMPLE

An A/B test on a checkout page shows a 6% lift in conversions with a p-value of 0.03. Statistical significance: yes (p < 0.05). The verdict still requires comparing the 6% lift against the success threshold set in the contract — if you committed to a 5% floor, ship; if you committed to 10%, the verdict is review.
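Encoded as a toy decision rule (the function name and the ship/review/kill labels are illustrative, not a prescribed API), the same logic looks like this:

```python
def verdict(lift, p_value, success_threshold, alpha=0.05):
    """Toy verdict: the math (significance) AND the business floor
    (the success threshold from the contract) both have to clear."""
    significant = p_value < alpha
    clears_floor = lift >= success_threshold
    if significant and clears_floor:
        return "ship"
    if significant:
        return "review"  # real, but below the floor you committed to
    return "kill or extend"

print(verdict(lift=0.06, p_value=0.03, success_threshold=0.05))  # ship
print(verdict(lift=0.06, p_value=0.03, success_threshold=0.10))  # review
```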

// pitfalls

Common mistakes

  • Peeking. Stopping the test the first day p drops below 0.05 inflates false positives dramatically. Either run the full pre-committed duration, or use sequential testing methods designed for early stopping (see the simulation sketch after this list).
  • Treating non-significant as "no effect." Non-significant usually means the test did not have enough power to detect the effect. The effect may still be there; you just cannot resolve it with this much data.
  • Optimizing for significance instead of impact. A team that ships 50 statistically significant 0.3% lifts has a metric that drifts up and a product that nobody noticed change. Significance is a guardrail, not the goal.
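To see how badly peeking inflates false positives, here is a small simulation sketch: it runs A/A tests (variant and control identical by construction) and compares checking the p-value once at the end against checking it every day. The traffic numbers and run count are arbitrary:

```python
import random
from math import sqrt, erf

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test, normal approximation."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) or 1e-12
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def false_positive_rate(peek_daily, days=14, users_per_day=300, rate=0.05,
                        runs=1000, seed=1):
    """Fraction of A/A tests (no real difference) declared significant at p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        ca = cb = na = nb = 0
        stopped_early = False
        for _ in range(days):
            na += users_per_day
            nb += users_per_day
            ca += sum(rng.random() < rate for _ in range(users_per_day))
            cb += sum(rng.random() < rate for _ in range(users_per_day))
            if peek_daily and p_value(ca, na, cb, nb) < 0.05:
                stopped_early = True
                break
        hits += stopped_early or p_value(ca, na, cb, nb) < 0.05
    return hits / runs

print(false_positive_rate(peek_daily=False))  # ~0.05, as advertised
print(false_positive_rate(peek_daily=True))   # well above 0.05: peeking inflates it
```
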
// related

Related terms

Pick a hypothesis. Vocabulary done.

The fastest way to learn this vocabulary is to commit one experiment. The contract takes about five minutes to write.