Statistical power
The probability that an experiment will detect an effect of a given size, if that effect actually exists. Conventionally set at 80%, meaning a properly powered test catches 4 out of 5 real wins.
Statistical power is the flip side of false negatives. If false-positive rate (α) is "how often we call a win that is not there," power (1 − β) is "how often we catch a win that is there." Set α at 5% and power at 80% — the convention — and a properly designed test will, on average, miss one in five real effects of the target size. That is not a bug; it is the cost-balanced outcome of how much sample size you can afford.
Power, sample size, MDE, and α are linked by a single equation. Move one and the others must move. Want higher power? Need more sample size, or a bigger MDE, or looser α. Most founder experiments are quietly underpowered — the team plugs in 95% confidence and a 5% MDE, sees the calculator demand 50,000 visitors per variant, and runs the test anyway with 2,000 because that is all they have. The result is a test that, even if the change works, has a 50%+ chance of returning a non-significant result.
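That linkage can be sketched with the standard two-proportion sample-size formula. The sketch below is a normal-approximation planning tool, not an exact method, and the function name is illustrative; it uses only the Python standard library:

```python
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided two-proportion z-test.

    Normal-approximation sketch: n = (z_alpha/2 + z_power)^2 * var / delta^2.
    """
    z = NormalDist().inv_cdf
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    delta = p2 - p1
    return (z(1 - alpha / 2) + z(power)) ** 2 * variance / delta ** 2

# 95% confidence, 80% power, 5% baseline, 5% relative MDE:
print(round(sample_size_per_variant(0.05, 0.05)))  # ≈ 122,000 per variant
```

Move any one input and the output moves with it: halve the MDE and the required sample roughly quadruples, which is why small MDEs are unaffordable at low baselines.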
When to use it
Compute power before launching any A/B test. If power at your achievable sample size is below 70%, redesign the experiment — pick a bigger MDE, change the surface, or run a directional time-boxed test instead of pretending you have a real A/B.
What this looks like in practice
The 80% power convention is a tradeoff: it accepts a 20% false-negative rate as the cost of keeping sample-size requirements feasible. Move to 90% power and your sample requirement jumps roughly 30%. Drop to 50% power and you essentially have a coin flip dressed up as a test. For most marketing experiments, 80% is fine; for kill/keep decisions on big features, 90% is worth the cost.
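The "roughly 30%" jump can be checked on the back of an envelope: holding α and MDE fixed, required sample size scales with (z_α/2 + z_power)². A minimal stdlib sketch:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf
alpha = 0.05

# Sample size scales with (z_{alpha/2} + z_{power})^2 when MDE is fixed.
n80 = (z(1 - alpha / 2) + z(0.80)) ** 2
n90 = (z(1 - alpha / 2) + z(0.90)) ** 2
print(n90 / n80)  # ≈ 1.34: about a third more sample for 90% power
```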
Power is the metric that exposes the founder-traffic problem most cleanly. With 1,000 visitors per variant and a 5% baseline, you have under 20% power to detect a 20% relative lift — meaning the test misses more than four out of five real wins of that size. Plug in your real numbers before launch (a sample-size calculator does this in seconds) and decide whether the test is worth running, or whether you should aim at a higher-converting upper-funnel surface where the math actually works.
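Running that pre-launch check takes a few lines. This is a normal-approximation sketch of achieved power for a two-sided two-proportion z-test (it ignores the negligible opposite-tail term; the function name is illustrative):

```python
from statistics import NormalDist

def power_at_sample_size(baseline, relative_lift, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test at a fixed n."""
    nd = NormalDist()
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    # Standard error of the difference in conversion rates.
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant) ** 0.5
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_effect = (p2 - p1) / se
    return nd.cdf(z_effect - z_alpha)

# 1,000 visitors per variant, 5% baseline, hoping to see a 20% relative lift:
print(round(power_at_sample_size(0.05, 0.20, 1000), 2))  # ≈ 0.16
```

If the number that comes back is nowhere near 0.80, the test as designed cannot deliver a verdict, whatever the dashboard ends up saying.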
Underpowered tests have a second, subtler cost: when they do return significant results, those results are biased upward. The "winning" effect size is inflated because only the lucky-large draws cross the significance threshold. Teams that ship based on underpowered wins are consistently disappointed when the production rollout does not reproduce the test's reported lift. Power matters for honesty, not just for catching wins.
A worked example
A test designed for 80% power, 5% α, 4% baseline conversion, and a 20% relative MDE requires roughly 10,000 visitors per variant. Running the same test with 2,000 visitors per variant cuts the actual power to roughly 25% — the test will miss about three out of four real 20% lifts. The math says: do not bother running until you can afford the sample, or accept that "no significant difference" means "we have no idea," not "the change did not work."
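A quick normal-approximation check of these inputs (exact figures vary a little with the calculator's method):

```python
from statistics import NormalDist

# Worked example: 4% baseline, 20% relative MDE (4.0% -> 4.8%).
nd = NormalDist()
p1, p2 = 0.04, 0.048
var = p1 * (1 - p1) + p2 * (1 - p2)
z_alpha, z_power = nd.inv_cdf(0.975), nd.inv_cdf(0.80)

# Visitors per variant needed for 80% power at two-sided alpha = 5%.
n_needed = (z_alpha + z_power) ** 2 * var / (p2 - p1) ** 2
print(round(n_needed))     # ≈ 10,300 per variant

# Actual power if you run with only 2,000 visitors per variant instead.
se = (var / 2000) ** 0.5
achieved = nd.cdf((p2 - p1) / se - z_alpha)
print(round(achieved, 2))  # ≈ 0.23
```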
Common mistakes
- Skipping the power calculation. A test you ran without checking power is a test you cannot interpret. "No significant difference" might mean the change did not work, or it might mean the test never could have detected it.
- Treating low power as acceptable for "directional" reads. A 30%-powered test cannot reliably tell you direction either; the noise dominates the signal. If you cannot afford power, do not pretend the result is a verdict — use a different design.
- Boosting power by inflating the MDE. Setting the MDE arbitrarily large makes the calculator return a smaller sample size, but it hides the problem instead of solving it. The real MDE is the smallest effect you can act on; pretending you only care about bigger effects just delays the disappointment.
Related terms
- Minimum detectable effect (MDE)
- False-positive rate (α)
- Sample size
Pick a hypothesis. Vocabulary done.
The fastest way to learn this vocabulary is to commit one experiment. The contract takes about five minutes to write.