Frequentist statistics
The dominant statistical framework for A/B testing — answers "if there were no effect, how often would I see data this extreme?" via p-values and confidence intervals. The framework most legacy experimentation platforms are built on.
Frequentist statistics treats probability as the long-run frequency of an outcome over many imagined repetitions of an experiment. A p-value of 0.04 means "if the true effect were zero, I would see data at least this extreme 4% of the time across infinite repetitions of this test." The framework underpins p-values, confidence intervals, hypothesis testing, and the entire vocabulary of significance that most teams learned first.
The frequentist framing is rigorous but counterintuitive — most stakeholders hearing "p = 0.04" think it means "96% chance the variant works," which it does not. It means "if there were no effect, this result would be a 1-in-25 fluke." The gap between the two interpretations causes most of the confusion in experiment review meetings. Frequentist methods are still the institutional default in academic research, regulated industries, and older experimentation platforms; the Bayesian alternative has been gaining ground in newer tools.
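The "infinite repetitions" framing can be made concrete with a quick simulation. The sketch below (illustrative conversion rate and sample size, not from any real test) runs many A/A tests where the true effect is zero and counts how often a "significant" z-statistic appears anyway:

```python
import random

random.seed(0)

# Simulate an A/A test: both arms share the same true conversion rate,
# so the true effect is zero. Rate and sample size are illustrative.
def aa_test_z(rate=0.10, n=2000):
    a = sum(random.random() < rate for _ in range(n))
    b = sum(random.random() < rate for _ in range(n))
    p_pool = (a + b) / (2 * n)
    se = (2 * p_pool * (1 - p_pool) / n) ** 0.5
    if se == 0:
        return 0.0
    return abs(a / n - b / n) / se  # two-proportion z-statistic

# |z| >= 1.96 corresponds to a two-sided p < 0.05
trials = 1000
false_positives = sum(aa_test_z() >= 1.96 for _ in range(trials))
print(false_positives / trials)  # hovers near 0.05, by construction
```

That ~5% is exactly what p < 0.05 controls: the long-run rate of flukes when nothing is going on, not the probability that any particular variant works.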
When to use it
Stick with frequentist methods if your platform is built on them, your audit/regulatory environment expects them, or your team is fluent in the vocabulary. Switch to Bayesian when stakeholders consistently misinterpret p-values or when peeking is a recurring temptation.
What this looks like in practice
The most common frequentist mistake is interpreting p < 0.05 as "95% chance the variant works." It does not say that — it says "if the true effect were zero, this result would happen 5% of the time." The probability of the variant being better given the data (the question stakeholders actually want answered) is a different calculation that requires a prior, which frequentist methods deliberately do not use. The mismatch between "what you can answer" and "what you want to answer" is the philosophical complaint about frequentism.
Confidence intervals are the second source of confusion. A 95% confidence interval of [1.2%, 10.8%] does NOT mean "95% chance the true effect is in that range." It means "if I ran this test infinite times, 95% of the resulting intervals would contain the true effect." For a single test, the true effect is either in the interval or it is not — you just do not know which. The interval is a property of the procedure, not a probabilistic statement about the parameter.
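"The interval is a property of the procedure" can also be checked by simulation. This sketch (hypothetical true rate and sample size) builds a 95% Wald interval for a conversion rate over and over and counts how often the interval captures the true value:

```python
import random

random.seed(1)

# Coverage check for a 95% confidence interval. The true rate is known
# here only because we chose it; in a real test you never observe it.
TRUE_RATE = 0.08
N = 2000

def wald_ci(successes, n, z=1.96):
    p_hat = successes / n
    se = (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - z * se, p_hat + z * se

trials = 2000
covered = 0
for _ in range(trials):
    successes = sum(random.random() < TRUE_RATE for _ in range(N))
    lo, hi = wald_ci(successes, N)
    if lo <= TRUE_RATE <= hi:
        covered += 1

print(covered / trials)  # close to 0.95 across repetitions
```

The ~95% holds across the repetitions; any single interval either contains the true rate or it does not.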
Despite the interpretation pitfalls, frequentist methods have one major operational advantage: they are well-defined, well-audited, and accepted everywhere. Switching to Bayesian methods means picking priors, defending those priors to skeptical stakeholders, and educating the team in a new vocabulary. Most marketing teams should pick one framework, document it in their experiment process, and stick with it. The framework matters less than consistent rigor inside it.
A worked example
A landing-page test runs for two weeks. Frequentist read: "p = 0.03, observed lift 6%, 95% confidence interval [1.2%, 10.8%]." That means: assuming the true effect is zero, this result would happen 3% of the time; the observed lift is 6%; if you ran this test infinite times, 95% of the resulting confidence intervals would contain the true effect. Operationally, that translates to "ship" if your decision contract said a clear p < 0.05 ships — but the interpretation requires care.
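Here is how the three numbers in a read like that are computed, using the standard two-proportion z-test. The conversion counts below are hypothetical, chosen for illustration; they do not reproduce the exact figures in the example above:

```python
from math import erf, sqrt

# Hypothetical landing-page counts (control vs variant), for illustration only
conv_a, n_a = 1000, 20000   # control:  5.0% conversion
conv_b, n_b = 1100, 20000   # variant:  5.5% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
lift = p_b - p_a

# Test statistic uses the pooled standard error (assumes the null: no effect)
p_pool = (conv_a + conv_b) / (n_a + n_b)
se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = lift / se_pool
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided

# 95% CI for the difference uses the unpooled standard error
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
ci = (lift - 1.96 * se, lift + 1.96 * se)

print(f"lift={lift:.4f}  z={z:.2f}  p={p_value:.3f}  "
      f"CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

With these counts the absolute lift is 0.5 points, p lands under 0.05, and the interval excludes zero — the same shape of result as the worked example.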
Common mistakes
- Misinterpreting p-values as probability of being right. p = 0.04 does not mean "96% chance the variant works." It means "if the variant did nothing, I would see data this extreme 4% of the time." Different question, different answer.
- Treating confidence intervals as Bayesian credible intervals. A 95% confidence interval is a statement about the procedure, not the parameter. Bayesian credible intervals make the probability statement people want; confidence intervals do not.
- Peeking under frequentist methods. Frequentist p-values assume a fixed sample size. Stopping early when p crosses the threshold inflates the actual false-positive rate from 5% to ~30%. Use sequential designs (mSPRT, group sequential) if you need to peek.
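The peeking inflation is easy to demonstrate. This sketch (illustrative rate, sample size, and peek schedule) runs A/A tests, checks the z-statistic at regular intervals, and "ships" the moment it crosses 1.96 — every ship is a false positive because the true effect is zero:

```python
import random

random.seed(2)

# A/A test with peeking: check significance every `peek_every` visitors
# per arm and stop as soon as the threshold is crossed. All parameters
# are illustrative.
def peeking_trial(rate=0.10, max_n=6000, peek_every=300):
    a = b = 0
    for i in range(1, max_n + 1):
        a += random.random() < rate
        b += random.random() < rate
        if i % peek_every == 0:
            p_pool = (a + b) / (2 * i)
            se = (2 * p_pool * (1 - p_pool) / i) ** 0.5
            if se > 0 and abs(a - b) / i / se >= 1.96:
                return True  # stopped early: a false positive
    return False

trials = 400
fp_rate = sum(peeking_trial() for _ in range(trials)) / trials
print(fp_rate)  # well above the nominal 0.05
```

With twenty looks at the data, the realized false-positive rate lands several times above the nominal 5% — which is why fixed-sample p-values and peeking do not mix.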
Pick a hypothesis. Vocabulary done.
The fastest way to learn this vocabulary is to commit one experiment. The contract takes about five minutes to write.