// outputs

Bayesian A/B test

An A/B test analyzed with Bayesian methods: instead of a p-value, it outputs the probability that the variant beats the control. Easier to read, supports peeking, and aligns with the question business stakeholders actually ask.

// what it is

A Bayesian A/B test answers a different question than the frequentist version. The frequentist asks "if there were no real effect, how often would I see data this extreme?" (the p-value). The Bayesian asks "given the data I have, what is the probability the variant beats control?" The second question is the one stakeholders actually ask, which is part of why Bayesian methods have become the default in newer experimentation platforms.

Practically, a Bayesian test produces statements like "94% probability variant beats control, expected lift 8%, 95% credible interval [3%, 13%]." Those map cleanly to business decisions: ship if the probability of being better is above some threshold (often 95%), wait otherwise. The math handles peeking gracefully — checking results daily does not inflate the false-positive rate the way it does in frequentist tests — though it still rewards picking the read window before launch.
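
To make that concrete, here is a minimal sketch of the read for conversion rates, using Beta-Binomial conjugacy and Monte Carlo draws. The counts and the flat Beta(1, 1) prior are illustrative assumptions, not any platform's defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts: (conversions, visitors) per arm.
control_conv, control_n = 480, 10_000
variant_conv, variant_n = 540, 10_000

# Flat Beta(1, 1) prior; the posterior for a conversion rate is then
# Beta(1 + conversions, 1 + non-conversions).
post_control = rng.beta(1 + control_conv, 1 + control_n - control_conv, 200_000)
post_variant = rng.beta(1 + variant_conv, 1 + variant_n - variant_conv, 200_000)

lift = (post_variant - post_control) / post_control  # relative-lift samples
print(f"Pr(variant beats control): {(post_variant > post_control).mean():.1%}")
print(f"Expected relative lift:    {lift.mean():.1%}")
lo, hi = np.percentile(lift, [2.5, 97.5])
print(f"95% credible interval:     [{lo:.1%}, {hi:.1%}]")
```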

// when this matters

When to use it

Use Bayesian tests when stakeholders find p-values confusing (most teams), when you want graceful peeking semantics, or when you have informative priors from past tests. Stay frequentist when your platform is built for it or when audit/regulatory expectations require traditional methods.

// deeper

What this looks like in practice

The Bayesian framework requires a "prior" — a starting belief about the effect size before seeing the data. With weak priors (the platform default in most tools), the answer at typical sample sizes is roughly equivalent to the frequentist read. With informative priors (e.g., "we have run 50 tests on this surface and the average lift is 2%"), the prior pulls extreme results toward the prior mean, dampening false positives from underpowered tests. That dampening is a feature, not a bug — it codifies what experienced PMs do by gut.
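
A sketch of that pull, assuming conversion rates with Beta priors written as pseudo-counts; all numbers here are hypothetical:

```python
# Posterior mean of a conversion rate under a Beta prior written as
# pseudo-counts (prior_conv successes out of prior_n prior "visitors").
def posterior_mean(conv, n, prior_conv, prior_n):
    return (prior_conv + conv) / (prior_n + n)

conv, n = 12, 100  # a noisy 100-visitor arm showing a 12% raw rate

# Weak prior (~2 pseudo-visitors): the read stays near the raw 12%.
print(f"{posterior_mean(conv, n, 1, 2):.1%}")       # ~12.7%

# Informative prior (~2,000 past visitors at 5%): the read is pulled toward 5%.
print(f"{posterior_mean(conv, n, 100, 2000):.1%}")  # ~5.3%
```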

The peeking advantage is real but often misunderstood. Bayesian methods are not magically immune to over-interpretation; they handle peeking better than frequentist methods because the posterior probability is updated continuously rather than being conditioned on a fixed sample size. You can check daily and stop when probability-of-better crosses your decision threshold without inflating long-run error rates the way frequentist peeking does. But you still need a pre-committed threshold and a pre-committed maximum window — Bayesian or not.
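
A sketch of that discipline as a daily loop, with the threshold and maximum window fixed up front; the simulated traffic below stands in for whatever your experiment store returns:

```python
import numpy as np

THRESHOLD, MAX_DAYS, DAILY_N = 0.95, 28, 500  # pre-committed before launch
rng = np.random.default_rng(1)
c_conv = c_n = v_conv = v_n = 0

for day in range(1, MAX_DAYS + 1):
    # Simulated daily traffic (true rates 5.0% vs 5.5%); a real run would
    # read cumulative counts from the experiment store instead.
    c_conv += rng.binomial(DAILY_N, 0.050)
    c_n += DAILY_N
    v_conv += rng.binomial(DAILY_N, 0.055)
    v_n += DAILY_N

    post_c = rng.beta(1 + c_conv, 1 + c_n - c_conv, 100_000)
    post_v = rng.beta(1 + v_conv, 1 + v_n - v_conv, 100_000)
    p_better = (post_v > post_c).mean()
    if p_better > THRESHOLD:
        print(f"Day {day}: ship (Pr(variant beats control) = {p_better:.1%})")
        break
else:
    print(f"No decision by day {MAX_DAYS}: hold and review")
```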

Bayesian probability of "variant beats control" is not the same as "this variant works." A 99% probability of beating control by 0.1% is a statistically confident, business-irrelevant lift. The success threshold in the contract is what makes the verdict meaningful — pre-commit "ship if Pr(lift > 5%) > 0.95," not "ship if Pr(variant > control) > 0.95." The latter is the same noise problem frequentist tests have, just dressed in different vocabulary.
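
A sketch of the gap between the two thresholds, on hypothetical counts where the variant is confidently but barely better:

```python
import numpy as np

rng = np.random.default_rng(2)
# 4.8% vs 5.0% conversion at 100k visitors per arm: a real but tiny effect.
post_c = rng.beta(1 + 4_800, 1 + 100_000 - 4_800, 200_000)
post_v = rng.beta(1 + 5_000, 1 + 100_000 - 5_000, 200_000)
lift = (post_v - post_c) / post_c  # relative-lift samples

print(f"Pr(variant > control): {(lift > 0).mean():.1%}")     # clears the 95% bar
print(f"Pr(lift > 5%):         {(lift > 0.05).mean():.1%}")  # far below 95%
```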

// example

A worked example


A pricing-page test runs for two weeks. Bayesian read on day 14: "97% probability variant outperforms control; expected relative lift 12%; 95% credible interval [4%, 19%]." Frequentist read on the same data: "p = 0.018; observed lift 12%; 95% confidence interval [3%, 21%]." Both call ship; the Bayesian framing is easier for the founder reading it on Slack to act on.
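
A sketch reproducing both reads on one set of made-up counts, chosen to land near the quoted figures rather than taken from the actual test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
c_conv, c_n = 750, 15_000   # control: 5.0%
v_conv, v_n = 840, 15_000   # variant: 5.6%, a 12% relative lift

# Bayesian read: posterior draws for each rate, then the relative lift.
post_c = rng.beta(1 + c_conv, 1 + c_n - c_conv, 200_000)
post_v = rng.beta(1 + v_conv, 1 + v_n - v_conv, 200_000)
lift = (post_v - post_c) / post_c
lo, hi = np.percentile(lift, [2.5, 97.5])
print(f"Pr(variant > control) = {(lift > 0).mean():.1%}, "
      f"expected lift = {lift.mean():.1%}, 95% CrI = [{lo:.1%}, {hi:.1%}]")

# Frequentist read: two-proportion z-test on the identical counts.
p_pool = (c_conv + v_conv) / (c_n + v_n)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / c_n + 1 / v_n))
z = (v_conv / v_n - c_conv / c_n) / se
print(f"p = {2 * stats.norm.sf(abs(z)):.3f}, observed lift = "
      f"{(v_conv / v_n) / (c_conv / c_n) - 1:.1%}")
```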

// pitfalls

Common mistakes

  • Picking a flat prior and calling it "objective." A flat prior is not neutral; it implies you believe huge effects are as likely as tiny ones, which is rarely true. Use weakly informative priors based on past tests on the same surface; they encode what you actually know.
  • Stopping at "probability variant beats control > 95%." That threshold catches lifts of 0.1% as wins. Combine it with a practical-significance threshold: "Pr(lift > X%) > 0.95," where X is the smallest lift worth shipping for.
  • Treating Bayesian as immune to bad design. A 100-user test will give you a wide posterior no matter the methodology (see the sketch after this list). Bayesian methods do not rescue underpowered designs; they just present the uncertainty more honestly.
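
On that last point, a quick sketch of what an underpowered read looks like, again with made-up counts:

```python
import numpy as np

rng = np.random.default_rng(4)
post_c = rng.beta(1 + 5, 1 + 100 - 5, 200_000)  # control: 5/100 conversions
post_v = rng.beta(1 + 8, 1 + 100 - 8, 200_000)  # variant: 8/100 conversions
lift = (post_v - post_c) / post_c

lo, hi = np.percentile(lift, [2.5, 97.5])
print(f"Pr(variant > control): {(lift > 0).mean():.0%}")        # nowhere near 95%
print(f"95% credible interval on lift: [{lo:.0%}, {hi:.0%}]")   # enormous spread
```
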
// related

Related terms

Pick a hypothesis. Vocabulary done.

The fastest way to learn this vocabulary is to commit one experiment. The contract takes about five minutes to write.