// Foundations
// Marketing experiment
A marketing decision run as a test, with a hypothesis, a metric to move, a kill threshold that ends it early, and an end date that ends it on time.
A marketing experiment is the unit of marketing work that has a contract attached. Instead of "let us try this and see," you write down what you are testing, what you are measuring, what would kill it, and when you will decide.
It is broader than an A/B test. An A/B test is a marketing experiment whose surface is a single page. A pricing change, an outbound rewrite, a paid-channel shift, and an onboarding step are all marketing experiments. The shape stays the same: hypothesis, metric, thresholds, dates, verdict.
// EXAMPLE
If we cut the LinkedIn cold DM from three paragraphs to one line, the reply rate will rise from 7% to 15% or higher within 40 sends.
// Experiment OS
A system that captures every marketing decision as an experiment with a contract and a verdict, and stores the result so the next decision builds on the last.
An experiment OS is the operating system around marketing experiments: the place where contracts are committed, metrics are tracked against thresholds, verdicts are decided, and the archive accumulates.
The category exists because most marketing teams already do experiments, but they do them as one-offs. An experiment OS makes the work compound: every test ends in a verdict, every verdict is saved, and the archive becomes the asset the team learns from.
// Marketing experimentation
The practice of running marketing decisions as experiments rather than as opinions. The discipline of testing, measuring, and either shipping or killing each idea.
Marketing experimentation is the practice. The experiment OS is the system that supports it. The marketing experiment is the unit of work.
A team practices marketing experimentation when every nontrivial idea passes through the contract: hypothesis, metric, thresholds, dates, verdict. The opposite of marketing experimentation is opinion-driven marketing, which is what most teams default to.
// ICE
A back-of-envelope prioritization framework for experiment ideas: rate each on Impact, Confidence, and Ease from 1 to 10, average or multiply the three, sort by score. Picks the next experiment to run.
ICE is the lightest-weight way to triage an experiment backlog. Impact is the size of the win if the test succeeds. Confidence is how sure you are the test will succeed. Ease is the inverse of effort — how cheap is it to run? Score each of the three from 1 to 10, combine them, and the top of the list is the next thing to test.
ICE is not a precision instrument; it is a forcing function. The act of scoring 20 ideas makes you notice that half of them are low-impact, half of the rest are low-confidence, and exactly two are both high-impact and easy. The value is making the comparison explicit so the loudest stakeholder does not just pick.
// EXAMPLE"Test a new pricing page": Impact 9, Confidence 5, Ease 6 → score 6.7. "Test a new email subject line": Impact 3, Confidence 8, Ease 9 → score 6.7. Roughly equivalent priority, so run whichever is fresher — the pricing test wins on lift potential, the email wins on velocity.
// North star metric
The single metric a company commits to as the leading indicator of long-term value delivered to customers. Marketing experiments either move it, leave it, or hurt it — every result reads against that one number.
A north star metric is the one number a team agrees captures the value the product creates for users. It is leading (it predicts revenue, not just records it), measurable in the short term, and durable enough that it stays meaningful for years. Spotify's north star is hours listened. Airbnb's is nights booked. Slack's is messages sent in a single team. The shape of the metric reveals what the company believes growth actually means.
For an experiment program, the north star is the gravitational center. Every test either moves it, is neutral against it, or pulls against it — and the verdict has to read against that fact. A 20% lift in signups that does not move the north star is not necessarily a win; it might just be acquiring people who never deliver value. Pre-committing the north star in the experiment contract turns "did this work" into a real question with a real answer instead of a meeting about which dashboard slice to celebrate.
// EXAMPLE
A B2B SaaS picks "weekly active workspaces with at least three connected integrations" as the north star — it is a leading indicator of paid retention and uniquely measurable. A landing-page test that lifts signups 30% but does not move that number after 60 days lands as a verdict of "no effect" in the archive, even though the headline lift was real.
// Growth loop
A reinforcing system where the output of one cycle becomes the input of the next — new users generate content, links, or referrals that bring in more new users. Replaces the leaky-funnel model with a flywheel.
A growth loop is a reinforcing cycle: an action by a user produces an output (content, a referral, a link, an integration) that pulls a new user in, who then produces more of the same output. Pinterest's loop: new users save pins → pins rank in search → searchers discover Pinterest → some sign up and save pins. The loop is closed; it does not run out as long as users keep doing the action.
Loops differ from funnels in a structural way. A funnel is one-shot: traffic arrives, some converts, the rest is lost forever. A loop is compounding: every retained user is also an acquisition channel for the next cohort. The implication for experiments is huge — a test that improves a loop step (e.g., the rate at which new users invite teammates) compounds across cohorts in a way a funnel-step test cannot. Pick experiments that move the loop's bottleneck, not the funnel's widest spot.
// EXAMPLE
A B2B tool with a sharing-driven loop tests two changes. Change A lifts signup conversion by 20% (funnel-step). Change B lifts the rate at which new users invite at least one teammate from 30% to 36% (loop-step). At month one the funnel-step lift looks bigger; by month six the loop-step lift has compounded into 2-3x the cumulative new-user count, because every additional invite produces more signups in the next cycle.
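A toy simulation of the compounding difference, in Python. The parameters are invented and real loops are messier; how fast the loop-step lift compounds depends on the loop's viral factor (invite rate times signups per invite):

```python
def cumulative_signups(months, base_signups, invite_rate, signups_per_invite=3.0):
    """One-step growth loop: this month's new users send invites that
    convert into extra signups next month, on top of the baseline flow."""
    new, total = base_signups, 0
    for _ in range(months):
        total += new
        new = base_signups + new * invite_rate * signups_per_invite
    return total

baseline = cumulative_signups(6, base_signups=1000, invite_rate=0.30)
funnel   = cumulative_signups(6, base_signups=1200, invite_rate=0.30)  # +20% conversion
loop     = cumulative_signups(6, base_signups=1000, invite_rate=0.36)  # 30% -> 36% invites

print(baseline, funnel, loop)  # the loop-step lift widens every cycle
```

The funnel-step change stays a constant multiplier forever; the loop-step change feeds back into every future cohort, so its lead over the funnel-step change grows month by month.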
// Multivariate test
An experiment that tests multiple variables at once in all combinations — e.g., 2 headlines × 3 button colors × 2 layouts = 12 variants. Reveals interaction effects an A/B test cannot, but demands much more traffic.
A multivariate test runs all combinations of multiple variables in parallel. With three variables tested at two levels each, the test has 2 × 2 × 2 = 8 variants. The math lets you measure not just the main effect of each variable, but also interactions — whether the green button works better with headline A and the orange button works better with headline B. A/B testing one variable at a time cannot find those interactions.
The catch is sample size. An A/B test that needs 1,000 visitors per variant for adequate power needs roughly 8,000 for an 8-cell MVT — and that is just for the main effects, never mind detecting interactions, which require dramatically more traffic. For most founders, multivariate tests are mathematically out of reach. The right call is usually to A/B test sequentially: ship the headline winner, then test button color on top of it, accepting that you might miss interactions in exchange for verdicts that actually arrive.
// EXAMPLE
A landing-page MVT tests 2 headlines × 2 hero images × 3 CTA variants = 12 cells. To detect a 20% relative lift on a 4% baseline at 80% power, each cell needs roughly 7,000 visitors — 84,000 total. The team has 5,000 visitors per week, so the test takes 16+ weeks. Verdict: redesign as three sequential A/B tests on the highest-leverage variable, rather than waiting four months for an MVT that may still be underpowered for interactions.
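The traffic math is easy to check with the standard normal-approximation formula. A sketch in Python; published calculators differ by tens of percent depending on one- versus two-sided tests and pooling assumptions, so treat the output as an order-of-magnitude read:

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_variant(p_base, rel_lift, alpha=0.05, power=0.80):
    """Two-proportion sample size per arm (normal approximation, two-sided)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p2 = p_base * (1 + rel_lift)
    p_bar = (p_base + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p_base * (1 - p_base) + p2 * (1 - p2))) ** 2
    return ceil(num / (p2 - p_base) ** 2)

cells = 2 * 2 * 3                      # headlines x hero images x CTAs
n = n_per_variant(0.04, 0.20)          # 4% baseline, 20% relative MDE
print(cells, n, cells * n)             # cells, per-cell n, total traffic
```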
// The contract
// Experiment contract
The agreement, written before an experiment runs, that says what is being tested, how it will be measured, what kills it, and when it ends.
An experiment contract is a four-part artifact: a hypothesis, a metric, a kill threshold paired with a success threshold, and an end date. Once the experiment starts, the contract does not change.
The contract is what stops drift. Without it, the team rewrites success criteria as the data arrives. With it, the rules decide the verdict, not the room.
// Marketing hypothesis
A clear statement of what change you are making, what you expect to move, and the size of the move you would call a win, written before the experiment starts.
A useful marketing hypothesis has three parts: the change ("if we cut the cold DM to one line"), the predicted effect ("reply rate rises from 7% to 15% or higher"), and the time or volume window ("within 40 sends"). Each part is testable.
Hypotheses written without all three parts collapse into vibes. "Maybe content will work" is not a hypothesis. "If we publish two long-form posts a week for one quarter, organic signups will rise from 50 to 120 a month" is.
// EXAMPLE
If the landing page leads with one clear promise instead of five features, signup rate will rise from 10% to 12% or higher across 1,000 visits.
// Kill threshold
The metric value that ends an experiment early. You set it up front so you do not negotiate with yourself when the numbers come in.
A kill threshold is committed in the contract before the experiment runs. If the metric drops below it (for higher-better metrics) or above it (for lower-better metrics), the experiment ends and the verdict is kill.
The kill threshold is not a hope. It is a pre-commitment. Its job is to make the decision automatic when the data is bad, so the team does not spend a quarter arguing about whether to keep going.
// EXAMPLE
For a paid-channel test where blended CAC is the metric, the kill threshold might be $210 against a control of $185. If CAC clears $210 mid-flight, the experiment is killed and the spend reverts.
// Success threshold
The metric value above which the experiment is shipped. Set up front, alongside the kill threshold, so the verdict is decided by the rules and not by the room.
A success threshold is the inverse of a kill threshold. It is the value the metric must hit (or beat, depending on direction) by the end date for the verdict to be ship.
Setting both thresholds in the contract is what makes the verdict automatic. Hit the success threshold, ship. Hit the kill threshold, kill. Land between them, review with whatever you learned.
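The rule is mechanical enough to write down as code. A minimal sketch in Python with illustrative numbers; the second call echoes the CAC figures from the kill-threshold entry above:

```python
def verdict(metric, success, kill, higher_is_better=True):
    """Ship/kill/review decided by the contract, not the room."""
    if not higher_is_better:             # e.g., CAC: lower is better, so flip signs
        metric, success, kill = -metric, -success, -kill
    if metric >= success:
        return "ship"
    if metric <= kill:
        return "kill"
    return "review"

print(verdict(0.16, success=0.15, kill=0.05))                       # reply rate: ship
print(verdict(195, success=175, kill=210, higher_is_better=False))  # CAC: review
```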
// Hypothesis floor
The minimum measurable change you commit to before the experiment starts. The number below which a result is not interesting.
The hypothesis floor is the smallest move on the metric that justifies the cost of running the experiment in the first place. It is a stricter version of the success threshold: instead of "the line above which we ship," it is "the line below which we did not learn anything new."
Setting a hypothesis floor up front prevents the most common marketing failure mode: shipping a 1.2% improvement that took six weeks of work and calling it a win.
// Minimum detectable effect (MDE)
The smallest change in a metric that an experiment can reliably detect, given your sample size, baseline rate, and statistical confidence. A constraint to set before launch — not a result that drops out after.
MDE answers a precise question: "If I run this experiment, how big does the effect have to be for me to actually see it in the data?" It depends on three inputs you control before launch — the baseline conversion rate, the sample size per variant, and the statistical power and significance you require. Plug those into a sample-size calculator and the MDE drops out.
For founder-scale traffic, MDE is usually the punchline of a stats conversation. With a 5% baseline conversion rate and 1,000 visitors per variant, the smallest lift you can reliably detect is around 30 to 40% relative — anything smaller will look like noise. That is why most founder A/B tests never reach significance. MDE makes the impossibility legible up front, before you waste a quarter on a test that mathematically cannot conclude.
// EXAMPLE
At a 4% baseline conversion rate, 1,000 visitors per variant, 80% power, and 95% confidence, the MDE is roughly 50% relative — meaning the new variant has to convert at 6% or higher for the test to call it a real win. If the change you are testing might plausibly deliver 10%, the test is theatre.
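A back-of-envelope MDE calculation in Python. This simplified formula (same variance assumed in both arms, two-sided test) returns about 60% relative for the example's inputs; the entry's "roughly 50%" comes from a slightly different approximation, and the punchline is the same either way:

```python
from math import sqrt
from statistics import NormalDist

def mde_relative(p_base, n_per_variant, alpha=0.05, power=0.80):
    """Smallest relative lift reliably detectable (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    d_abs = z * sqrt(2 * p_base * (1 - p_base) / n_per_variant)
    return d_abs / p_base

print(f"{mde_relative(0.04, 1000):.0%}")   # ~61% relative at 1,000 per variant
print(f"{mde_relative(0.04, 10000):.0%}")  # ~19% with ten times the traffic
```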
// Guardrail metric
A secondary metric you commit to monitoring during an experiment to make sure the variant is not winning on the primary metric by breaking something else. The "do no harm" check on a contract.
A guardrail metric answers "what could this experiment break that I want to make sure I catch?" Every contract has a primary metric — the number the experiment is trying to move. Guardrails are the other numbers you commit to watching to make sure the win is real and not coming at the expense of revenue, retention, or experience.
Guardrails are most useful when the primary metric is upstream of business outcomes. A signup-rate test should guardrail downstream activation. A price-cut test should guardrail revenue per visitor. An outbound-volume test should guardrail reply quality and unsubscribe rate. Without guardrails, a team can ship a string of "wins" that move the headline number while quietly tanking the business.
// EXAMPLE
A pricing experiment that cuts the monthly price by 30% might lift signup rate (primary metric) by 50%. The contract's guardrail metric — revenue per visitor — would have to stay flat or improve, or the "win" is a loss in disguise. If revenue per visitor drops more than 5%, the verdict is kill regardless of the signup-rate lift.
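The guardrail check composes with the verdict rule. A sketch in Python, with invented tolerances matching the pricing example:

```python
def guarded_verdict(primary_lift, guardrail_change,
                    primary_floor=0.0, guardrail_floor=-0.05):
    """A guardrail breach overrides any primary-metric win."""
    if guardrail_change < guardrail_floor:
        return "kill"
    return "ship" if primary_lift >= primary_floor else "review"

# +50% signups, but revenue per visitor down 8%: the "win" is a loss.
print(guarded_verdict(primary_lift=0.50, guardrail_change=-0.08))  # kill
```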
// Holdout
A group of users deliberately excluded from a treatment so the team can measure long-term incremental impact. The control variant for experiments whose effects only show up over weeks or months.
A holdout is a randomly assigned slice of the audience that does not receive a treatment, kept separate so that long-term effects can be measured against a clean baseline. Standard A/B tests run for days or weeks; holdouts run for months or quarters. The point is to answer "did this campaign / this feature / this re-engagement series actually generate incremental value, or would those users have converted anyway?"
Holdouts are how mature growth teams measure things that A/B tests cannot — the cumulative effect of a year of email re-engagement, the lifetime impact of a loyalty program, the decay rate of brand-marketing campaigns. The key discipline is treating the holdout as sacred: do not shrink it because you are leaving "free revenue on the table," do not contaminate it with messaging through other channels, and do not end it early when the directional read goes the way you hoped.
// EXAMPLE
A consumer-app team holds 5% of eligible users out of all push-notification re-engagement for six months. At month six, the held-out cohort's 90-day retention is 8 percentage points lower than the treated cohort's — meaning the push program drove an incremental 8 points of retention, worth quantifiable lifetime revenue. Without the holdout, the team would have argued forever about whether push was working or whether it was just nagging engaged users.
// Statistical power
The probability that an experiment will detect an effect of a given size, if that effect actually exists. Conventionally set at 80%, meaning a properly powered test catches 4 out of 5 real wins.
Statistical power is the flip side of false negatives. If false-positive rate (α) is "how often we call a win that is not there," power (1 − β) is "how often we catch a win that is there." Set α at 5% and power at 80% — the convention — and a properly designed test will, on average, miss one in five real effects of the target size. That is not a bug; it is the cost-balanced outcome of how much sample size you can afford.
Power, sample size, MDE, and α are linked by a single equation. Move one and the others must move. Want higher power? Need more sample size, or a bigger MDE, or looser α. Most founder experiments are quietly underpowered — the team plugs in 95% confidence and a 5% MDE, sees the calculator demand 50,000 visitors per variant, and runs the test anyway with 2,000 because that is all they have. The result is a test that, even if the change works, has a 50%+ chance of returning a non-significant result.
// EXAMPLE
A test designed for 80% power, 5% α, 4% baseline conversion, and a 20% relative MDE requires roughly 7,000 visitors per variant. Running the same test with 2,000 visitors per variant cuts the actual power to roughly 30% — the test will miss 70% of real 20% lifts. The math says: do not bother running until you can afford the sample, or accept that a "no significant difference" verdict means "we have no idea," not "the change did not work."
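The achieved-power arithmetic is a one-liner once the z-scores are in hand. A normal-approximation sketch in Python; it puts the two scenarios somewhat below the entry's round figures (about 64% at 7,000 per variant, about 23% at 2,000), which is typical of how much different approximations disagree:

```python
from math import sqrt
from statistics import NormalDist

def achieved_power(p_base, rel_lift, n_per_variant, alpha=0.05):
    """Power of a two-sided two-proportion z-test (normal approximation)."""
    p2 = p_base * (1 + rel_lift)
    se = sqrt((p_base * (1 - p_base) + p2 * (1 - p2)) / n_per_variant)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_crit - (p2 - p_base) / se)

print(f"{achieved_power(0.04, 0.20, 7000):.0%}")  # at the designed sample size
print(f"{achieved_power(0.04, 0.20, 2000):.0%}")  # with the traffic you actually have
```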
// Outputs
// Verdict
The outcome of an experiment after the end date or a threshold is hit. Three options: ship, kill, or review.
A verdict is not an opinion. It is what the contract says the result is. Ship if the success threshold was met. Kill if the kill threshold was hit, or if the end date arrived without clearing the success threshold. Review if the data is ambiguous and the contract did not anticipate the case.
The verdict is what makes the experiment compound. Each verdict (ship or kill) becomes a row in the archive that the next experiment can build on.
// Experiment archive
The record of every experiment a team has run, with the contract, the metric movement, and the verdict. The compounding asset of an experiment-driven team.
The experiment archive is the long-term output of running marketing as experiments. Each entry is a contract plus a verdict plus the evidence behind it. Wins are reusable. Failures are reusable.
A team without an archive runs the same experiment twice every eighteen months. A team with an archive looks up "did we already test this?" before running the next one, and either skips it or designs a sharper version.
// Statistical significance
The probability that the difference between your experiment's variant and control is not just random noise. Conventionally measured by a p-value below 0.05 — but the threshold is a choice, not a law.
Statistical significance asks a precise question: "If the variant and control were actually identical, how often would I see a difference at least this big just by chance?" A p-value of 0.04 means 4% of the time. The 5% threshold (p < 0.05) is convention, not physics. Some teams pick 1% for high-stakes decisions and 10% for low-stakes reversible ones.
Statistical significance does not mean the result is "real" — it means the result is "unlikely under the null hypothesis." Those are different. A statistically significant lift of 0.3% might still be operationally meaningless. A non-significant lift of 12% might still be the best thing you will learn this quarter. A real verdict needs both the math (significance) and the business floor (the success threshold in your contract).
// EXAMPLE
An A/B test on a checkout page shows a 6% lift in conversions with a p-value of 0.03. Statistical significance: yes (p < 0.05). The verdict still requires comparing the 6% lift against the success threshold set in the contract — if you committed to a 5% floor, ship; if you committed to 10%, the verdict is review.
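The p-value in the example can be reproduced with a standard two-proportion z-test. A Python sketch with hypothetical counts picked so a 6% relative lift on a 4% baseline lands near p = 0.03:

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: p-value for the observed difference."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pool * (1 - pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Control: 2,600/65,000 (4.0%). Variant: 2,756/65,000 (4.24%, +6% relative).
print(round(two_sided_p(2_600, 65_000, 2_756, 65_000), 3))  # ~0.03
```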
// Lift
The change in a metric caused by an experiment, expressed as either an absolute difference (e.g., +1.5 percentage points) or a relative percentage (e.g., +20% over control). The headline number a verdict is built on.
Lift answers "did this work, and by how much?" Absolute lift is the raw difference — a control at 4% conversion and a variant at 5% has an absolute lift of +1 percentage point. Relative lift is that difference expressed as a percentage of control: same numbers, +25% relative lift. They sound very different, and people mix them up constantly in launch announcements.
Lift is meaningful only against a baseline and a success threshold. A 20% relative lift on a metric that started at 0.2% conversion is +0.04 percentage points absolute — a fact that matters when you compute revenue impact. The number alone is a marketing claim; the number anchored to absolute units and to pre-committed thresholds is a verdict.
// EXAMPLE
An email subject-line test shows the variant got a 22% open rate vs. control's 18%. Relative lift: +22%, computed as (22 − 18) / 18. Absolute lift: +4 percentage points. Whether the verdict is ship depends on whether the success threshold in the contract was set in relative or absolute terms — pre-commit which, or the verdict turns into a vocabulary argument.
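The two lift figures come from one subtraction and one division. A minimal sketch using the subject-line numbers:

```python
def lift(control_rate, variant_rate):
    absolute = variant_rate - control_rate   # in percentage points
    relative = absolute / control_rate       # as a fraction of control
    return absolute, relative

abs_lift, rel_lift = lift(0.18, 0.22)
print(f"+{abs_lift * 100:.0f} pp absolute, +{rel_lift:.0%} relative")  # +4 pp, +22%
```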
// Activation rate
The percentage of new signups who reach a defined "first value" milestone within a target window. The single number that separates "got the user to sign up" from "got the user to actually use it."
Activation rate measures how many users actually experience the product's value, not just how many created an account. The activation moment is product-specific — for Slack, "team sent 2,000 messages." For Dropbox, "user uploaded one file from a desktop client." For Twitter, the classic answer: "user follows 30 accounts." The choice of activation event determines what your funnel is actually optimizing toward.
Activation matters more than signup conversion at most stages. A page that doubles signups but halves activation has just made the funnel worse — twice as many signups, half as many users who reach value, the absolute number of activated users is roughly unchanged but operating costs doubled. Pre-commit activation as a guardrail on every signup-flow experiment, or you will ship a long string of "wins" that do not actually grow the active user base.
// EXAMPLE
A B2B SaaS defines activation as "added a teammate AND completed one workflow within 7 days." Baseline: 22% of signups activate. A redesigned onboarding tested in an experiment lifts activation to 28% — a +27% relative improvement on the metric that actually predicts paid retention. The signup conversion barely moved; the verdict is ship anyway because activation is the leading indicator that matters.
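The arithmetic behind "doubling signups while halving activation grows nothing" is worth seeing once. Hypothetical numbers in Python:

```python
visitors = 10_000

# Before: 5% signup rate, 22% of signups activate.
activated_before = visitors * 0.05 * 0.22
# After a "winning" page: 10% signup rate, but activation halves to 11%.
activated_after = visitors * 0.10 * 0.11

print(activated_before, activated_after)  # 110.0 vs 110.0: no real growth
```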
// Cohort
A group of users sharing a common starting characteristic — usually signup week, acquisition channel, or pricing tier — analyzed together over time. The unit of analysis that lets you see whether a change actually changed behavior.
A cohort is a set of users who share something material — typically the week they signed up, the channel they came through, or the pricing plan they started on. Grouping users into cohorts lets you ask questions like "did the November signups behave differently from the October signups?" — questions that are invisible if you look at the user base in aggregate, because new and old users always blend into a misleading average.
For experiments, cohort analysis is the difference between "the metric went up" and "the metric went up because of what we shipped." A retention chart on aggregate users will look better any time you ramp acquisition (more young users in the mix) and worse any time you slow acquisition. Cohort retention curves separate the two: each cohort is its own trajectory, and a real product improvement is visible as later cohorts retaining better than earlier ones at the same week-N.
// EXAMPLE
A pricing-page change goes live in week 14. Aggregate retention looks flat. The cohort view shows week-14-and-later cohorts retaining 6 percentage points higher at day-30 than week-13-and-earlier cohorts. The aggregate flatness was a mix effect — earlier weak cohorts dragging the average down — and the pricing-page change actually worked. The ship verdict was already true; the cohort view made it visible.
// False positive
An experiment result that calls a real win when no real effect exists — the variant got lucky in this draw of users. Controlled by the significance threshold (α); the cost of being wrong determines how strict to set it.
A false positive — also called a Type I error — is when an experiment crosses the statistical-significance threshold even though the true effect is zero. Random variation in user behavior produces some noise; a chunk of that noise will, by chance, look like signal. The 5% (p < 0.05) convention means: if the true effect were zero, the test would call it a "win" 5% of the time anyway. That 5% is your false-positive rate by construction.
False positives are not rare freak events; they are baked into the math. If a team runs 20 A/B tests on changes that secretly do nothing, on average one will declare a "winning" lift just by luck. Multiply that across a year of experiments and false-positive ships explain a noticeable share of the "we shipped it but the metric never moved" story. Pre-commit the significance threshold in the contract; do not negotiate it after the data arrives.
// EXAMPLE
A team runs 30 marketing experiments in a year, each with α = 0.05 (5% false-positive rate). Even if half of those experiments are testing changes with no real effect, statistically they will see roughly 0.05 × 15 = 0.75 false "wins" — call it one shipped change per year that "worked" only because the noise happened to fall the right way. The fix is not eliminating false positives (impossible) but pricing them in: a lower α for high-stakes decisions, a looser α for cheap reversible ones.
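The 5%-by-construction claim is easy to verify with an A/A simulation, where both arms draw from the same true rate, so any "significant" result is a false positive. A Python sketch:

```python
import random
from math import sqrt
from statistics import NormalDist

def aa_test_significant(n=2000, p=0.04, alpha=0.05):
    """One A/A test: identical arms, so 'significant' means a false positive."""
    a = sum(random.random() < p for _ in range(n))
    b = sum(random.random() < p for _ in range(n))
    pool = (a + b) / (2 * n)
    se = sqrt(pool * (1 - pool) * (2 / n))
    z = abs(a / n - b / n) / se
    return 2 * (1 - NormalDist().cdf(z)) < alpha

random.seed(0)
trials = 500
print(sum(aa_test_significant() for _ in range(trials)) / trials)  # hovers near 0.05
```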
// Bayesian A/B test
An A/B test analyzed with Bayesian methods — outputs the probability that variant beats control rather than a p-value. Easier to read, supports peeking, and aligns with the question business stakeholders actually ask.
A Bayesian A/B test answers a different question than the frequentist version. The frequentist asks "if there were no real effect, how often would I see data this extreme?" (the p-value). The Bayesian asks "given the data I have, what is the probability the variant beats control?" The second question is the one stakeholders actually ask, which is part of why Bayesian methods have become the default in newer experimentation platforms.
Practically, a Bayesian test produces statements like "94% probability variant beats control, expected lift 8%, 95% credible interval [3%, 13%]." Those map cleanly to business decisions: ship if the probability of being better is above some threshold (often 95%), wait otherwise. The math handles peeking gracefully — checking results daily does not inflate the false-positive rate the way it does in frequentist tests — though it still rewards picking the read window before launch.
// EXAMPLE
A pricing-page test runs for two weeks. Bayesian read on day 14: "97% probability variant outperforms control; expected relative lift 12%; 95% credible interval [4%, 19%]." Frequentist read on the same data: "p = 0.018; observed lift 12%; 95% confidence interval [3%, 21%]." Both call ship; the Bayesian framing is easier for the founder reading it on Slack to act on.
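The "probability variant beats control" number drops out of a Beta-Binomial model with a few lines of Monte Carlo. A sketch with hypothetical counts (a 12% relative lift on a 4% baseline) under uniform Beta(1, 1) priors; with these counts the probability lands around 95%:

```python
import random

def p_variant_beats_control(conv_a, n_a, conv_b, n_b, draws=100_000):
    """Monte Carlo over Beta(1 + conversions, 1 + non-conversions) posteriors."""
    wins = 0
    for _ in range(draws):
        theta_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        theta_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += theta_b > theta_a
    return wins / draws

random.seed(0)
# Control: 400/10,000 (4.0%). Variant: 448/10,000 (4.48%).
print(p_variant_beats_control(400, 10_000, 448, 10_000))  # ~0.95
```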
// Frequentist statistics
The dominant statistical framework for A/B testing — answers "if there were no effect, how often would I see data this extreme?" via p-values and confidence intervals. The framework most legacy experimentation platforms are built on.
Frequentist statistics treats probability as the long-run frequency of an outcome over many imagined repetitions of an experiment. A p-value of 0.04 means "if the true effect were zero, I would see data at least this extreme 4% of the time across infinite repetitions of this test." The framework underpins p-values, confidence intervals, hypothesis testing, and the entire vocabulary of significance that most teams learned first.
The frequentist framing is rigorous but counterintuitive — most stakeholders hearing "p = 0.04" think it means "96% chance the variant works," which it does not. It means "if there were no effect, this result would be a 1-in-25 fluke." The gap between the two interpretations causes most of the confusion in experiment review meetings. Frequentist methods are still the institutional default in academic research, regulated industries, and older experimentation platforms; the Bayesian alternative has been gaining ground in newer tools.
// EXAMPLE
A landing-page test runs for two weeks. Frequentist read: "p = 0.03, observed lift 6%, 95% confidence interval [1.2%, 10.8%]." That means: assuming the true effect is zero, this result would happen 3% of the time; the observed lift is 6%; if you ran this test infinite times, 95% of the resulting confidence intervals would contain the true effect. Operationally it translates to ship if your contract committed to shipping on p < 0.05, but the interpretation requires care.