Harshit Singh
🚀 Advanced Product Management · advanced · 8 min

🧪 A/B Testing: Advanced

Past the 'set up an experiment in Optimizely' phase: power analysis, network effects, sequential testing, and not getting burned by noise.

experimentation · metrics
Why it matters

Most companies run A/B tests, and most of them get burned by underpowered tests, network effects, novelty effects, or peeking. The PMs who understand the statistics and the pitfalls ship real wins; the ones who don't ship hallucinated wins.

The core idea

Real experimentation is a discipline: a clear hypothesis, sufficient power, the right statistical method (frequentist or Bayesian), guardrails for harm, and honesty about what you can and can't conclude. The math matters, but the cultural discipline matters more.

The seven failure modes

1. Underpowered tests. You need enough sample size to detect the effect you care about. Run a power analysis before launch. Most teams skip this and 'learn' nothing from 80% of their tests.

2. Peeking. Looking at results daily and stopping when you see significance is statistical malpractice. Pre-commit to a sample size or use sequential testing methods.

3. Multiple comparisons. Testing 10 metrics at p < 0.05? If they are independent, there is roughly a 40% chance that at least one looks 'significant' by chance alone. Use a Bonferroni correction or commit to a single primary metric.

4. Network effects. Marketplace and social products violate SUTVA, the assumption that one user's treatment does not affect another user's outcome. Use switchback tests or geo splits instead of user-level randomization.

5. Novelty / primacy effects. A new thing gets clicked more just because it's new (novelty), or less because it's unfamiliar (primacy). Run for at least two full cycles of typical user behavior.

6. Heterogeneous treatment effects. Average shows no effect, but power users gain and new users lose. Always slice by segment.

7. Survivorship. Treatment retains a different user mix than control. Compare like-cohorts.
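
Failure mode 3 comes down to a little arithmetic. The sketch below assumes the metrics are independent (real metrics are usually correlated, so treat the numbers as a rough guide). The function names and the p-values are illustrative, not taken from any particular tool.

```python
def familywise_error_rate(alpha: float, n_metrics: int) -> float:
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_metrics


def bonferroni_significant(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Flag a metric as significant only if p < alpha / number of metrics."""
    threshold = alpha / len(p_values)
    return {metric: p < threshold for metric, p in p_values.items()}


# Hypothetical p-values from a test that tracked four metrics.
p_values = {"conversion": 0.004, "revenue": 0.03, "retention": 0.20, "latency": 0.04}

print(round(familywise_error_rate(0.05, 10), 2))  # 0.4
print(bonferroni_significant(p_values))
```

With four metrics the Bonferroni threshold drops to 0.0125, so only the hypothetical conversion result survives. Bonferroni is conservative; a single pre-declared primary metric avoids the correction entirely.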

The disciplined experiment loop

  1. Hypothesis. "We believe X will move Y by Z%, because [theory]."
  2. Power analysis. Sample size required to detect Z% with 80% power, 5% significance.
  3. Pre-registration. Document hypothesis, sample size, success criterion, and guardrails BEFORE launch.
  4. Launch. Random assignment, instrumentation, monitoring.
  5. Don't peek. Wait for the pre-committed sample.
  6. Analyze. Primary metric, guardrails, segment slices.
  7. Decide. Ship, kill, or iterate. Document the learning.
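
Step 5 is easier to defend with a simulation. The sketch below runs A/A tests (no true difference between arms) and compares two policies: stop the first time any daily look crosses p < 0.05, or read the result once at the pre-committed sample. It assumes a unit-variance metric and a simple z-test; the parameters (14 looks, 500 users per arm per look) are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_false_positive_rates(n_experiments=2000, looks=14,
                                  users_per_look=500, alpha=0.05, seed=0):
    """Return (peeking_rate, single_look_rate) for A/A tests with no real effect."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_experiments):
        diff_sum = 0.0   # cumulative (sum of arm A) - (sum of arm B)
        n = 0            # cumulative users per arm
        crossed = False
        z = 0.0
        for _ in range(looks):
            # Difference of two sums of `users_per_look` unit-variance outcomes.
            diff_sum += rng.gauss(0.0, (2 * users_per_look) ** 0.5)
            n += users_per_look
            z = diff_sum / (2 * n) ** 0.5
            if abs(z) > z_crit:
                crossed = True   # a peeker would have stopped and shipped here
        peeking_hits += crossed
        final_hits += abs(z) > z_crit
    return peeking_hits / n_experiments, final_hits / n_experiments


peeking_rate, single_look_rate = simulate_false_positive_rates()
print(f"false positives with daily peeking: {peeking_rate:.1%}")
print(f"false positives with one final look: {single_look_rate:.1%}")
```

The single-look rate should sit near the nominal 5%, while the peeking rate comes out several times higher. Sequential testing methods exist precisely to make repeated looks safe by widening the threshold at each look.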

When to A/B test

  • High-traffic, fast-feedback features (consumer growth, marketing pages).
  • Reversible decisions where the data will be conclusive in 2-4 weeks.

When NOT to A/B test

  • Strategic decisions (rebrand, pricing structure change).
  • Low-traffic features where it would take 6 months to reach significance.
  • Decisions you can't unwind (legal, compliance).
  • Things where the answer is obvious: just ship.

Aakash Gupta has written that even strong teams over-A/B-test. The senior move: distinguish what should be A/B tested from what should just be shipped based on judgment.

Bayesian vs Frequentist

Most companies use frequentist methods (p < 0.05). Bayesian methods are gaining adoption: they let you make direct probabilistic statements ('70% probability variant B is better') and handle early stopping more gracefully, provided the decision rule is fixed before launch. Both work; pick one and be consistent.
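
A statement like '70% probability variant B is better' can be computed directly for a conversion metric. This is a minimal sketch assuming a Beta-Binomial model with a uniform Beta(1, 1) prior and Monte Carlo sampling; the function name and the counts are hypothetical, and production platforms typically use more careful priors and decision rules.

```python
import random


def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 20000, seed: int = 0) -> float:
    """Estimate P(rate_B > rate_A) from posterior samples of each arm's rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior is Beta(1 + successes, 1 + failures) under a uniform prior.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical counts: 100/1000 conversions in A, 130/1000 in B.
print(f"P(B > A) = {prob_b_beats_a(100, 1000, 130, 1000):.2f}")
```

The output is a probability rather than a p-value, which is usually easier to explain to stakeholders. It still needs a threshold agreed in advance (for example, ship only above 95%).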

Key frameworks

Power analysis

Calculate required sample size before launch. Underpowered tests are worse than no tests.
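
As a rough sketch of what that calculation looks like for a conversion-rate metric, the function below uses the standard normal-approximation formula for a two-sided, two-proportion test. The function name and the example rates are assumptions for illustration; a dedicated power calculator may give slightly different numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in EACH arm to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical 10% baseline conversion rate.
print(sample_size_per_arm(0.10, 0.02))   # detect a 2% relative lift
print(sample_size_per_arm(0.10, 0.10))   # detect a 10% relative lift
```

At a 10% baseline, a 2% relative lift needs on the order of 350,000 users per arm, while a 10% relative lift needs roughly 15,000. Required sample size grows with the inverse square of the effect, which is why small lifts on low-traffic features are effectively untestable.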

SUTVA

Stable Unit Treatment Value Assumption: one user's assignment must not affect another user's outcome. Network effects violate it; use switchback or geo-split tests when they are present.
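
In a switchback test the unit of randomization is a time window within a region, not a user, so everyone who interacts in the same market at the same time sees the same arm. A minimal sketch of the assignment step, assuming 30-minute windows and a hash-based coin flip; the function name, salt, and region label are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def switchback_arm(region: str, ts: datetime,
                   window_minutes: int = 30, salt: str = "exp-001") -> str:
    """Assign an arm to a (region, time window) pair, deterministically."""
    window_index = int(ts.timestamp()) // (window_minutes * 60)
    key = f"{salt}:{region}:{window_index}".encode()
    digest = hashlib.sha256(key).digest()
    # The first byte of the hash acts as a reproducible coin flip per window.
    return "treatment" if digest[0] % 2 == 0 else "control"


now = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)
print(switchback_arm("berlin", now))
```

Analysis then compares windows rather than users, so the effective sample size is the number of windows. That is much smaller than the user count and should feed into the power analysis.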

Pre-registration

Document hypothesis and analysis plan before launch. Prevents post-hoc rationalization.
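
One lightweight way to make a pre-registration hard to quietly revise is to record it as an immutable object checked into version control before launch. This is only a sketch: the class, field names, and example values are hypothetical, not a standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class PreRegistration:
    hypothesis: str
    primary_metric: str
    min_detectable_lift: float       # relative lift the test is powered for
    sample_size_per_arm: int
    alpha: float = 0.05
    power: float = 0.80
    guardrail_metrics: tuple[str, ...] = ()


plan = PreRegistration(
    hypothesis="We believe a shorter signup form will raise signup conversion by 5%, "
               "because fewer fields mean less drop-off.",
    primary_metric="signup_conversion",
    min_detectable_lift=0.05,
    sample_size_per_arm=60_000,
    guardrail_metrics=("retention_d7", "page_latency_p95"),
)
print(plan)
```

The analysis step can then read its success criterion from `plan` instead of from whatever looks best after the data arrives.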

Real-world examples

Netflix
Experimentation as core competency

Netflix runs hundreds of A/B tests concurrently on a custom platform. Every product change, even thumbnail images, is tested. Their disciplined approach to sample size, guardrails, and reading results has been a competitive advantage for over a decade.

Booking.com
1000+ concurrent tests

Booking famously runs 1000+ concurrent A/B tests. Their PMs are deeply experimental. The discipline produces small but compounding wins; the cumulative effect over years has been billions in revenue.

Interview questions (2)

Q1
Your team ran an A/B test that showed a 2% lift, p < 0.05. The PM wants to ship. What do you check?
metrics · senior

Five checks before shipping:

  1. Was it adequately powered? A 2% lift detected on a small sample is suspect. Check the MDE (minimum detectable effect) and whether the sample matched the power analysis.
  2. Was there peeking? Did anyone look at results daily and call it when significance was hit? If yes, the reported p-value understates the false-positive risk; the real evidence is weaker than p < 0.05 suggests.
  3. What are the guardrails? Did the test harm any other metric (revenue, retention, latency)? A 2% lift on engagement at the cost of 1% on retention is usually a bad trade.
  4. Segment slices. Does the 2% lift hold across major segments, or is a single segment driving it? Heterogeneous effects often mean the average lies.
  5. Novelty vs persistent. Did the lift fade in weeks 2-3? Many launches show novelty bumps that don't sustain.

If those five checks pass, ship with confidence. If two or more fail, hold and dig deeper: a hallucinated win shipped is much worse than a real win delayed by a week.
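
The segment-slice check is simple to run once results are broken out by segment and arm. The sketch below uses made-up counts chosen so that the pooled lift is zero while the segments move in opposite directions; the function name and row layout are assumptions.

```python
def lift_by_segment(rows):
    """rows: iterable of (segment, arm, users, conversions) tuples.

    Returns the relative lift of treatment over control for each segment.
    """
    rates = {}
    for segment, arm, users, conversions in rows:
        rates.setdefault(segment, {})[arm] = conversions / users
    return {
        segment: (r["treatment"] - r["control"]) / r["control"]
        for segment, r in rates.items()
    }


# Hypothetical results: power users gain, new users lose.
rows = [
    ("power_users", "control", 1000, 200),
    ("power_users", "treatment", 1000, 220),
    ("new_users", "control", 1000, 100),
    ("new_users", "treatment", 1000, 80),
]

for segment, lift in lift_by_segment(rows).items():
    print(f"{segment}: {lift:+.0%}")
```

Both arms convert 300 of 2,000 users overall, so the topline shows no effect even though power users are up 10% and new users are down 20%. Slices should be chosen before launch; slicing after the fact until something looks significant is the multiple-comparisons problem again.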

Q2
When would you NOT A/B test something?
metrics · senior

Five cases:

  1. Strategic decisions. Brand, pricing structure, market positioning: these aren't well-served by A/B tests. Use judgment, qualitative research, and competitive analysis instead.
  2. Low-traffic features. If reaching significance would take 6 months, the cost of the test exceeds the value. Ship and watch.
  3. Irreversible decisions. Legal, compliance, anything where you can't roll back the 'losing' arm.
  4. Network-effect-heavy products without proper test infrastructure. SUTVA violations make standard A/B tests misleading. Use switchback or geo tests.
  5. Obvious decisions. If you'd ship the change regardless of test result, the test is theater. Just ship.

The senior move is recognizing that A/B testing is a tool with diminishing returns once you've tested the obvious things. Use it for moderate-effect decisions on high-traffic features; use judgment for everything else.