Power · SRM · CUPED · Benjamini-Hochberg · Decision doc

Most A/B-test portfolios stop at a t-test.
This one walks the full lifecycle.

Pre-experiment design. Mid-experiment guardrails. Post-experiment analysis with variance reduction. A written ship/no-ship memo.

+9.4% relative conversion lift
p = 0.0068 · SRM passes · 0/8 segments after BH

Sample

60,000 users · 2 arms

Decision

SHIP · ~$984k annualized


Conversion

5.49% → 6.01%

p = 0.0068

Revenue/visitor

+$0.225

p = 0.0097

SRM chi²

p = 0.215

assignment sound

BH segments

0 / 8

significant after correction

The scenario

Grey button → green button. Did it move conversion?

Decision rule, written before the experiment, not after: ship if the conversion lift is statistically significant at α = 0.05, practically significant at ≥ +5% relative, and revenue per visitor is not statistically worse.
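A minimal sketch of how that rule could be encoded so it cannot drift after the data arrives. The `Readout` class and `should_ship` function are hypothetical names, not taken from the project, and reading "not statistically worse" as "not both negative and significant at α" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Readout:
    """Hypothetical container for the numbers the decision rule needs."""
    control_cvr: float    # control conversion rate, e.g. 0.0549
    treatment_cvr: float  # treatment conversion rate, e.g. 0.0601
    p_conversion: float   # two-sided p-value for the conversion difference
    rpv_diff: float       # treatment minus control revenue per visitor
    p_rpv: float          # two-sided p-value for the revenue difference


def should_ship(r: Readout, alpha: float = 0.05, min_rel_lift: float = 0.05) -> bool:
    rel_lift = (r.treatment_cvr - r.control_cvr) / r.control_cvr
    # Statistically significant and in the positive direction.
    stat_sig = r.p_conversion < alpha and rel_lift > 0
    # Practically significant: at least +5% relative by default.
    practical = rel_lift >= min_rel_lift
    # Assumption: "statistically worse" means a negative revenue
    # difference that is itself significant at alpha.
    rpv_ok = not (r.rpv_diff < 0 and r.p_rpv < alpha)
    return stat_sig and practical and rpv_ok


# The headline numbers reported on this page.
reported = Readout(0.0549, 0.0601, 0.0068, 0.225, 0.0097)
decision = "SHIP" if should_ship(reported) else "NO-SHIP"
```

With the figures reported here, all three conditions hold, which matches the SHIP call.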

Everyone runs the t-test. Almost nobody runs the sample-ratio-mismatch chi-square or applies Benjamini-Hochberg to segment breakouts. Skipping those is how you ship a broken experiment and call it a win.
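For a two-arm test the SRM check is a one-line chi-square goodness-of-fit against the intended split. The sketch below uses only the standard library; `srm_p_value` is a hypothetical name, and the counts in the example are invented for illustration, not the project's actual assignment counts.

```python
import math


def srm_p_value(n_control: int, n_treatment: int, expected_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-arm split (1 degree of freedom)."""
    total = n_control + n_treatment
    exp_control = total * expected_share
    exp_treatment = total * (1 - expected_share)
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts: a small imbalance that random assignment can produce,
# and a large one that points to a broken assignment or logging pipeline.
p_small_gap = srm_p_value(30_152, 29_848)
p_large_gap = srm_p_value(31_000, 29_000)
```

A low p-value here means the split itself is suspect, so the lift estimate should not be trusted until the cause is found. With more than two arms, `scipy.stats.chisquare` handles the general case.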

The structure: design → simulate → validity → analyze → auto-generated DECISION_DOC.md. Same playbook Microsoft, Booking, and Airbnb publish.

CUPED θ ≈ 0.014

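A weak covariate barely shrinks the CI, and that's instructive. CUPED removes a share of variance equal to the squared correlation between covariate and outcome, so a covariate that barely tracks the outcome barely helps. In production you'd use last-30-day sessions + tenure and expect a 30-50% CI reduction.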
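The adjustment itself is short. Below is a minimal from-scratch sketch using the standard library, not the project's own script: `cuped_adjust` is a hypothetical name, and the data is synthetic, generated so that the covariate is deliberately informative.

```python
import random
from statistics import fmean, variance


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes and theta = cov(x, y) / var(x).

    y: in-experiment outcome per user; x: pre-experiment covariate per user.
    Theta is estimated on the pooled sample (both arms together).
    """
    mx, my = fmean(x), fmean(y)
    cov_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    theta = cov_xy / var_x
    # Subtracting theta * (x - mean(x)) leaves the mean of y unchanged
    # but removes the part of its variance explained by x.
    adjusted = [b - theta * (a - mx) for a, b in zip(x, y)]
    return adjusted, theta


# Synthetic data, for illustration only.
rng = random.Random(0)
pre = [rng.gauss(10, 3) for _ in range(5000)]
post = [0.5 * p + rng.gauss(0, 2) for p in pre]

adjusted, theta = cuped_adjust(post, pre)
reduction = 1 - variance(adjusted) / variance(post)
```

Here `reduction` is the fraction of outcome variance removed. When the covariate is nearly uncorrelated with the outcome, that fraction is close to zero, which is the situation described above.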

The CI doesn't touch zero

Conversion lift with 95% confidence interval.

The treatment lift sits entirely to the right of zero. Had the interval crossed zero, the result would be indistinguishable from noise. It doesn't.

Control · CVR 5.49%

— reference —

Treatment · CVR 6.01% · +9.4% relative

95% CI: [+0.14, +0.89] pp

Axis: lift in percentage points, from −1 pp to +1 pp, with 0 as the null.
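An interval like this can be approximated with a normal (Wald) interval for a difference in proportions. This is a sketch, not necessarily how the project computes it (the stack also lists a bootstrap CI). The counts assume an even 30,000 / 30,000 split and are back-calculated from the rounded rates, so they are illustrative rather than the real data.

```python
from statistics import NormalDist


def diff_in_proportions(conv_c, n_c, conv_t, n_t, level=0.95):
    """Absolute lift, Wald confidence interval, and two-sided z-test p-value."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Unpooled standard error of the difference between two proportions.
    se = (p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value


# Assumed counts: 1,647 / 30,000 = 5.49% and 1,803 / 30,000 = 6.01%.
diff, (lo, hi), p_value = diff_in_proportions(1_647, 30_000, 1_803, 30_000)
lift_pp = diff * 100  # absolute lift in percentage points
```

Under these assumed counts the interval lands close to the reported one and stays above zero. Small differences are expected, because the inputs were rebuilt from rounded percentages.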

Inside the framework

Six checks, four scripts.

Power analysis — how many users per arm before you press start. Answer: 48k.
SRM detection — chi-square overall and per segment. The check most candidates skip.
Pre-period AA test — placebo confirming random assignment is sound.
CUPED variance reduction — implemented from scratch using pre-experiment revenue.
Benjamini-Hochberg — applied to 8-segment breakouts to control false discoveries.
Auto-generated DECISION_DOC.md — script writes the ship/no-ship memo when it finishes.
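Two of these checks reduce to short formulas, sketched below with the standard library. `users_per_arm` uses the usual normal-approximation sample size for two proportions, and `benjamini_hochberg` is the step-up procedure. Both names are hypothetical, and the inputs are illustrative: the baseline, minimum detectable effect, and power behind the 48k figure are not stated here, so the example does not try to reproduce it.

```python
import math
from statistics import NormalDist


def users_per_arm(baseline, rel_mde, alpha=0.05, power=0.80):
    """Sample size per arm to detect a relative lift in a conversion rate."""
    p1 = baseline
    p2 = baseline * (1 + rel_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    n = (z_alpha + z_power) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return math.ceil(n)


def benjamini_hochberg(p_values, q=0.05):
    """Return one reject flag per hypothesis, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up: find the largest rank k with p_(k) <= (k / m) * q,
    # then reject every hypothesis ranked 1..k.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject


# Hypothetical inputs: 5.5% baseline, +10% relative minimum detectable effect.
n_needed = users_per_arm(0.055, 0.10)

# Hypothetical p-values for 8 segment breakouts.
segment_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
flags = benjamini_hochberg(segment_p)
```

Several of the hypothetical segment p-values fall below 0.05 on their own, yet only those under the rank-scaled threshold survive. That is how a handful of "significant" segments can shrink to very few, or to 0 of 8, after correction. In the stack listed here, `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` offers the same procedure.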

The stack

Built with.

Python 3.12 · scipy · statsmodels · numpy · CUPED from scratch · Bootstrap CI · Benjamini-Hochberg · Power analysis
