Power · SRM · CUPED · Benjamini-Hochberg · Decision doc

Most A/B-test portfolios stop at a t-test.
This one walks the full lifecycle.

Pre-experiment design. Mid-experiment guardrails. Post-experiment analysis with variance reduction. A written ship/no-ship memo.

+9.4% relative conversion lift
p = 0.0068 · SRM passes · 0/8 segments after BH
Conversion · 5.49% → 6.01% · p = 0.0068
Revenue/visitor · +$0.225 · p = 0.0097
SRM chi² · p = 0.215 · assignment sound
BH segments · 0 / 8 significant after correction

The scenario

Grey button → green button. Did it move conversion?

Decision rule, written before the experiment, not after: ship if the conversion lift is statistically significant at α = 0.05, practically significant at ≥ +5% relative, and revenue per visitor is not statistically worse.
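A rule like this is easy to pin down as code before any data arrives. The sketch below is one hedged reading of it, not the project's actual script: the `Readout` fields and the `ship` function are hypothetical names, and "practically significant" is assumed to mean the point estimate of relative lift clears the threshold.

```python
from dataclasses import dataclass


@dataclass
class Readout:
    """Hypothetical container for the numbers the decision rule needs."""
    rel_lift: float       # relative conversion lift, e.g. 0.094 for +9.4%
    p_conversion: float   # two-sided p-value for the conversion difference
    rpv_delta: float      # treatment minus control revenue per visitor
    p_rpv: float          # two-sided p-value for the revenue difference


def ship(r: Readout, alpha: float = 0.05, min_rel_lift: float = 0.05) -> bool:
    # Statistically significant AND in the right direction.
    stat_sig = r.p_conversion < alpha and r.rel_lift > 0
    # Practically significant: point estimate clears the pre-registered bar.
    practical = r.rel_lift >= min_rel_lift
    # Guardrail: revenue is "statistically worse" only if it dropped
    # and the drop itself is significant.
    revenue_worse = r.rpv_delta < 0 and r.p_rpv < alpha
    return stat_sig and practical and not revenue_worse


# Headline numbers from this page.
print(ship(Readout(rel_lift=0.094, p_conversion=0.0068,
                   rpv_delta=0.225, p_rpv=0.0097)))  # → True
```

Freezing the thresholds as default arguments makes it obvious if anyone moves the goalposts after the readout.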

Everyone runs the t-test. Almost nobody runs the sample-ratio-mismatch chi-square or applies Benjamini-Hochberg to segment breakouts. Skipping those is how you ship a broken experiment and call it a win.
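Both checks fit in a few lines. The following is a minimal sketch, assuming a planned 50/50 split and a false-discovery rate of q = 0.05. The function names, visitor counts, and segment p-values are made up for illustration and are not the experiment's data.

```python
import numpy as np
from scipy.stats import chisquare


def srm_pvalue(n_control, n_treatment, expected_ratio=0.5):
    """Chi-square goodness-of-fit test of observed arm sizes vs the planned split."""
    total = n_control + n_treatment
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    return chisquare([n_control, n_treatment], f_exp=expected).pvalue


def benjamini_hochberg(pvalues, q=0.05):
    """Return a boolean mask of hypotheses rejected at false-discovery rate q."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Step-up thresholds: the i-th smallest p-value is compared with q * i / m.
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])   # largest rank that passes
        reject[order[:k + 1]] = True        # reject it and everything smaller
    return reject


# Hypothetical arm sizes: a tiny p-value here would mean assignment is broken.
print(srm_pvalue(30_120, 29_880))

# Hypothetical p-values for 8 segments: one looks "significant" uncorrected.
segment_p = [0.02, 0.11, 0.25, 0.31, 0.44, 0.58, 0.73, 0.90]
print(benjamini_hochberg(segment_p).sum(), "/", len(segment_p))
```

In the made-up segment list, the 0.02 would pass a naive 0.05 cutoff but fails the first BH threshold of 0.05 × 1/8, which is the kind of false win the correction is there to catch.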

The structure: design → simulate → validity → analyze → auto-generated DECISION_DOC.md. Same playbook Microsoft, Booking, and Airbnb publish.
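The design step starts with a power calculation. As a rough sketch, the standard normal-approximation formula for two proportions gives the visitors needed per arm. The 5.49% baseline and +5% relative threshold come from this page; the 80% power target and the `n_per_arm` name are assumptions for illustration.

```python
from math import ceil, sqrt

from scipy.stats import norm


def n_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    """Visitors per arm for a two-sided two-proportion z-test (normal approximation)."""
    p_treat = p_base * (1 + rel_lift)
    p_bar = (p_base + p_treat) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value under the null
    z_power = norm.ppf(power)           # quantile for the desired power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p_base * (1 - p_base) + p_treat * (1 - p_treat)))
    return ceil((numerator / (p_treat - p_base)) ** 2)


# Baseline conversion from this page, minimum lift from the decision rule.
print(n_per_arm(0.0549, 0.05))
```

Because the required sample scales with the inverse square of the absolute difference, halving the detectable lift roughly quadruples the traffic needed.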

CUPED θ ≈ 0.014

A weak covariate barely shrinks the CI, and that's instructive. In production you'd use last-30-day sessions plus tenure and expect a 30-50% CI reduction.
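For reference, CUPED from scratch is short. The sketch below assumes one pre-experiment covariate x per user and estimates θ = cov(x, y) / var(x) on the pooled data from both arms. The simulated data and the `cuped_adjust` name are illustrative, not the project's code.

```python
import numpy as np


def cuped_adjust(y, x):
    """Return the CUPED-adjusted metric and theta for a pre-experiment covariate x."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    # Centering x keeps the mean of the adjusted metric equal to the mean of y.
    return y - theta * (x - x.mean()), theta


rng = np.random.default_rng(0)
x = rng.normal(size=10_000)                      # simulated pre-period covariate
y_strong = 2.0 * x + rng.normal(size=10_000)     # outcome strongly tied to x
y_weak = 0.1 * x + rng.normal(size=10_000)       # outcome barely tied to x

for name, y in [("strong", y_strong), ("weak", y_weak)]:
    y_adj, theta = cuped_adjust(y, x)
    kept = y_adj.var(ddof=1) / y.var(ddof=1)     # fraction of variance remaining
    print(f"{name}: theta={theta:.3f}, variance kept={kept:.2f}")
```

The variance that remains is 1 − ρ², where ρ is the correlation between covariate and outcome, so CI width shrinks by a factor of sqrt(1 − ρ²). A 30-50% narrower CI therefore needs a correlation of roughly 0.7 to 0.87.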

The CI doesn't touch zero

Conversion lift with 95% confidence interval.

The treatment lift sits entirely to the right of zero. If the interval crossed zero, the result would be indistinguishable from noise. It doesn't.

Control · CVR 5.49% (reference)
Treatment · CVR 6.01% · +9.4% relative
95% CI for the difference: [+0.14, +0.89] pp
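An interval like this can be sketched with an unpooled Wald interval for the difference in proportions. The page reports rates, not counts, so the 30,000 visitors per arm below is a hypothetical sample size chosen only so the rates match 5.49% and 6.01%. The `diff_ci` helper is likewise an illustration and may differ from the method the project used (the stack also lists a bootstrap CI).

```python
from math import sqrt

from scipy.stats import norm


def diff_ci(conv_c, n_c, conv_t, n_t, level=0.95):
    """Wald confidence interval for (treatment rate - control rate)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = norm.ppf(0.5 + level / 2)
    diff = p_t - p_c
    return diff - z * se, diff + z * se


# Hypothetical counts: 1,647 / 30,000 = 5.49% and 1,803 / 30,000 = 6.01%.
lo, hi = diff_ci(1_647, 30_000, 1_803, 30_000)
print(f"95% CI: [{lo * 100:+.2f}, {hi * 100:+.2f}] pp")
print("excludes zero:", lo > 0 or hi < 0)
```

Reporting the interval in percentage points, as above, keeps the absolute difference separate from the relative lift.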
Inside the framework

Six checks, four scripts.

The stack

Built with.

Python 3.12 · scipy · statsmodels · numpy · CUPED from scratch · Bootstrap CI · Benjamini-Hochberg · Power analysis