Pre-experiment design. Mid-experiment guardrails. Post-experiment analysis with variance reduction. A written ship/no-ship memo.
Decision rule, written before the experiment, not after: ship if conversion lift is significant at α = 0.05 and practically significant at ≥+5% relative and revenue per visitor is not statistically worse.
Everyone runs the t-test. Almost nobody runs the sample-ratio-mismatch chi-square or applies Benjamini-Hochberg to segment breakouts. Skipping those is how you ship a broken experiment and call it a win.
The structure: design → simulate → validity → analyze → auto-generated DECISION_DOC.md. Same playbook Microsoft, Booking, and Airbnb publish.
A weak covariate barely shrinks the CI — and that's instructive. In production you'd use last-30-day sessions + tenure and expect 30-50% CI reduction.
Treatment lift entirely on the right of zero. If the interval crossed zero, the result is noise. It doesn't.