Interpreting results
The experiment detail page has two tabs — Overview and Stats. Overview gives you a quick verdict and per-variant summary. Stats gives you the full statistical analysis. This page explains what every number means and how to act on it.
Overview tab
The Overview tab shows the stats engine's verdict at the top, followed by the Outcome metrics panel and the Variant comparison table.
Stats tab
Statistical Power card
Shows your progress toward the estimated target session count, comparing current_min_n with target_n. The bar turns green at 100%.
Example card: a "Low power" badge with the caption "Target estimated for 5% MDE at 80% power (α=5%), derived from baseline goal-completion rate." MDE is the minimum detectable effect.
- current_min_n — the smallest session count across all variants. The experiment is only as powerful as its smallest arm.
- target_n — estimated sessions per variant for 80% power to detect a 5% relative change in goal-completion rate at α=0.05.
- Low power — fewer than 30 sessions in the smallest arm. Results are unreliable.
- Insufficient — fewer than 10 sessions. No stats are computed.
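The numbers on this card can be approximated by hand. The sketch below is illustrative only: it assumes the target comes from the standard two-sided two-proportion z-test sample-size formula, which the product may or may not use, and the function names and the "OK" label are hypothetical. Only the thresholds (10 and 30 sessions) and the defaults (5% relative MDE, 80% power, α = 0.05) come from this page.

```python
import math
from statistics import NormalDist


def target_sessions_per_variant(baseline_rate, relative_mde=0.05,
                                alpha=0.05, power=0.80):
    """Approximate sessions per variant (assumed two-proportion z-test formula)."""
    treated_rate = baseline_rate * (1 + relative_mde)
    if not 0 < baseline_rate < 1 or treated_rate >= 1:
        raise ValueError("rates must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + treated_rate * (1 - treated_rate))
    delta = treated_rate - baseline_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)


def power_status(sessions_per_variant, target_n):
    """Return (current_min_n, progress ratio, status label)."""
    # The experiment is only as powerful as its smallest arm.
    current_min_n = min(sessions_per_variant.values())
    progress = current_min_n / target_n
    if current_min_n < 10:
        label = "Insufficient"   # no stats are computed
    elif current_min_n < 30:
        label = "Low power"      # results are unreliable
    else:
        label = "OK"             # hypothetical label, not taken from the UI
    return current_min_n, progress, label


target_n = target_sessions_per_variant(baseline_rate=0.50)
print(power_status({"baseline": 120, "treatment": 25}, target_n))
```

Under these assumptions a 50% baseline goal-completion rate needs several thousand sessions per variant, which is why an experiment can sit in "Low power" for a long time when the smallest arm receives little traffic.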
Statistical verdict
The verdict is computed from primary metrics only (goal completion and score). Cost, error rate, retry rate, and custom metrics appear in the forest plot for context but do not change the verdict — a better model that costs more is still worth shipping.
Forest plot (ComparisonPanel)
Each treatment variant gets a forest plot showing the estimated effect of switching from baseline to that treatment, per metric. In a multi-variant experiment, use the pill selector above the plot to choose which treatment to inspect.
Figure: annotated forest plot.
Ranking table (multi-variant)
When there are two or more treatment variants, the Stats tab shows a ranking table above the forest plot. Variants are sorted by goal-completion delta descending — the top row is the best-performing treatment. A ▲ mark highlights the current leader.
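The ordering rule is simple enough to reproduce when you export results. A minimal sketch, assuming each treatment is summarised by a hypothetical `TreatmentSummary` record (the class and field names are illustrative, not part of the product):

```python
from dataclasses import dataclass


@dataclass
class TreatmentSummary:
    name: str
    goal_completion_delta: float  # treatment minus baseline


def rank_treatments(treatments):
    """Sort by goal-completion delta, descending, and mark the leader."""
    ranked = sorted(treatments, key=lambda t: t.goal_completion_delta,
                    reverse=True)
    return [("▲ " if i == 0 else "  ") + t.name for i, t in enumerate(ranked)]


rows = rank_treatments([
    TreatmentSummary("prompt-v2", 0.012),
    TreatmentSummary("prompt-v3", 0.031),
    TreatmentSummary("prompt-v4", -0.004),
])
print("\n".join(rows))
```

The leader mark reflects the observed delta only; whether that lead is significant still depends on the tests described under Statistical methodology.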
Statistical methodology
Understanding the tests behind the numbers helps you know when to trust a verdict and when to collect more data.
Hypothesis tests
Significance level (α)
The default significance threshold is α = 0.05 (5% false-positive rate). A metric is marked significant when its p-value falls below α. "Trending" means p is between α and 2α — a real effect may exist but more data is needed.
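The labelling rule above can be written as a small function. This is a sketch rather than the engine's code: the function name is hypothetical, and treating the boundaries as α ≤ p < 2α for "trending" is an assumption, since the text only says "between".

```python
def classify_p_value(p_value, alpha=0.05):
    """Label a metric from its p-value (boundary handling is assumed)."""
    if p_value < alpha:
        return "significant"
    if p_value < 2 * alpha:
        return "trending"
    return "not significant"


for p in (0.03, 0.07, 0.20):
    print(p, classify_p_value(p))
```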
Bonferroni correction (multi-variant)
When you have k treatment variants, you run k simultaneous hypothesis tests against the baseline. If each test has a 5% false-positive rate and you run 7 independent tests, the probability of at least one false positive is 1 − 0.95^7 ≈ 30%. Bonferroni correction controls this by dividing α by k:
| Treatments (k) | Corrected α per test | Notes |
|---|---|---|
| 1 | 0.0500 | Standard two-variant experiment |
| 2 | 0.0250 | |
| 3 | 0.0167 | |
| 4 | 0.0125 | |
| 5 | 0.0100 | |
| 7 | 0.0071 | Example used in the surrounding text |
This means that with 7 treatments you need a much stronger signal (p < 0.0071) for any individual comparison to count as significant. Strong effects will still be detected, but borderline effects will be reported as not significant, and detecting a given effect takes more sessions than in a two-variant experiment.
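To check the arithmetic yourself, the sketch below computes the uncorrected family-wise error rate and the corrected threshold. The function names are illustrative, and the error-rate formula assumes the k tests are independent.

```python
def familywise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(k, alpha=0.05):
    """Per-test threshold after Bonferroni correction."""
    return alpha / k


for k in (1, 2, 3, 4, 5, 7):
    print(f"k={k}  corrected alpha={bonferroni_alpha(k):.4f}  "
          f"uncorrected FWER={familywise_error_rate(k):.3f}")
```

With the corrected threshold, the chance of any false positive across all 7 comparisons stays just under 5%.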
Sample Ratio Mismatch (SRM)
An SRM is flagged when the observed session counts deviate significantly from the configured traffic weights. For example, if you configured a 50/50 split but observe 60/40 over many sessions, variant assignment or session logging is probably broken.
- A red banner at the top of the Stats tab warns you when SRM is detected.
- SRM does not automatically invalidate results — but you should investigate the cause before shipping.
- Common causes: sessions not being created for every agent run, bugs in the variant assignment logic, or the experiment ID being wrong for some sessions.
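This page does not say which test drives the SRM banner. A chi-square goodness-of-fit test is a common choice, so the sketch below uses one as an assumption. The function names and the 0.001 threshold are illustrative and may not match the product.

```python
import math


def chi2_survival(x, df):
    """P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: finite closed-form sum.
        terms = sum(y ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-y) * terms
    # Odd df: start from the df = 1 case and add the recurrence terms.
    tail = math.erfc(math.sqrt(y))
    for i in range(1, (df - 1) // 2 + 1):
        tail += math.exp(-y) * y ** (i - 0.5) / math.gamma(i + 0.5)
    return tail


def srm_check(observed, weights, threshold=0.001):
    """Compare observed session counts with configured traffic weights.

    Returns (p_value, flagged). The threshold is an assumed convention.
    """
    total = sum(observed.values())
    weight_sum = sum(weights.values())
    chi2 = 0.0
    for variant, count in observed.items():
        expected = total * weights[variant] / weight_sum
        chi2 += (count - expected) ** 2 / expected
    p_value = chi2_survival(chi2, df=len(observed) - 1)
    return p_value, p_value < threshold


print(srm_check({"baseline": 600, "treatment": 400},
                {"baseline": 50, "treatment": 50}))
print(srm_check({"baseline": 505, "treatment": 495},
                {"baseline": 50, "treatment": 50}))
```

A 600/400 split on 1,000 sessions is flagged, while 505/495 is well within normal random variation. The same 60/40 ratio on only 10 sessions would not be flagged, which is why the check uses counts rather than percentages.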
Acting on results
- Ship it — at least one primary metric significantly improved, none significantly worsened. Promote the winning variant. See Shipping a winner.
- Revert — at least one primary metric significantly worsened, none significantly improved. Stick with the baseline.
- Mixed results — some primary metrics improved, others worsened. Requires a judgment call. Look at effect sizes, not just significance.
- Insufficient data — not enough sessions to run tests, or no primary metrics reported. Keep running.
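The four outcomes above amount to a small decision rule over the primary metrics. The sketch below is one way to express it, not the engine's implementation. The function name, the input shape, and the "Inconclusive" label for the case where nothing is significant are all assumptions, since this page does not name that case.

```python
def verdict(primary_outcomes, enough_sessions=True):
    """Map per-metric outcomes to a verdict.

    primary_outcomes maps a primary metric name to one of
    "improved", "worsened", or "neutral" (significance already applied).
    """
    if not enough_sessions or not primary_outcomes:
        return "Insufficient data"
    improved = any(o == "improved" for o in primary_outcomes.values())
    worsened = any(o == "worsened" for o in primary_outcomes.values())
    if improved and worsened:
        return "Mixed results"
    if improved:
        return "Ship it"
    if worsened:
        return "Revert"
    return "Inconclusive"  # hypothetical label for "nothing significant"


print(verdict({"goal_completion": "improved", "score": "neutral"}))
print(verdict({"goal_completion": "improved", "score": "worsened"}))
```

Cost, error rate, retry rate, and custom metrics are deliberately left out of the input, matching the rule that only primary metrics change the verdict.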