Limitations & constraints

Knowing the constraints upfront saves you from running into them mid-experiment. This page documents every hard limit, soft limit, and design decision that affects how you set up and run experiments.

Quick reference

Constraint	Limit	Notes
Maximum variants per experiment	`8`	Enforced at creation time. Driven by Bonferroni correction: with more than 7 treatment-vs-baseline tests, the corrected α becomes too small to be practical.
Maximum custom metrics	`10`	Per session. goal_completed and score are reserved and do not count toward the 10.
Variant lock	`After first session`	Cannot add, delete, or rename variants once any session is recorded for the experiment.
Metric reporting deadline	`Within the session`	Metrics must be reported before the session context manager exits. Metrics reported after session close are silently dropped.
Minimum sessions for stats	`10 per variant`	Fewer than 10 sessions returns insufficient_data — no hypothesis tests are run.
SRM detection threshold	`p < 0.01`	Chi-squared goodness-of-fit. Flagged in the Stats tab but does not automatically block results.
Concurrent experiments per session	`1`	A session can belong to at most one experiment. Passing two experiment IDs is not supported.

Variant lock after first session

Once the first session is recorded for an experiment, the variant list is permanently locked. You cannot:

Add a new variant arm
Delete an existing variant
Rename a variant
Change which variant is the baseline

Traffic weights can still be updated while the experiment is running — but large weight changes after data collection has started can cause a Sample Ratio Mismatch (SRM).

Warning: If you need to test a new variant mid-experiment, create a new experiment. Do not try to repurpose a running experiment by changing names or IDs in your agent code, because the sessions will be miscounted.

The UI enforces this by hiding the Add variant button and disabling the Delete button on all variant cards once the experiment has sessions.

8-variant maximum

You can have at most 8 variants per experiment (1 baseline + up to 7 treatments). The limit exists for two reasons:

Statistical correctness: with k simultaneous pairwise tests, Bonferroni correction sets the per-test α to 0.05/k. At k=7 treatments, each test requires p < 0.05/7 ≈ 0.0071 to count as significant. Beyond this, the corrected α becomes so small that you would need thousands of sessions per variant to detect realistic effect sizes.
Practical traffic dilution: 8 equal-weight variants each receive only 12.5% of traffic. With 100 sessions per day, that is about 12.5 sessions per variant per day, so reaching 400 sessions per variant takes 32 days.
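The arithmetic behind both reasons can be checked with a few lines of Python. This is a minimal sketch, not part of any SDK: the function names are hypothetical, and it assumes equal traffic weights and one treatment-vs-baseline test per treatment.

```python
import math

def bonferroni_alpha(variants, alpha=0.05):
    """Per-test alpha when every treatment is compared against the baseline."""
    treatments = variants - 1  # one arm is the baseline
    return alpha / treatments

def days_to_target(variants, sessions_per_day, target_per_variant):
    """Days needed for each arm to reach the target, assuming equal weights."""
    per_variant_per_day = sessions_per_day / variants
    return math.ceil(target_per_variant / per_variant_per_day)

print(bonferroni_alpha(8))          # per-test alpha for 1 baseline + 7 treatments
print(days_to_target(8, 100, 400))  # days for each arm to reach 400 sessions
```

Running the same functions with 4 variants shows how quickly both numbers improve when the experiment is narrower.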

Tip: For large-scale model searches, split the search into sequential experiments: experiment 1 with variants A–D, experiment 2 with variants E–H, then a final experiment comparing the best variant from each.

10 custom metric limit

Each session can report up to 10 custom numeric metrics in addition to the two reserved keys (goal_completed and score). Metrics beyond 10 are silently ignored.

Custom metrics must be reported as floats. Boolean values are accepted and stored as 1.0 / 0.0. There is no enforced range, but values in 0–1 produce the most readable forest plots.
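Because metrics beyond the limit are ignored without an error, it can help to validate the payload on the client before reporting it. The helper below is a hypothetical sketch, not an SDK function: it only encodes the rules stated above (two reserved keys, at most 10 custom keys, values coerced to float).

```python
RESERVED_KEYS = {"goal_completed", "score"}  # do not count toward the limit
MAX_CUSTOM_METRICS = 10

def prepare_metrics(raw):
    """Check the custom-metric limit and coerce every value to float."""
    custom = [key for key in raw if key not in RESERVED_KEYS]
    if len(custom) > MAX_CUSTOM_METRICS:
        raise ValueError(
            f"{len(custom)} custom metrics given, limit is {MAX_CUSTOM_METRICS}"
        )
    # float(True) is 1.0 and float(False) is 0.0, matching the stored form.
    return {key: float(value) for key, value in raw.items()}

payload = prepare_metrics({"score": 0.8, "cited_sources": True, "latency_s": 2})
print(payload)
```

Raising locally turns a silent drop into a visible failure during development.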

Peeking invalidates statistical guarantees

Looking at experiment results before reaching the target session count and making a decision based on what you see is called peeking. It inflates the false-positive rate far above α=0.05.

False-positive rate vs number of peeks at α=0.05

Number of looks	False-positive rate
1 (correct)	5% (the nominal α)
5	~19%
10	~29%
20	~40%
100	~63%

Source: Johari et al. (2015), "Peeking at A/B Tests". Approximate values for illustration.
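The inflation is easy to reproduce with a small A/A simulation, where there is no real effect and every "significant" result is a false positive. This is an illustrative sketch under simplifying assumptions (a one-sample z-test on standard-normal data, evenly spaced looks, stopping at the first |z| > 1.96), so its output will not match the figures above exactly.

```python
import itertools
import math
import random

def false_positive_rate(looks, n=400, sims=1000, seed=0, z_crit=1.96):
    """Share of null experiments declared significant at any of `looks` checks."""
    rng = random.Random(seed)  # fixed seed: the same data for every `looks` value
    checkpoints = [n * (i + 1) // looks for i in range(looks)]
    false_positives = 0
    for _ in range(sims):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true mean is zero
        prefix = list(itertools.accumulate(data))       # running sums
        # z-statistic after m observations is sum / sqrt(m) (known variance 1).
        if any(abs(prefix[m - 1] / math.sqrt(m)) > z_crit for m in checkpoints):
            false_positives += 1
    return false_positives / sims

for looks in (1, 5, 20):
    print(looks, false_positive_rate(looks))
```

With a single look the rate stays near the nominal 5%; adding looks can only raise it, because each extra check is another chance to cross the threshold by luck.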

Niitaka displays a peeking warning at the bottom of the Stats tab. The correct approach is:

Decide on your target session count before starting the experiment.
Only interpret results once the power bar reaches 100%.
If you must check early (e.g. one variant is clearly harmful), use the result to stop, not to ship.

Warning: Repeated peeking with early stopping is the single most common cause of false-positive experiment results in production ML systems. The 5% α threshold is only valid at a single pre-planned analysis.

Sample Ratio Mismatch

An SRM is detected when the observed session distribution deviates significantly from the configured traffic weights (chi-squared test, p < 0.01). It means sessions are not being assigned to variants as expected.
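To run the same kind of check on your own session counts, a chi-squared goodness-of-fit test can be written with the standard library alone. This is a sketch of the general technique, not the platform's implementation; the function names and the example counts are made up for illustration.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable."""
    t = x / 2.0
    # Closed-form starting points, then the recurrence
    # Q(a + 1, t) = Q(a, t) + t**a * exp(-t) / gamma(a + 1).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-t)
    else:
        a, q = 0.5, math.erfc(math.sqrt(t))
    while a < df / 2.0:
        q += t ** a * math.exp(-t) / math.gamma(a + 1.0)
        a += 1.0
    return q

def srm_check(observed, weights, threshold=0.01):
    """Return (statistic, p_value, flagged) for observed counts vs weights."""
    total = sum(observed.values())
    weight_sum = sum(weights.values())
    stat = 0.0
    for name, count in observed.items():
        expected = total * weights[name] / weight_sum
        stat += (count - expected) ** 2 / expected
    p_value = chi2_sf(stat, df=len(observed) - 1)
    return stat, p_value, p_value < threshold

weights = {"baseline": 50, "treatment": 50}
print(srm_check({"baseline": 520, "treatment": 480}, weights))  # mild imbalance
print(srm_check({"baseline": 600, "treatment": 400}, weights))  # strong imbalance
```

A 520/480 split on a 50/50 configuration is within normal sampling noise, while 600/400 is far beyond it and would be flagged.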

Common causes:

Sessions failing silently before the experiment ID is passed to the SDK — those failures are not counted but would have gone to a specific variant.
The agent code routing some sessions away from the experiment (e.g. a cached response that skips start_session).
Large traffic weight changes after data collection started.
Multiple deployments of the agent code, some with the experiment ID and some without.

Important: An SRM flag means the internal validity of the experiment is compromised. Results may be biased in an unknown direction. Investigate the root cause before drawing any conclusions.

Metric reporting deadline

Outcome metrics must be reported before the start_session() context manager exits. Once the session is closed, any subsequent report_metrics() calls are silently dropped on the backend.

Tip: If your evaluation logic runs asynchronously (e.g. LLM-as-judge scoring that happens after the agent finishes), wrap both the agent run and the evaluation inside the same session context, so the score is reported before the session closes.
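The pattern looks roughly like the sketch below. Only the names start_session and report_metrics come from this page; their signatures, the stub session class, the experiment ID, and the run_agent and judge helpers are all assumptions made so the example runs on its own. Substitute the real SDK calls in your code.

```python
from contextlib import contextmanager

class StubSession:
    """Stand-in for an SDK session. NOT the real client; for illustration only."""
    def __init__(self):
        self.closed = False
        self.metrics = {}

    def report_metrics(self, values):
        if self.closed:
            return  # mimics the documented behaviour: dropped without an error
        self.metrics.update(values)

@contextmanager
def start_session(experiment_id):
    # Hypothetical signature; the real start_session() may take other arguments.
    session = StubSession()
    try:
        yield session
    finally:
        session.closed = True

def run_agent(prompt):   # placeholder for your agent
    return f"answer to: {prompt}"

def judge(output):       # placeholder for a slow LLM-as-judge call
    return 1.0 if output.startswith("answer") else 0.0

with start_session("exp_placeholder") as session:
    output = run_agent("What is 2 + 2?")
    score = judge(output)                      # evaluation still inside the block
    session.report_metrics({"score": score})   # recorded

session.report_metrics({"late_metric": 0.5})   # after exit: dropped
print(session.metrics)
```

The key point is that the with block does not end until the judge has returned, so the report happens while the session is still open.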
