Data-driven model selection
Ship the best model, not just any model
A/B test any variable — model, temperature, system prompt, or guardrails — with live production traffic. Get statistically grounded results, then promote the winner in one click.
Traffic splitting
Assign any traffic weight to each variant: route 10% to GPT-4o and 90% to Claude, or use any other split.
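To make the idea of a weighted split concrete, here is a minimal sketch of one common way to do it: hash the session ID to a stable point in [0, 1) and walk the cumulative weights. This is an illustration only, not a description of how Niitaka assigns traffic; the function name `assign_variant` and the variant labels are hypothetical.

```python
import hashlib


def assign_variant(session_id: str, weights: dict[str, float]) -> str:
    """Deterministically map a session to a variant in proportion to its weight.

    Hypothetical helper for illustration; not part of any Niitaka SDK.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    # Hash the session ID to a stable point in [0, 1).
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64

    # Walk the cumulative (normalized) weights until the point falls in a bucket.
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return name
    return name  # guards against floating-point rounding at the top edge


# Usage: a 10/90 split over 10,000 simulated sessions.
weights = {"gpt-4o": 10, "claude": 90}
counts = {name: 0 for name in weights}
for i in range(10_000):
    counts[assign_variant(f"session-{i}", weights)] += 1
print(counts)  # roughly 10% / 90%
```

Hashing rather than drawing a random number keeps the assignment stable: the same session ID always lands in the same variant, which avoids a session switching variants mid-run.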
Live metrics comparison
Cost, latency (p50/p95), steps, error rate, and retry rate — updated in real time as sessions complete.
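For readers unfamiliar with p50/p95, the sketch below shows how such per-variant summaries could be computed from completed sessions using the nearest-rank percentile method. The session field names (`latency_ms`, `cost_usd`, `error`) are assumptions made for this example, and the product may use a different percentile definition.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(sessions: list[dict]) -> dict:
    """Aggregate one variant's sessions (hypothetical record shape)."""
    latencies = [s["latency_ms"] for s in sessions]
    n = len(sessions)
    return {
        "sessions": n,
        "avg_cost_usd": sum(s["cost_usd"] for s in sessions) / n,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "error_rate": sum(1 for s in sessions if s["error"]) / n,
    }


sessions = [
    {"latency_ms": 10, "cost_usd": 0.01, "error": False},
    {"latency_ms": 20, "cost_usd": 0.02, "error": False},
    {"latency_ms": 30, "cost_usd": 0.03, "error": True},
    {"latency_ms": 40, "cost_usd": 0.04, "error": False},
]
summary = summarize(sessions)
print(summary)
```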
Baseline comparison
Set a baseline variant and see percentage deltas for every metric across all other variants.
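A percentage delta is simply the relative change of a variant's metric against the baseline's value. The sketch below shows the arithmetic; the metric names are illustrative assumptions, not the product's actual field names.

```python
from typing import Optional


def percent_deltas(
    baseline: dict[str, float], variant: dict[str, float]
) -> dict[str, Optional[float]]:
    """Return (variant - baseline) / baseline * 100 for each shared metric.

    A baseline of zero has no defined percentage change, so it maps to None.
    """
    deltas: dict[str, Optional[float]] = {}
    for metric, base in baseline.items():
        if metric not in variant:
            continue
        if base == 0:
            deltas[metric] = None
        else:
            deltas[metric] = (variant[metric] - base) / base * 100
    return deltas


baseline = {"cost_usd": 0.020, "latency_p95_ms": 2000.0, "error_rate": 0.0}
candidate = {"cost_usd": 0.015, "latency_p95_ms": 2500.0, "error_rate": 0.01}
deltas = percent_deltas(baseline, candidate)
print(deltas)  # cost about -25%, p95 latency +25%, error rate undefined (None)
```

For metrics such as cost, latency, and error rate, a negative delta is an improvement over the baseline.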
Automatic verdict
Niitaka computes a verdict for you (ship, revert, mixed, or insufficient_data) and reports a confidence level alongside it.
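The page does not say how the verdict is derived. Purely to illustrate what the four labels could mean, here is a hypothetical decision rule over percentage deltas in which lower is better for every metric. The thresholds, the function name, and the rule itself are assumptions for this sketch, and it does not attempt to compute a confidence level.

```python
def verdict(
    deltas: dict[str, float],
    sessions_per_variant: int,
    min_sessions: int = 50,      # assumed cutoff, not a documented value
    threshold_pct: float = 5.0,  # assumed minimum change that counts as meaningful
) -> str:
    """Illustrative rule only; not Niitaka's actual algorithm."""
    if sessions_per_variant < min_sessions:
        return "insufficient_data"

    # Lower is better for cost, latency, and error rate.
    improved = [m for m, d in deltas.items() if d <= -threshold_pct]
    regressed = [m for m, d in deltas.items() if d >= threshold_pct]

    if improved and not regressed:
        return "ship"
    if regressed and not improved:
        return "revert"
    # Either metrics moved in both directions or nothing moved meaningfully.
    return "mixed"


print(verdict({"cost_usd": -25.0, "latency_p95_ms": -8.0}, sessions_per_variant=120))
print(verdict({"cost_usd": -25.0, "latency_p95_ms": 25.0}, sessions_per_variant=120))
print(verdict({"cost_usd": -25.0}, sessions_per_variant=12))
```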
Promote to version
The winning variant becomes a new AgentVersion in one click, with its model, prompt, and guardrails all captured.
Per-variant guardrails
Test different cost limits and retry strategies per variant without touching the Policies table.
At a glance
- Weighted traffic split across unlimited variants
- Test model, temperature, system prompt, or guardrails
- Live cost / latency / error metrics per variant
- Statistical verdict: ship · revert · mixed · insufficient_data
- One-click promotion of the winner to a production version
Common questions
Do I need to deploy new code to run an A/B test?
No. You create the experiment and configure its variants entirely in the dashboard. The SDK fetches the active variant assignment at session start via runtime config — your agent code does not change. To switch from variant A to variant B, you adjust the traffic weights in the dashboard.
How many sessions do I need for a statistically valid result?
It depends on the effect size you expect. Niitaka shows a confidence level alongside each verdict and flags a result as insufficient_data when the sample is too small. As a rough guide, 50 or more sessions per variant give a meaningful signal for cost and latency comparisons when the difference between variants is large. To detect smaller effects you may need several hundred sessions per variant.
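To see why the required sample grows as the effect shrinks, the sketch below applies the standard normal-approximation formula for comparing two means, n = 2 · ((z_{α/2} + z_β) · σ / δ)² per variant. It is a back-of-the-envelope estimate under textbook assumptions (equal variance, two-sided test, 5% significance, 80% power), not the calculation Niitaka performs; the 500 ms standard deviation is an invented example value.

```python
import math
from statistics import NormalDist


def sessions_per_variant(
    std_dev: float,
    min_detectable_diff: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate sessions needed per variant to detect a given difference in means."""
    if std_dev <= 0 or min_detectable_diff <= 0:
        raise ValueError("std_dev and min_detectable_diff must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * std_dev / min_detectable_diff) ** 2
    return math.ceil(n)


# Latency with an assumed standard deviation of 500 ms:
print(sessions_per_variant(std_dev=500, min_detectable_diff=200))  # 99
print(sessions_per_variant(std_dev=500, min_detectable_diff=100))  # 393
```

Halving the difference you want to detect roughly quadruples the sessions required, which is why subtle improvements need far more traffic than dramatic ones.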
Ready to get started?
Connect your first agent in under 5 minutes. Free to start, no credit card required.