Experimentation

What is an Experiment?

An experiment is a randomised, controlled comparison of two or more agent configurations — running simultaneously in production — with statistical analysis to determine which performs better on the metrics that matter to you.

Instead of guessing whether switching from GPT-4o to Gemini Flash will hurt your goal completion rate, you run both in parallel, collect real production data, and let the stats engine tell you — with confidence intervals and a p-value — whether the difference is real or noise.

Anatomy of an experiment

Every experiment has four main ingredients:

Experiment: The container that holds all variants, controls the traffic split, and owns the statistical results.
Variant: A named configuration bundle (model, system prompt, guardrail thresholds). One variant is designated the baseline (control arm).
Session: Each agent run is assigned to exactly one variant. Sessions report outcome metrics that feed the stats engine.
Baseline: The control variant that all treatments are compared against. Typically your current production configuration.
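
To make the relationship between these four pieces concrete, here is a minimal sketch in plain Python. It does not use the niitaka-sdk API: the class names, fields, and the hash-based assignment rule are all assumptions made for illustration, and the source does not say how Niitaka actually assigns sessions to variants.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Variant:
    """Hypothetical configuration bundle; field names are illustrative."""
    name: str
    model: str
    system_prompt: str
    weight: float = 1.0        # relative share of traffic
    is_baseline: bool = False  # True for the control arm


@dataclass
class Experiment:
    """Hypothetical container that owns the variants and the traffic split."""
    name: str
    variants: list = field(default_factory=list)

    def baseline(self) -> Variant:
        # Exactly one variant must be the control arm.
        baselines = [v for v in self.variants if v.is_baseline]
        if len(baselines) != 1:
            raise ValueError("exactly one variant must be the baseline")
        return baselines[0]

    def assign(self, session_id: str) -> Variant:
        """Map a session to exactly one variant, in proportion to the weights.

        Assumption: hashing the session ID gives a stable, roughly uniform
        point in [0, total_weight), so the same session always lands on the
        same variant.
        """
        if not self.variants:
            raise ValueError("experiment has no variants")
        total = sum(v.weight for v in self.variants)
        digest = hashlib.sha256(f"{self.name}:{session_id}".encode()).digest()
        point = int.from_bytes(digest[:8], "big") / 2**64 * total
        cumulative = 0.0
        for variant in self.variants:
            cumulative += variant.weight
            if point < cumulative:
                return variant
        return self.variants[-1]  # guard against floating-point rounding


experiment = Experiment(
    name="model-swap",
    variants=[
        Variant("control", "gpt-4o", "You are a helpful agent.", is_baseline=True),
        Variant("treatment", "gemini-2.5-flash", "You are a helpful agent."),
    ],
)
print(experiment.baseline().name)
print(experiment.assign("session-123").name)
```

Deterministic hashing is only one possible design. Its appeal is that a retried or resumed session keeps its variant without any stored assignment table.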

Experiment lifecycle

An experiment moves through three states. Transitions are one-way.

Draft

  • Add / remove variants
  • Set baseline & weights
  • No sessions yet

Running

  • Sessions assigned to variants
  • Metrics collected in real time
  • Variant structure locked

Completed

  • Results final
  • Promote winning variant
  • Experiment archived

Draft is the only state in which you can add or remove variants, change traffic weights, or swap the baseline. Once the first session is recorded, the experiment is locked into Running and its variant structure cannot change. You complete the experiment manually once you have enough data.
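
The one-way lifecycle can be sketched as a small state machine. This is an illustrative model only, not Niitaka's implementation: the class and method names are hypothetical, and the rule that an experiment can only be completed from Running is an assumption the source does not spell out.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    RUNNING = "running"
    COMPLETED = "completed"


class LifecycleError(Exception):
    """Raised when an operation is not allowed in the current state."""


class ExperimentLifecycle:
    """Hypothetical model of the Draft -> Running -> Completed flow."""

    def __init__(self):
        self.state = State.DRAFT
        self.variants = []
        self.session_count = 0

    def add_variant(self, name):
        # Variant structure may only change while the experiment is a draft.
        if self.state is not State.DRAFT:
            raise LifecycleError("variants are locked once sessions exist")
        self.variants.append(name)

    def record_session(self, variant):
        if self.state is State.COMPLETED:
            raise LifecycleError("completed experiments accept no sessions")
        if variant not in self.variants:
            raise LifecycleError(f"unknown variant: {variant}")
        # The first recorded session locks the experiment into Running.
        self.state = State.RUNNING
        self.session_count += 1

    def complete(self):
        # Assumption: completion is a manual step taken from Running only.
        if self.state is not State.RUNNING:
            raise LifecycleError("only a running experiment can be completed")
        self.state = State.COMPLETED


lifecycle = ExperimentLifecycle()
lifecycle.add_variant("control")
lifecycle.add_variant("treatment")
lifecycle.record_session("control")  # Draft -> Running
lifecycle.complete()                 # Running -> Completed
print(lifecycle.state)
```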

How experiments fit into Niitaka

Experiments sit on top of the session and event infrastructure. Your instrumented agent reports to Niitaka as normal; the experiment layer attaches variant metadata to each session and the stats engine aggregates them automatically.

Data flows through four stages, in order:

  • Your Agent: Instrumented with niitaka-sdk
  • Sessions + Events: Raw telemetry stored per variant
  • Stats Engine: Aggregates metrics, runs hypothesis tests
  • Verdict: Ship / Revert / Mixed / Insufficient data
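
For intuition about what a hypothesis test on a metric such as goal completion rate involves, here is a generic two-proportion z-test using only the standard library. The source does not say which test Niitaka's stats engine runs, so treat this as a textbook sketch under that assumption rather than a description of the engine. The function name and the example counts are made up.

```python
import math


def two_proportion_ztest(success_a, n_a, success_b, n_b, z_crit=1.96):
    """Compare completion rates of baseline (a) and treatment (b).

    Returns (difference, two-sided p-value, confidence interval).
    z_crit=1.96 corresponds to an approximate 95% interval.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    diff = p_b - p_a

    # Pooled standard error, used for the test statistic under H0: p_a == p_b.
    pooled = (success_a + success_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled if se_pooled > 0 else 0.0

    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Unpooled standard error, used for the interval around the difference.
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - z_crit * se_diff, diff + z_crit * se_diff)
    return diff, p_value, ci


# Made-up counts: 140 of 200 baseline sessions and 160 of 200 treatment
# sessions reached their goal.
diff, p_value, ci = two_proportion_ztest(140, 200, 160, 200)
print(f"difference={diff:.3f} p={p_value:.4f} ci=({ci[0]:.3f}, {ci[1]:.3f})")
```

A confidence interval that excludes zero together with a small p-value is the kind of evidence that would support calling the difference real rather than noise.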

Why run an experiment?

  • Model evaluation — Compare GPT-4o vs Gemini 2.5 Flash on your actual workload, not on benchmarks that may not reflect your use case.
  • Prompt engineering — Measure whether a new system prompt actually improves goal completion, rather than relying on vibes from manual testing.
  • Guardrail tuning — Find the cost-limit threshold that reduces spend without meaningfully reducing quality.
  • Multi-provider failover — Quantify the quality difference of your fallback provider before traffic hits it in an emergency.

When not to use an experiment

  • Bug fixes — If you're fixing a broken prompt, just fix it. You don't need statistical evidence that the bug was bad.
  • Infrastructure changes — Switching from one embedding provider to another for cost reasons, where both produce equivalent results, doesn't need an experiment.
  • Very low traffic — Experiments need volume to reach statistical power. If your agent runs fewer than ~20 sessions per day, reaching 200 sessions per variant takes weeks. Consider synthetic load testing instead.

Tip: Start with a two-variant experiment (current config vs one change). Multi-variant experiments (up to 8 variants) are supported but require more sessions to reach significance after Bonferroni correction.
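
The arithmetic behind the low-traffic warning and the multi-variant tip can be worked through in a few lines. This sketch assumes traffic is split evenly across variants and that each treatment is compared once against the baseline, so there are (variants - 1) comparisons. Both function names are hypothetical, and the 0.05 significance level is a common default rather than a documented Niitaka setting.

```python
import math


def days_to_target(sessions_per_day, num_variants, target_per_variant=200):
    """Days needed for every variant to reach the target, with an even split."""
    return math.ceil(target_per_variant * num_variants / sessions_per_day)


def bonferroni_alpha(alpha, num_variants):
    """Per-comparison significance level when each treatment faces the baseline."""
    comparisons = num_variants - 1
    if comparisons < 1:
        raise ValueError("need a baseline and at least one treatment")
    return alpha / comparisons


# Two variants at 20 sessions/day: 400 sessions in total.
print(days_to_target(20, 2))      # 20 days
# Eight variants at the same traffic: 1600 sessions in total.
print(days_to_target(20, 8))      # 80 days
# With eight variants, each of the 7 comparisons must clear a stricter bar.
print(bonferroni_alpha(0.05, 8))
```

The stricter per-comparison threshold is why adding variants costs more than a proportional increase in sessions: each arm gets less traffic and each comparison needs stronger evidence.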

Next steps