Experimentation
Running experiments
Once your variants are set up and the experiment is in Running state, you attach sessions to it via the SDK and report outcome metrics at the end of each run. The stats engine updates automatically as data arrives.
Prerequisites
- The experiment must be in running state (not draft or completed).
- At least two variants must exist with the exact names you'll pass to start_session().
- You have the experiment ID. Copy it from the Experiments dashboard URL or the experiment detail page.
Attaching sessions to an experiment
Pass experiment_id and variant to start_session(), then call session.report_metrics() once the agent finishes:
import niitaka
import openai

client = openai.OpenAI()

with niitaka.start_session(
    goal="Classify support ticket",
    agent_id="support-classifier",
    experiment_id="exp_a1b2c3d4e5f6",  # from Experiments dashboard
    variant="gpt-4o-mini",             # the variant this run is assigned to
) as session:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": ticket_text}],
    )
    label = parse_label(response.choices[0].message.content)

    # Report outcome after the agent run completes
    session.report_metrics({
        "goal_completed": label is not None,
        "score": confidence_score(label),         # float 0–1
        "accuracy": label == ground_truth_label,  # custom metric
    })

Outcome metrics
Metrics are what the stats engine compares across variants. Two keys are reserved and treated as primary metrics:
- goal_completed (bool): whether the task was finished.
- score (float, 0–1): overall quality of the run.
You can add up to 10 additional custom metrics. Any float value works, but keeping them in the 0–1 range makes the forest plot easy to read:
session.report_metrics({
    # ── Required ─────────────────────────────────────────────
    "goal_completed": True,  # bool: was the task finished?
    "score": 0.84,           # float: overall quality, 0–1
    # ── Custom (up to 10 additional keys) ────────────────────
    "relevance": 0.91,
    "factuality": 0.88,
    "instruction_following": 0.95,
    "output_quality": 0.86,
})

goal_completed and score are the only metrics that drive the statistical verdict (Ship / Revert / Mixed). Custom metrics appear in the forest plot and ranking table for context but do not change the verdict. See Interpreting results for details.

Deferred outcome scoring
You don't have to score a session before it closes. Many quality signals (LLM-as-judge evaluation, downstream pipeline success, or async human review) are only available minutes or hours after the agent run finishes. niitaka.client.report_metrics() accepts a session ID and can be called up to 7 days after the session was created, from any process.
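If scoring happens in a queue worker or cron job, it can help to check the 7-day window before attempting a late report. The following is a minimal sketch, not part of the Niitaka SDK: it assumes you record the session's creation time yourself, and the names REPORTING_WINDOW, reporting_deadline, and can_still_report are hypothetical. It also assumes the 7-day limit is inclusive, which the text above does not specify.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: the window is exactly 7 days from session creation (per the text above).
REPORTING_WINDOW = timedelta(days=7)


def reporting_deadline(created_at: datetime) -> datetime:
    """Latest moment at which metrics can still be reported for a session."""
    return created_at + REPORTING_WINDOW


def can_still_report(created_at: datetime, now: Optional[datetime] = None) -> bool:
    """True if `now` falls within the reporting window (boundary treated as inclusive)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now <= reporting_deadline(created_at)


# Example: a session created on 1 March can be scored on 5 March but not on 9 March.
created = datetime(2025, 3, 1, 12, 0, tzinfo=timezone.utc)
print(can_still_report(created, datetime(2025, 3, 5, 12, 0, tzinfo=timezone.utc)))
print(can_still_report(created, datetime(2025, 3, 9, 12, 0, tzinfo=timezone.utc)))
```

A worker could skip or flag sessions for which can_still_report returns False instead of sending a report that would arrive too late.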
import niitaka

# ── Step 1: run the agent, capture the session ID ────────────────────────────
with niitaka.start_session(
    goal="Summarise quarterly report",
    agent_id="report-summariser",
    experiment_id="exp_a1b2c3d4e5f6",
    variant="gpt-4o",
) as session_id:
    output = run_agent(task)
    # session_id is a plain string, safe to serialise and store

# ── Step 2: score asynchronously (same process, queue worker, or cron) ───────
score, cost_usd = llm_judge.evaluate(output)  # your scoring logic
niitaka.client.report_metrics(session_id, {
    "goal_completed": score > 0.7,
    "score": score,
    "_judge_cost_usd": cost_usd,  # optional; surfaces judge spend in Niitaka
})
# report_metrics can be called up to 7 days after session creation.

Calls to report_metrics merge additively: existing keys are not removed, so you can call it multiple times as different scoring signals arrive. The _judge_cost_usd key is optional but recommended when you run an LLM call to produce the score; it keeps judge spend visible in the session cost breakdown.

How many sessions do you need?
The stats engine estimates the target session count for you based on the baseline's observed goal-completion rate, assuming a 5% minimum detectable effect (MDE) at 80% power (α=0.05). You can see this in the Statistical Power card on the Stats tab.
Statistical power progress (example: target = 400 sessions/variant)
- A typical target is 200–500 sessions per variant for goal-completion rates around 60–80%.
- The power bar turns amber below 30%, green at 100%. Interpret results only after it reaches 100%.
- Multi-variant experiments need the same sessions per variant but take longer to reach significance because of Bonferroni correction — see Interpreting results.
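To see why the Bonferroni correction slows multi-variant experiments down, it helps to look at the per-comparison significance level it implies. The sketch below is illustrative only and is not Niitaka's implementation: it assumes each non-baseline variant is compared against the baseline (so there are n_variants − 1 comparisons), and bonferroni_alpha is a hypothetical helper name.

```python
def bonferroni_alpha(n_variants: int, alpha: float = 0.05) -> float:
    """Per-comparison significance level under a Bonferroni correction.

    Assumption: every non-baseline variant is compared against the baseline,
    giving n_variants - 1 comparisons in total.
    """
    if n_variants < 2:
        raise ValueError("an experiment needs at least two variants")
    n_comparisons = n_variants - 1
    return alpha / n_comparisons


# With more variants, each comparison must clear a stricter threshold.
for k in (2, 3, 5):
    print(f"{k} variants -> per-comparison alpha = {bonferroni_alpha(k):.4f}")
```

Under these assumptions, a two-variant experiment keeps the full α of 0.05, while a five-variant experiment tests each comparison at 0.0125, so the same per-variant session count yields a significant result less often.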
Running both variants concurrently
Variants must run at the same time, not sequentially. Running variant A in January and variant B in February conflates the treatment effect with seasonal or data distribution changes. For batch loads, interleave jobs across variants:
import random
from concurrent.futures import ThreadPoolExecutor

import niitaka

EXPERIMENT_ID = "exp_a1b2c3d4e5f6"  # from Experiments dashboard
VARIANTS = ["gpt-4o", "gemini-2.5-flash"]
# Placeholder mapping from variant name to model ID; adjust to your setup
VARIANT_MODELS = {"gpt-4o": "gpt-4o", "gemini-2.5-flash": "gemini-2.5-flash"}


def run_one(variant: str, task: dict) -> dict:
    with niitaka.start_session(
        goal=task["goal"],
        agent_id="report-summariser",
        experiment_id=EXPERIMENT_ID,
        variant=variant,
    ) as session:
        result = run_agent(task, model=VARIANT_MODELS[variant])
        session.report_metrics({
            "goal_completed": result.success,
            "score": result.quality_score,
        })
    return {"variant": variant, "ok": result.success}


tasks = load_tasks()                              # your production workload
jobs = [(v, t) for t in tasks for v in VARIANTS]  # every task runs on every variant
random.shuffle(jobs)                              # avoid ordering bias

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda j: run_one(*j), jobs))

Monitoring in real time
The experiment dashboard refreshes its metrics and verdict every time you load or refresh the page. The Stats tab updates the forest plot and ranking table as sessions accumulate — you can watch the confidence intervals narrow as power increases.
When to stop
- Power reached 100% and you have a clear verdict: stop now and act on it.
- Power reached 100% but the result is insufficient_data or mixed: the effect may be smaller than your MDE. Either continue to collect more data or accept that there's no meaningful difference.
- One variant is clearly harming users: stop early. No stat test is worth leaving users on a broken arm.
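The stopping rules above can be sketched as a small decision function, for example to drive an alert or a scheduled check. This is an illustration under stated assumptions, not an SDK feature: stop_decision is a hypothetical helper, the verdict strings are assumed to match the labels used on this page, and how you obtain the power percentage and verdict is left to you.

```python
def stop_decision(power_pct: float, verdict: str, harming_users: bool = False) -> str:
    """Return "stop", "continue", or "judgment_call" following the rules above.

    Assumptions: power_pct is the Statistical Power progress (0-100) and
    verdict is one of "ship", "revert", "mixed", "insufficient_data".
    """
    # User harm overrides everything else: stop regardless of power.
    if harming_users:
        return "stop"
    # Results should only be interpreted once power reaches 100%.
    if power_pct < 100:
        return "continue"
    # Full power but no clear winner: collect more data or accept no difference.
    if verdict.lower() in {"insufficient_data", "mixed"}:
        return "judgment_call"
    # Full power and a clear verdict: act on it.
    return "stop"


print(stop_decision(100, "ship"))
print(stop_decision(100, "mixed"))
print(stop_decision(45, "ship"))
print(stop_decision(20, "insufficient_data", harming_users=True))
```

Returning a separate "judgment_call" value keeps the inconclusive case explicit, since the choice between collecting more data and accepting no difference depends on context the function cannot see.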