Experimentation

Running experiments

Once your variants are set up and the experiment is in the Running state, you attach sessions to it via the SDK and report outcome metrics at the end of each run. The stats engine updates automatically as data arrives.

Prerequisites

  • The experiment must be in the Running state (not Draft or Completed).
  • At least two variants must exist with the exact names you'll pass to start_session().
  • You have the experiment ID — copy it from the Experiments dashboard URL or the experiment detail page.

Attaching sessions to an experiment

Pass experiment_id and variant to start_session(), then call session.report_metrics() once the agent finishes:

python
import niitaka
import openai

client = openai.OpenAI()

with niitaka.start_session(
    goal="Classify support ticket",
    agent_id="support-classifier",
    experiment_id="exp_a1b2c3d4e5f6",   # from Experiments dashboard
    variant="gpt-4o-mini",               # the variant this run is assigned to
) as session:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": ticket_text}],
    )
    label = parse_label(response.choices[0].message.content)

    # Report outcome after the agent run completes
    session.report_metrics({
        "goal_completed": label is not None,
        "score": confidence_score(label),          # float 0–1
        "accuracy": label == ground_truth_label,   # custom metric
    })

Outcome metrics

Metrics are what the stats engine compares across variants. Two keys are reserved and treated as primary metrics:

  • goal_completed: Boolean. Whether the agent fully accomplished the task. Drives the Goal completion rate comparison, which is usually the most important primary metric.
  • score: Float 0–1. An overall quality signal such as confidence, a human rating, or an LLM-as-judge score. Used as the second primary metric.

You can add up to 10 additional custom metrics — any float value works, but keeping them in the 0–1 range makes the forest plot easy to read:

python
session.report_metrics({
    # ── Required ─────────────────────────────────────────────
    "goal_completed": True,     # bool  — was the task finished?
    "score":          0.84,     # float — overall quality, 0–1

    # ── Custom (up to 10 additional keys) ────────────────────
    "relevance":             0.91,
    "factuality":            0.88,
    "instruction_following": 0.95,
    "output_quality":        0.86,
})
Note: goal_completed and score are the only metrics that drive the statistical verdict (Ship / Revert / Mixed). Custom metrics appear in the forest plot and ranking table for context but do not change the verdict. See Interpreting results for details.
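
The constraints above (two reserved keys, at most 10 custom keys, numeric values) can be checked locally before a metrics dict is reported. The following is a minimal sketch, not part of the Niitaka SDK: validate_metrics and its messages are hypothetical names, and it assumes both reserved keys are required, as the comment in the example above suggests.

```python
RESERVED_KEYS = ("goal_completed", "score")
MAX_CUSTOM_METRICS = 10  # limit stated in this guide


def validate_metrics(metrics: dict) -> list[str]:
    """Return a list of problems found; an empty list means the dict looks fine.

    Hypothetical local helper, not a Niitaka API.
    """
    problems = []

    if not isinstance(metrics.get("goal_completed"), bool):
        problems.append("goal_completed must be a bool")

    score = metrics.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        problems.append("score must be a number")
    elif not 0.0 <= score <= 1.0:
        problems.append("score must be in the range 0-1")

    custom = {k: v for k, v in metrics.items() if k not in RESERVED_KEYS}
    if len(custom) > MAX_CUSTOM_METRICS:
        problems.append(
            f"too many custom metrics: {len(custom)} > {MAX_CUSTOM_METRICS}"
        )
    for key, value in custom.items():
        # Booleans pass this check and behave as 0/1
        if not isinstance(value, (int, float)):
            problems.append(f"{key} must be a number")

    return problems
```

Calling it just before session.report_metrics() and logging any returned problems is one way to catch malformed payloads early. Whether the SDK itself rejects such payloads is not stated here.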

Deferred outcome scoring

You don't have to score a session before it closes. Many quality signals (LLM-as-judge evaluation, downstream pipeline success, or async human review) are only available minutes or hours after the agent run finishes. niitaka.client.report_metrics() accepts a session ID and can be called from any process, up to 7 days after the session was created.

python
import niitaka

# ── Step 1: run the agent, capture the session ID ────────────────────────────
with niitaka.start_session(
    goal="Summarise quarterly report",
    agent_id="report-summariser",
    experiment_id="exp_a1b2c3d4e5f6",
    variant="gpt-4o",
) as session_id:
    output = run_agent(task)
    # session_id is a plain string — safe to serialise and store

# ── Step 2: score asynchronously (same process, queue worker, or cron) ───────
score, cost_usd = llm_judge.evaluate(output)   # your scoring logic

niitaka.client.report_metrics(session_id, {
    "goal_completed": score > 0.7,
    "score":          score,
    "_judge_cost_usd": cost_usd,   # optional — surfaces judge spend in Niitaka
})
# report_metrics can be called up to 7 days after session creation.
Note: Calls to report_metrics merge additively: existing keys are not removed, so you can call it multiple times as different scoring signals arrive. The _judge_cost_usd key is optional but recommended when you run an LLM call to produce the score; it keeps judge spend visible in the session cost breakdown.
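
Because deferred scores are only accepted within 7 days of session creation, a queue worker or cron job may want to skip sessions that are already too old. The sketch below assumes you stored the session's creation time yourself alongside the session ID. can_still_report is a hypothetical helper, and treating the boundary as inclusive is an assumption. How the SDK responds to a late call is not covered here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

REPORTING_WINDOW = timedelta(days=7)  # window stated in this guide


def can_still_report(created_at: datetime, now: Optional[datetime] = None) -> bool:
    """True if `now` is within 7 days of the session's creation time.

    `created_at` must be timezone-aware (for example, stored in UTC).
    Hypothetical helper, not a Niitaka API.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - created_at <= REPORTING_WINDOW
```

A worker could call can_still_report(created_at) before running an expensive judge, so that judge spend is not wasted on sessions that can no longer be scored.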

How many sessions do you need?

The stats engine estimates the target session count for you based on the baseline's observed goal-completion rate, assuming a 5% minimum detectable effect (MDE) at 80% power (α = 0.05). You can see this in the Statistical Power card on the Stats tab.
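
To build intuition for how baseline rate, MDE, power, and α interact, here is a sketch of the textbook sample-size formula for a two-sided two-proportion z-test. It is not Niitaka's calculation: the engine's exact method is not documented on this page, and this sketch assumes the MDE is an absolute difference in goal-completion rate. Its output can differ substantially from the Statistical Power card, so treat the card as authoritative. sessions_per_variant is a hypothetical name.

```python
import math
from statistics import NormalDist


def sessions_per_variant(baseline_rate: float, mde: float = 0.05,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-variant sample size for a two-sided two-proportion z-test.

    Assumes `mde` is an ABSOLUTE difference in goal-completion rate.
    Textbook formula, not Niitaka's stats engine.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("baseline_rate and baseline_rate + mde must be in (0, 1)")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    p_bar = (p1 + p2) / 2

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

The key takeaway is qualitative: halving the effect you want to detect roughly quadruples the sessions required, which is why small improvements take much longer to confirm.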

Statistical power progress (example: target = 400 sessions/variant)

  Sessions per variant    Power bar
  50                      13%
  100                     26%
  200                     52%
  300                     78%
  400                     100% (≥ target)

  • A typical target is 200–500 sessions per variant for goal-completion rates around 60–80%.
  • The power bar turns amber below 30%, green at 100%. Interpret results only after it reaches 100%.
  • Multi-variant experiments need the same number of sessions per variant but take longer to reach significance because of the Bonferroni correction; see Interpreting results.
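
The Bonferroni correction mentioned above divides the significance level by the number of comparisons, so each comparison faces a stricter threshold. The sketch below assumes every non-baseline variant is compared against a single baseline. How Niitaka counts comparisons is covered in Interpreting results, not here, and bonferroni_alpha is a hypothetical name.

```python
def bonferroni_alpha(n_variants: int, alpha: float = 0.05) -> float:
    """Per-comparison significance level under a Bonferroni correction.

    Assumes each non-baseline variant is compared against one baseline,
    giving n_variants - 1 comparisons.
    """
    if n_variants < 2:
        raise ValueError("need at least two variants")
    comparisons = n_variants - 1
    return alpha / comparisons
```

With two variants the threshold is unchanged. With four variants each comparison is tested at α/3, which is why the same data takes longer to cross the significance line.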

Running both variants concurrently

Variants must run at the same time, not sequentially. Running variant A in January and variant B in February conflates the treatment effect with seasonal or data distribution changes. For batch loads, interleave jobs across variants:

python
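import random
from concurrent.futures import ThreadPoolExecutor

import niitaka

EXPERIMENT_ID = "exp_a1b2c3d4e5f6"   # from Experiments dashboard
VARIANTS = ["gpt-4o", "gemini-2.5-flash"]
# Placeholder mapping from variant name to the model your agent calls
VARIANT_MODELS = {"gpt-4o": "gpt-4o", "gemini-2.5-flash": "gemini-2.5-flash"}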

def run_one(variant: str, task: dict) -> dict:
    with niitaka.start_session(
        goal=task["goal"],
        agent_id="report-summariser",
        experiment_id=EXPERIMENT_ID,
        variant=variant,
    ) as session:
        result = run_agent(task, model=VARIANT_MODELS[variant])
        session.report_metrics({
            "goal_completed": result.success,
            "score":          result.quality_score,
        })
    return {"variant": variant, "ok": result.success}

tasks = load_tasks()   # your production workload
jobs  = [(v, t) for t in tasks for v in VARIANTS]   # interleave variants
random.shuffle(jobs)   # avoid ordering bias

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda j: run_one(*j), jobs))
Tip: Shuffling the job list before dispatching reduces time-based ordering bias. If your workload gets harder later in the day, shuffling gives both variants a comparable share of easy and hard tasks.
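
For live traffic, where there is no job list to shuffle, variants still need to run side by side. One common approach, offered here only as a sketch and not described in the Niitaka docs, is to derive the variant from a hash of a stable key so that the same user or ticket always lands in the same arm. assign_variant and unit_id are hypothetical names.

```python
import hashlib


def assign_variant(unit_id: str, experiment_id: str, variants: list[str]) -> str:
    """Deterministically map a unit (user, ticket, request) to one variant.

    Hypothetical helper: hashing the experiment ID together with the unit ID
    keeps assignments stable within an experiment and independent across
    experiments.
    """
    digest = hashlib.sha256(f"{experiment_id}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

The returned name can be passed as the variant argument to start_session(). It must match one of the variant names configured for the experiment.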

Monitoring in real time

The experiment dashboard refreshes its metrics and verdict every time you load or refresh the page. The Stats tab updates the forest plot and ranking table as sessions accumulate — you can watch the confidence intervals narrow as power increases.

Warning: Do not make shipping decisions based on early results. Looking at results before reaching the target session count inflates false-positive rates. The peeking warning on the Stats tab reminds you of this.
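
The effect of peeking can be seen in a small simulation. The sketch below runs A/A experiments (both arms share the same true goal-completion rate, so any "significant" result is a false positive) and compares checking once at the target with checking every 50 sessions and stopping at the first significant look. It uses a plain pooled two-proportion z-test as a stand-in and is not Niitaka's stats engine. All names and defaults are illustrative.

```python
import math
import random


def z_significant(x_a: int, x_b: int, n: int, z_crit: float = 1.96) -> bool:
    """Two-sided pooled two-proportion z-test with n sessions per arm."""
    pooled = (x_a + x_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False  # all successes or all failures: no detectable difference
    return abs(x_a / n - x_b / n) / se > z_crit


def simulate_aa(n_experiments: int = 500, target: int = 400,
                look_every: int = 50, rate: float = 0.7, seed: int = 0):
    """Return (false-positive rate with peeking, false-positive rate without)."""
    rng = random.Random(seed)
    peek_hits = final_hits = 0
    for _ in range(n_experiments):
        x_a = x_b = 0
        significant_at_any_look = False
        for n in range(1, target + 1):
            x_a += rng.random() < rate   # one simulated session per arm
            x_b += rng.random() < rate
            if n % look_every == 0 and z_significant(x_a, x_b, n):
                significant_at_any_look = True
        peek_hits += significant_at_any_look
        final_hits += z_significant(x_a, x_b, target)
    return peek_hits / n_experiments, final_hits / n_experiments
```

Running simulate_aa() and comparing the two returned rates shows how much more often a team that acts on the first significant look would ship a change that has no real effect.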

When to stop

  • Power reached 100% and you have a clear verdict — stop now and act on it.
  • Power reached 100% but the result is insufficient_data or mixed — the effect may be smaller than your MDE. Either continue to collect more data or accept that there's no meaningful difference.
  • One variant is clearly harming users: stop early. No statistical test is worth leaving users on a broken arm.
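
The three rules above can be written as a small decision helper, for example to drive an alert or a daily report. This is a sketch under assumptions: the verdict strings ("ship", "revert", "mixed", "insufficient_data") and the function name next_action are illustrative, and how you obtain the power percentage and verdict from Niitaka is not covered here.

```python
def next_action(power_pct: float, verdict: str, harming_users: bool = False) -> str:
    """Map experiment status to one of four actions, following the rules above.

    Verdict strings are assumed spellings, not confirmed API values.
    """
    if harming_users:
        return "stop_early"            # user harm overrides the statistics
    if power_pct < 100:
        return "keep_collecting"       # target session count not reached yet
    if verdict in ("insufficient_data", "mixed"):
        # Effect may be smaller than the MDE: collect more data, or accept
        # that there is no meaningful difference.
        return "judgment_call"
    return "stop_and_act"              # clear verdict at full power
```

The harm check comes first on purpose: it applies regardless of power, whereas the other rules only apply once the power bar has reached 100%.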