Experimentation

Setting up variants

A variant is a named configuration bundle — model, system prompt, guardrail thresholds — that defines one arm of your experiment. Every session is assigned to exactly one variant.

What a variant contains

Each variant stores an optional config object with three top-level sections:

llm.model: The model identifier passed to your LLM provider, e.g. "gpt-4o-mini", "gemini-2.0-flash".

llm.temperature: Sampling temperature (0–2). If omitted, your agent's default is used.

llm.system_prompt: The system prompt injected at session start via Runtime Config.

llm.max_tokens: Token cap for completions.

guardrails.cost_limit_usd: Hard cost ceiling per session in USD.

guardrails.max_steps: Maximum LLM calls before the session is aborted.

guardrails.retry: Retry policy, made up of max_retries and backoff_seconds.

guardrails.fallback: Model to fall back to if the primary fails.
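As a rough illustration, the fields above could be pictured as a nested Python dictionary. The nesting is inferred from the dotted field names, the values are made up, and the `config_problems` helper (including its "positive cost limit" and "at least one step" rules) is a hypothetical local sanity check, not part of Niitaka.

```python
# Hypothetical variant config, shaped after the dotted field names listed above.
variant_config = {
    "llm": {
        "model": "gpt-4o-mini",
        "temperature": 0.2,
        "system_prompt": "You summarise reports in three bullet points.",
        "max_tokens": 512,
    },
    "guardrails": {
        "cost_limit_usd": 0.50,
        "max_steps": 10,
        "retry": {"max_retries": 2, "backoff_seconds": 1.5},
        "fallback": "gemini-2.0-flash",
    },
}


def config_problems(config):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    llm = config.get("llm", {})
    guardrails = config.get("guardrails", {})

    # The 0-2 range comes from the field description; omitted values are allowed.
    temperature = llm.get("temperature")
    if temperature is not None and not 0 <= temperature <= 2:
        problems.append("llm.temperature must be between 0 and 2")

    # Assumed sanity rules (not stated in the docs): limits should be positive.
    cost_limit = guardrails.get("cost_limit_usd")
    if cost_limit is not None and cost_limit <= 0:
        problems.append("guardrails.cost_limit_usd must be positive")

    max_steps = guardrails.get("max_steps")
    if max_steps is not None and max_steps < 1:
        problems.append("guardrails.max_steps must be at least 1")

    return problems


print(config_problems(variant_config))
```

Running a check like this before pasting values into the dashboard can catch an out-of-range temperature early.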

Note: The config is optional. You can run an experiment purely by passing a different variant name to start_session() and handling model selection in your own code; Niitaka will still group and compare the sessions statistically.

Creating variants in the dashboard

1. Open the experiment
Go to Experiments and open a Draft experiment. If you haven't created one yet, click New experiment and link it to your agent.
2. Add a variant
Click Add variant. Give it a short, descriptive name that matches what you'll pass to start_session(variant=...), for example gpt-4o or gemini-flash. Names are case-sensitive.
3. Configure the variant
Fill in the LLM and guardrail fields you want to test. Fields left blank inherit your agent's default policy configuration.
4. Repeat for all arms
Add one variant per arm. A two-arm experiment (one baseline + one treatment) is the simplest and reaches significance fastest. You can add up to 8 variants total.
5. Set the baseline
One variant must be marked as the baseline (control arm). All statistical comparisons are made relative to it. Typically this is your current production config.

Warning: Once the first session is recorded, the variant list is locked. You cannot add, delete, or rename variants while the experiment is running.

Traffic weights

Each variant has a traffic_weight. Sessions are distributed proportionally — a weight of 1.0 on each of three variants gives a 33/33/33 split. You can skew the split if you want to limit exposure to an untested variant:

Equal split (recommended): baseline (weight 1.0) 50%, treatment (weight 1.0) 50%

Skewed split (limit exposure): baseline (weight 3.0) 75%, new-model (weight 1.0) 25%

Weights are relative, not percentages. 0.5 / 0.5 and 1.0 / 1.0 produce the same 50/50 split.
For multi-variant experiments, equal weights maximise statistical power for a fixed total session count.
You can update weights while an experiment is Running, but a large shift may cause a Sample Ratio Mismatch (SRM).
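The arithmetic behind relative weights can be sketched in a few lines. The helpers below are hypothetical and run locally: `expected_shares` normalises weights into fractions, and `srm_chi_square` computes a plain chi-square statistic comparing observed session counts with those fractions, which is one common way to look for a Sample Ratio Mismatch. How Niitaka itself detects SRM is not described here, and the requirement that every weight be positive is an assumption of this sketch.

```python
def expected_shares(weights):
    """Turn relative traffic weights into fractions that sum to 1."""
    if any(w <= 0 for w in weights.values()):
        raise ValueError("this sketch assumes every weight is positive")
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def srm_chi_square(observed, weights):
    """Chi-square statistic of observed session counts vs. the expected split.

    Larger values mean the observed split is further from the configured one.
    """
    shares = expected_shares(weights)
    total_sessions = sum(observed.values())
    statistic = 0.0
    for name, share in shares.items():
        expected = share * total_sessions
        statistic += (observed.get(name, 0) - expected) ** 2 / expected
    return statistic


weights = {"baseline": 3.0, "new-model": 1.0}
print(expected_shares(weights))
print(srm_chi_square({"baseline": 800, "new-model": 200}, weights))
```

With two arms the statistic has one degree of freedom, so it can be compared against a chi-square table at whatever significance threshold you choose.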

Assigning sessions to variants

Pass experiment_id and variant to start_session(). Both are required for the session to be counted in the experiment.

python

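import niitaka                  # assumes the SDK is importable under this name
from openai import OpenAI

openai_client = OpenAI()        # reads OPENAI_API_KEY from the environment

with niitaka.start_session(
    goal="Summarise quarterly report",
    agent_id="report-summariser",
    experiment_id="exp_a1b2c3d4e5f6",   # copy from Experiments dashboard
    variant="gpt-4o",                    # must match the variant name you created exactly
) as session:
    # The model is chosen in your own code here, matching the variant name
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "..."}],
    )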

Important: The variant value must match the variant name in the dashboard exactly (case-sensitive). A typo still attaches the session to the experiment, but with no matching variant, so the session silently drops out of every arm's count.
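Because a mistyped name fails silently, it may be worth guarding against typos in your own code before calling start_session(). The sketch below is one possible approach and makes an assumption: `KNOWN_VARIANTS` is a hypothetical set that you maintain by hand from the dashboard, and `checked_variant` is a local helper, not an SDK function.

```python
# Hypothetical: variant names copied by hand from the Experiments dashboard.
KNOWN_VARIANTS = {"gpt-4o", "gemini-flash"}


def checked_variant(name, known=KNOWN_VARIANTS):
    """Return name unchanged if it is a known variant, else raise ValueError."""
    if name in known:  # exact, case-sensitive comparison
        return name
    # Offer a hint when only the letter case differs.
    hints = sorted(k for k in known if k.lower() == name.lower())
    hint = f" Did you mean {hints[0]!r}?" if hints else ""
    raise ValueError(f"Unknown variant {name!r}.{hint}")


print(checked_variant("gpt-4o"))
```

Passing `variant=checked_variant("gpt-4o")` turns a silent miscount into an immediate error during development.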

Using Runtime Config with variants

If you filled in the variant's LLM and guardrail fields, the SDK can automatically fetch that config at session start. This lets you write one agent that adapts to whichever variant it's assigned to, without if/else branches.

python

with niitaka.start_session(
    goal="Summarise quarterly report",
    agent_id="report-summariser",
    experiment_id="exp_a1b2c3d4e5f6",
    variant="gpt-4o",
) as session:
    # SDK auto-fetches this variant's config at session start
    config = niitaka.get_runtime_config(agent_id="report-summariser")

    model         = config["llm"]["model"]          # e.g. "gpt-4o"
    system_prompt = config["llm"]["system_prompt"]
    cost_limit    = config["guardrails"]["cost_limit_usd"]

    response = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user",   "content": "..."},
        ],
    )
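Since the variant config is optional and individual fields can be left blank, indexing straight into the config may fail when a field is absent. The following is a minimal defensive sketch. It assumes blank fields arrive either as missing keys or as None, which is not confirmed here, and `read_setting` and `completion_kwargs` are hypothetical helpers rather than SDK functions.

```python
def read_setting(config, section, key, default=None):
    """Return config[section][key], or default when it is missing or None."""
    section_values = (config or {}).get(section) or {}
    value = section_values.get(key)
    return default if value is None else value


def completion_kwargs(config, default_model):
    """Build keyword arguments for a chat completion call from a variant config.

    Optional fields are included only when the variant actually sets them,
    so the provider's own defaults apply otherwise.
    """
    kwargs = {"model": read_setting(config, "llm", "model", default_model)}
    for key in ("temperature", "max_tokens"):
        value = read_setting(config, "llm", key)
        if value is not None:
            kwargs[key] = value
    return kwargs


example = {"llm": {"model": "gpt-4o-mini", "temperature": 0.0, "max_tokens": None}}
print(completion_kwargs(example, default_model="gpt-4o"))
```

The result can be unpacked into the completion call, for example `create(messages=..., **completion_kwargs(config, "gpt-4o"))`.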

Naming conventions

Use the model name for model comparison experiments: gpt-4o, gemini-2.5-flash.
Use descriptive slugs for prompt or guardrail experiments: v1-concise-prompt, high-cost-limit.
Avoid spaces and special characters; use lowercase letters, numbers, and hyphens only.
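These conventions can be checked mechanically with a regular expression. The pattern below is an assumption rather than a rule enforced by Niitaka: it accepts lowercase letters, digits, and hyphens, and additionally allows dots between segments so that model names such as gemini-2.5-flash pass. `is_valid_variant_name` is a hypothetical helper.

```python
import re

# Segments of lowercase letters/digits, joined by single hyphens or dots.
# Allowing dots is an assumption made so that "gemini-2.5-flash" passes.
VARIANT_NAME = re.compile(r"[a-z0-9]+(?:[-.][a-z0-9]+)*")


def is_valid_variant_name(name):
    """True if the whole name follows the naming convention sketched above."""
    return VARIANT_NAME.fullmatch(name) is not None


for candidate in ("gpt-4o", "gemini-2.5-flash", "GPT 4o", "high_cost_limit"):
    print(candidate, is_valid_variant_name(candidate))
```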
