// sandbox.console
A tour through the console — from environments to training.
// sandbox.use_cases
Click a tab to explore each use case in detail.
Teams ship having tested only the happy path. Edge cases and adversarial inputs surface in production, where users find them first.
Hundreds of realistic scenarios auto-generated from your agent's code, production logs, and past incidents.
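For a concrete picture, here is one way a generated scenario might be represented. This is a sketch only; the field names are illustrative, not the sandbox's actual schema.

    from dataclasses import dataclass

    @dataclass
    class Scenario:
        """Illustrative shape of an auto-generated scenario (hypothetical fields)."""
        id: str
        source: str                # e.g. "production_log", "incident", "code_path"
        user_goal: str             # what the simulated user is trying to accomplish
        persona: str               # tone and behavior of the simulated user
        expected_outcome: str      # what a passing run must achieve
        adversarial: bool = False  # marks edge-case or abuse scenarios

    refund_edge_case = Scenario(
        id="scn-0042",
        source="incident",
        user_goal="Get a refund on an order that was already partially refunded",
        persona="frustrated, gives order details out of sequence",
        expected_outcome="Agent computes the remaining refundable amount and confirms before acting",
        adversarial=True,
    )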
Swapping a model or tweaking a prompt silently breaks previously passing scenarios. Regressions surface in production.
Run the same scenarios against two variants side by side. Everything held constant but the change — clear diff of what improved and what regressed.
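A minimal sketch of the comparison step, assuming you already have per-scenario pass/fail results from each variant's run (the function and field names are illustrative, not the sandbox's API):

    def diff_variants(baseline: dict[str, bool], candidate: dict[str, bool]):
        """Per-scenario diff between two runs of the same scenario set.
        Both dicts map scenario id -> passed, as produced by grading each run."""
        improved = [s for s, ok in candidate.items() if ok and not baseline.get(s, False)]
        regressed = [s for s, ok in baseline.items() if ok and not candidate.get(s, False)]
        return improved, regressed

    improved, regressed = diff_variants(
        {"refund-edge": False, "multi-tool": True},
        {"refund-edge": True,  "multi-tool": False},
    )
    # improved == ["refund-edge"], regressed == ["multi-tool"]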
No unit tests for non-deterministic agents. When something fails, the cause is buried across turns, tool calls, and context — manual debugging is painfully slow.
Auto-generated graders evaluate every run. Failed traces get automatic root cause analysis with concrete fix recommendations.
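As a sketch of what a grading result could carry, with hypothetical field names rather than the sandbox's real schema:

    from dataclasses import dataclass

    @dataclass
    class GradeReport:
        """Illustrative per-run grading result (hypothetical fields)."""
        scenario_id: str
        passed: bool
        score: float               # 0.0-1.0, from the auto-generated grader
        root_cause: str | None     # populated only for failed runs
        suggested_fix: str | None

    report = GradeReport(
        scenario_id="refund-edge",
        passed=False,
        score=0.35,
        root_cause="Agent issued a refund before checking prior partial refunds (turn 4)",
        suggested_fix="Require a refund-history lookup before any refund action",
    )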
No compiler, no unit tests. Prompt tweaks and model swaps ship blind — regressions only surface when users complain.
Every PR triggers a full simulation run, compared against the main-branch baseline. Merges below your pass-rate threshold are blocked automatically.
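One way such a gate could be wired into CI, sketched as a plain Python check; the numbers would come from the PR's simulation run and the stored main-branch summary, and the names here are illustrative:

    import sys

    def gate(candidate_passes: int, total: int, baseline_rate: float, tolerance: float = 0.0) -> None:
        """Fail the CI job when the PR's pass rate drops below the main-branch baseline."""
        rate = candidate_passes / total
        if rate + tolerance < baseline_rate:
            print(f"FAIL: pass rate {rate:.1%} is below baseline {baseline_rate:.1%}")
            sys.exit(1)
        print(f"OK: pass rate {rate:.1%} (baseline {baseline_rate:.1%})")

    gate(candidate_passes=192, total=200, baseline_rate=0.95)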
Fine-tuning needs labeled, domain-specific data. Manual labeling is expensive; production logs are noisy and unlabeled. No way to verify the tuned model actually improved.
Every simulation produces verified labeled data as a byproduct. Grader scores become reward signals — export, or run managed SFT/GRPO directly against the sandbox.
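A sketch of the export idea, assuming each graded run carries its conversation, grader score, and pass/fail verdict (the field names are hypothetical):

    import json

    def to_training_records(graded_runs):
        """Turn graded simulation runs into labeled fine-tuning data. The grader
        score doubles as a reward signal for RL-style post-training; high-scoring
        runs can be kept as SFT examples."""
        return [
            {
                "messages": run["messages"],   # full conversation, incl. tool calls
                "reward": run["score"],        # grader score as reward signal
                "label": "pass" if run["passed"] else "fail",
            }
            for run in graded_runs
        ]

    # Export to JSONL for a fine-tuning job; graded_runs would come from the sandbox.
    graded_runs = [{"messages": [{"role": "user", "content": "hi"}], "score": 0.9, "passed": True}]
    with open("sandbox_export.jsonl", "w") as f:
        for rec in to_training_records(graded_runs):
            f.write(json.dumps(rec) + "\n")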
A human manually tweaks prompts, runs evals, iterates. Slow, biased by intuition, and capped by available engineering hours.
A researcher agent edits your prompts and configs, runs the full suite, keeps wins, discards regressions — inspired by autoresearch. Wake up to 100+ validated iterations.
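The loop itself is simple. A skeleton of the propose-evaluate-keep cycle, with propose and evaluate as placeholders for the researcher agent and a full simulation run:

    def optimize(config, propose, evaluate, iterations=100):
        """Skeleton of the researcher-agent loop: propose an edit, re-run the suite,
        keep the change only if the pass rate improves. `propose` stands in for an
        LLM editing prompts/configs; `evaluate` stands in for a full simulation run."""
        best_score = evaluate(config)
        for _ in range(iterations):
            candidate = propose(config)
            score = evaluate(candidate)
            if score > best_score:       # keep wins, discard regressions
                config, best_score = candidate, score
        return config, best_score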
// sandbox.services
Pre-built, LLM-powered mock services your agent can interact with inside the sandbox. All stateful, all realistic.
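To illustrate what stateful means here, a toy stand-in for one such service; the real mocks are LLM-powered, but the point is that state persists across an agent's tool calls within a run:

    # Toy stand-in for a stateful mock service (illustrative only).
    class MockPaymentsService:
        def __init__(self):
            self._refunded: dict[str, float] = {}

        def issue_refund(self, order_id: str, amount: float) -> dict:
            self._refunded[order_id] = self._refunded.get(order_id, 0.0) + amount
            return {"order_id": order_id, "refunded_total": self._refunded[order_id]}

    svc = MockPaymentsService()
    svc.issue_refund("ord-123", 20.0)
    svc.issue_refund("ord-123", 5.0)    # state carries over: refunded_total == 25.0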