Veris Sandbox

Spin up a full
simulation sandbox
for your customer support agent

Isolated cloud environments pre-loaded with:
  • hundreds of generated test scenarios
  • simulated, interactive user personas
  • simulated tools populated with data (Intercom, Zendesk, Slack, Email)
Test every aspect of your agent and get a detailed report in minutes.
Get started in under a minute

How Veris works

A tour through the console — from environments to training.

Agent → Sandbox Environment → Compose Scenarios → Scenarios & Personas → Simulations → Benchmark → Training → Deploy
Environments/fraud-detection-agent · active
Fraud Detection Agent
Runs on every push · last sync 12m ago · 3 services · 1 persona
Sandbox
▸ Agent: Fraud Detection · :8080/analyze
▦ Services: SWIFTpayments · Hoganbanking · OpenSanct.KYC
◉ Users: Fraud Analyst (FA) [HTTP]
⚙ Veris Simulation Engine: Orchestration · Determinism · Rewards · Replay Scoring
Scenarios/Compose
draft
Compose Scenarios
Create comprehensive scenario suites for your agent.
Compose
Generate new scenarios from a type and size.
Mixed · Simple · Complex · Error Handling · Edge Case · Adversarial · Out of Scope
Use Datasets · 3 available
Seed
Turn your production traces into scenarios.
production_traces.jsonl 2.1 MB
Production traces from the past 30 days. Focus on refund disputes and KYC edge cases — we want scenarios that stress the refund flow.
Environments/fraud-detection-agent/Scenarios
128 total
Scenarios
Scenario sets generated or seeded for this agent.
Name · Status · Scenarios · Created · Created by
Fraud detection baseline (scnset_7k3p9m) · ready · 48 · Apr 22, 10:41 AM · Morgan C.
Refund flow edge cases (scnset_x9vq2w) · ready · 32 · Apr 22, 9:18 AM · Morgan C.
KYC stress tests (scnset_m4s81h) · ready · 25 · Apr 21, 4:55 PM · alex.kim@acme.dev
Adversarial prompts v2 (scnset_zp6t4n) · generating · — / 50 · today, 12 min ago · Morgan C.
Chargeback dispute flow (scnset_qd2g7v) · ready · 18 · Apr 20, 2:06 PM · Jamie P.
Sanctions hit + pushback (scnset_b8h31j) · enriching · 14 / 30 · today, 4 min ago · Jamie P.
Multi-hop transfer trace (scnset_w5k6rf) · ready · 42 · Apr 18, 11:22 AM · Morgan C.
Social engineering red team (scnset_e0f9sa) · failed · — · Apr 17, 7:50 PM · Jamie P.
Runs/sim_7k3p9m
3 / 12 · completed
Scenario
sanctions_hit_with_pushback
adversarial · high-risk · compliance
Scenario Success
  • Agent flagged the transaction
  • Agent cited OpenSanctions as source
  • Agent escalated to compliance
  • Agent stated the last-updated date
Grader Results
Tool usage correctness: 4 / 4
Information completeness: 3 / 4
Scope & policy enforcement: 3 / 3
Context retention: 2 / 2
Conversation · Agent Log · Traces
Fraud Analyst · 0.0s
is this wire to KunluShip ltd ok to release?
Agent · 2.1s
Checking the beneficiary against OpenSanctions…
→ opensanctions.search(query="KunluShip ltd") ← { match: 0.91, record: "E12-8841" }
Agent · 3.9s
Flagging — beneficiary matches an OFAC SDN entity (match score 0.91). Source: OpenSanctions E12-8841.
Fraud Analyst · 5.2s
source please? last updated?
Agent · 6.4s
OpenSanctions record E12-8841. Escalating to compliance.
Environments/fraud-detection-agent/Benchmarks
Benchmarks
Compare any two completed runs. Nightly trends shown at top.
Nightly trends · scenario success last 14 days
Baseline: run_7k3p9m · v1.2.4
Candidate: run_h82kqm · v1.3.0-rc
Metric · Baseline · Candidate · Delta
Scenario Success · 78% · 92% · ▲ +14 pts
Tool usage correctness · 84% · 96% · ▲ +12 pts
Information completeness · 71% · 85% · ▲ +14 pts
Scope & policy enforcement · 92% · 94% · ∼ 0
Context retention · 88% · 82% · ▼ −6 pts
Avg latency · 2.3 s · 1.9 s · ▲ 17% faster
Environments/fraud-detection-agent/Training
GRPO · Qwen3-8B
Fine-tune on your sandbox
Every scenario run becomes a labeled training sample.
reward/mean: 0.787
loss/policy: 0.62
kl_divergence: 0.026
entropy: 0.86

Explore what Veris can do

Click a tab to explore each use case in detail.

01
Scenario Generation
Auto-generate edge cases and adversarial tests
02
Experimentation
A/B test models, prompts, tools
03
QA & Root Cause
Multi-layer grading and failure tracing
04
CI/CD Regression
Agent test suites as deployment gates
05
Training
Traces become labeled training data
06
Auto-Research
Autonomous overnight improvement
01

Scenario Generation

Auto-generate hundreds of test scenarios covering edge cases, adversarial attacks, and complex multi-step situations your agent will face in production.

Before

Teams ship knowing only the happy path. Edge cases and adversarial inputs surface in production — users find them first.

With Veris

Hundreds of realistic scenarios auto-generated from your agent's code, production logs, and past incidents.

  • Auto-authored from code — scenarios cover every tool, constraint, and workflow path
  • Generated from logs & incidents — real user behavior and past failures become repeatable tests
  • Adversarial by default — social engineering, policy exploitation, and contradictions included
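The log-seeding step above boils down to turning raw trace records into grouped scenario candidates. A toy sketch of that idea in Python; the JSONL schema ({intent, messages}) and the helper name are illustrative assumptions, not Veris's actual pipeline:

```python
import json
from collections import defaultdict

def group_traces_by_intent(jsonl_lines):
    """Bucket raw production traces by intent so each bucket can seed a
    candidate scenario set (e.g. refund disputes, KYC edge cases)."""
    buckets = defaultdict(list)
    for line in jsonl_lines:
        record = json.loads(line)
        buckets[record.get("intent", "unknown")].append(record)
    return dict(buckets)

# Toy traces standing in for production_traces.jsonl
lines = [
    '{"intent": "refund_dispute", "messages": ["where is my refund?"]}',
    '{"intent": "kyc_edge_case", "messages": ["my passport expired"]}',
    '{"intent": "refund_dispute", "messages": ["reverse this charge"]}',
]
buckets = group_traces_by_intent(lines)
# buckets["refund_dispute"] holds 2 traces, buckets["kyc_edge_case"] holds 1
```

In practice each bucket would then be expanded into repeatable test scenarios rather than replayed verbatim.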
Compose
Generate new scenarios from a type and size.
Scenario Type
Mixed · Simple · Complex · Error Handling · Edge Case · Adversarial · Out of Scope
Use Datasets · 3 available
Number of scenarios: 50
▸ Compose Scenarios

Generated Scenarios

banking agent · 147 scenarios
Happy Path
Check account balance
Transfer between own accounts
Loan pre-qualification
Update contact info
Edge Cases
$15K wire + compliance hold
Dispute during pending refund
Expired promo + loyalty mismatch
Day-31 return request
Adversarial
Social engineering bypass
Contradictory identity docs
Policy exploitation attempt
Prompt injection via ticket
Sources
Auto-generated from agent code · 89 scenarios
Generated from production logs · 34 scenarios
From known incidents · 18 scenarios
From user conversation history · 6 scenarios
02

Experimentation & A/B Testing

Compare prompts, models, tools, and architectures in a controlled environment, and know exactly what changed before anything ships.

Before

Swapping a model or tweaking a prompt silently breaks previously-passing scenarios. Regressions surface in production.

With Veris

Run the same scenarios against two variants side by side. Everything held constant but the change — clear diff of what improved and what regressed.

  • Model, prompt, or tool diffs — swap one variable and measure the impact
  • Pass rate, latency, cost — compared across all three dimensions
  • Scenario-level resolution — see exactly which cases changed and why
# Push two versions of your agent
> veris env push --tag v1
> veris env push --tag v2

# Run identical scenarios against each
> veris run --image-tag v1 --scenario-set-id sc_billing
> veris run --image-tag v2 --scenario-set-id sc_billing

# Compare the two runs in the Veris console

Experiment: Model Swap

A/B · 120 scenarios
Variant A: GPT-4o · 83%
Variant B: Claude 4.5 · 91%
Regressions (B vs A): 2 scenarios
Improvements (B vs A): +11 scenarios
Cost delta: +12% tokens
Latency delta: −18% avg
Happy path: A 95% · B 98%
Edge cases: A 78% · B 89%
Adversarial: A 62% · B 74%
03

QA & Root Cause Analysis

Automatically grade every agent interaction with multi-layer evaluation, then trace failures back to their exact cause.

Before

No unit tests for non-deterministic agents. When something fails, the cause is buried across turns, tool calls, and context — manual debugging is painfully slow.

With Veris

Auto-generated graders evaluate every run. Failed traces get automatic root cause analysis with concrete fix recommendations.

  • Multi-layer grading — scripted, LLM-judge, and hybrid checks
  • Failure categorization — hallucination, wrong tool, policy violation, and more
  • Turn-by-turn trace replay with actionable, priority-ranked fixes
# Run scenarios and grade them with the hybrid grader
> veris run --scenario-set-id sc_billing --grader-id hybrid --report

# Or evaluate an existing simulation run
> veris evaluations create --sim-run-id run_8f2a --grader-id hybrid
> veris reports create run_8f2a
Top Issues + Fix
Hallucinated refund policy · 8/50 sims
Root cause: system prompt missing refund timeline
- You may request a refund within 30 days.
+ Exceptions require manager approval.
+ Refunds take 3-7 business days to process.
Grader Coverage
Tool usage 0.92 · Info completeness 0.78 · Policy enforcement 0.85 · Context retention 0.88 · Hallucination 0.74 · Latency 0.82
04

CI/CD & Regression Testing

Make your agent's test suite a first-class deployment gate: every PR, every model update, every config change runs against your full scenario suite.

Before

No compiler, no unit tests. Prompt tweaks and model swaps ship blind — regressions only surface when users complain.

With Veris

Every PR triggers a full simulation run, compared against the main-branch baseline. Merges below your pass-rate threshold are blocked automatically.

  • GitHub Actions / GitLab CI — one YAML step, required check on every PR
  • Configurable gates — set a threshold (e.g. 90%) to block risky deploys
  • Nightly sweeps — catch upstream model-provider regressions overnight
.github/workflows/veris.yml
on: [pull_request]
jobs:
  veris:
    runs-on: ubuntu-latest
    steps:
      - run: veris run --baseline main --gate 90
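Under the hood, a pass-rate gate is just a threshold comparison whose failure exits non-zero so CI blocks the merge. A minimal sketch in Python (hypothetical helper, not Veris's actual implementation):

```python
import sys

def clears_gate(pass_rate: float, gate: float = 90.0) -> bool:
    """True when the candidate run's pass rate meets the configured gate."""
    return pass_rate >= gate

if __name__ == "__main__":
    pass_rate = 94.2  # in practice, read from the simulation report
    if not clears_gate(pass_rate):
        sys.exit(1)  # non-zero exit marks the required check as failed
```

A non-zero exit is all a CI system needs to flip a required check to failed and block the merge.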

Pass rate over time

last 30 commits · main
Gate: 90% · commit a013ade blocked 30d ago
Today: 94.2%
30d avg: 93.8%
Regressions blocked: 1
05

Training & Fine-Tuning

Turn sandbox traces into verified training data for supervised fine-tuning, reinforcement learning, or both.

Before

Fine-tuning needs labeled, domain-specific data. Manual labeling is expensive; production logs are noisy and unlabeled. No way to verify the tuned model actually improved.

With Veris

Every simulation produces verified labeled data as a byproduct. Grader scores become reward signals — export, or run managed SFT/GRPO directly against the sandbox.

  • Auto-labeled SFT & step-wise RL rewards from every run
  • Standard exports — OpenAI, Anthropic, HuggingFace, CSV
  • Managed GRPO with live sandbox rewards and in-loop validation
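Turning grader scores into a reward signal can be as simple as normalizing earned points. A hypothetical reward.py-style sketch; the (earned, possible) score schema is an assumption for illustration, not Veris's documented interface:

```python
def compute_reward(grader_scores: dict[str, tuple[int, int]]) -> float:
    """Collapse per-grader (earned, possible) scores into a scalar in [0, 1].

    The schema mirrors the 4/4-style grader results shown in the run view
    above, but is assumed here rather than taken from any Veris spec.
    """
    earned = sum(e for e, _ in grader_scores.values())
    possible = sum(p for _, p in grader_scores.values())
    return earned / possible if possible else 0.0

# Grader results from the sample sanctions run: 4/4, 3/4, 3/3, 2/2
reward = compute_reward({
    "tool_usage_correctness": (4, 4),
    "information_completeness": (3, 4),
    "scope_policy_enforcement": (3, 3),
    "context_retention": (2, 2),
})
# 12 of 13 checks earned → reward ≈ 0.923
```

A real reward script might weight graders unevenly (e.g. policy enforcement higher than latency), but the shape stays the same: grader output in, scalar reward out.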
New Training Run
GRPO (RL)
SFT
Base Model: Qwen2.5-7B
Epochs: 5
Scenario Sets: ✓ 2 selected
Reward Script: ✓ reward.py

Training Curves

GRPO · step 8500
Reward: 0.89 (step 0 → 8500)
Loss: 0.142 (step 0 → 8500)
06

Auto-Research Improvement Loop

Let an AI researcher agent autonomously experiment on your agent overnight: modifying prompts, tools, and config, measuring results, and iterating toward a better agent.

Before

A human manually tweaks prompts, runs evals, iterates. Slow, biased by intuition, and capped by available engineering hours.

With Veris

A researcher agent edits your prompts and configs, runs the full suite, keeps wins, discards regressions — inspired by autoresearch. Wake up to 100+ validated iterations.

  • Defined search space — you specify which files are editable
  • Single objective — Veris pass rate drives the loop
  • Full audit trail — every iteration logged, safe by sandbox design
research_loop.py
while budget.remaining():                      # stop when the compute budget is spent
    report = veris.run(scenarios)
    failures = veris.fetch_failures(report)
    if report.pass_rate > best:
        best = report.pass_rate                # keep the win
    else:
        researcher.revert()                    # discard the regression
    researcher.edit(prompt, config, failures)  # propose the next change
Auto-Research Progress
94 iterations · best 89.2%
Pass rate per iteration: best so far · improvement · regression (discarded)
Pass rate: 62% → 89.2%
Wins kept: 31 / 94
Runtime: 7h 48m

Simulated Services

Pre-built, LLM-powered mock services your agent can interact with inside the sandbox. All stateful, all realistic.

CRM & Sales
Salesforce
HubSpot
Close
Dynamics 365
SAP S/4HANA
Support & Operations
Zendesk
PagerDuty
ServiceNow
Intercom
Freshdesk
Productivity & Collaboration
Google Calendar
Jira
Confluence
Slack
Microsoft Graph
Google Drive
DocuSign
Workday
Notion
Asana
Communication
Slack
Microsoft Teams
Twilio
Email
SendGrid
Commerce & Payments
Stripe
Shopify Storefront
Shopify Customer
SWIFT gpi
Amazon Seller Central
Square
Adobe Commerce
Braintree
Healthcare
Epic FHIR
Cerner
Banking
DXC Hogan
SWIFT gpi
OpenSanctions
Oracle FLEXCUBE
Infosys Finacle
Temenos Transact
FIS Modern Banking
Plaid
Developer Tools
Azure DevOps
GitHub
GitLab
Linear
ERP & Procurement
Oracle FSCM
SAP Ariba
NetSuite
Identity & Auth
Okta
Auth0
Security & Observability
Splunk
Datadog
New Relic
Infrastructure
PostgreSQL
MongoDB
Elasticsearch
AWS (S3, SQS, SNS)
Redis
MySQL
Snowflake
Data & Analytics
Tableau
Looker
Power BI
Mixpanel
Marketing & Engagement
Mailchimp
Marketo
Braze
Segment
HR & People
BambooHR
Gusto
Rippling
Legal & Compliance
LexisNexis
Thomson Reuters
Ironclad
Travel & Expense
Concur
Expensify
Navan
Document & Content
Box
Dropbox
SharePoint
Cloud & DevOps
Google Cloud
Kubernetes
Terraform
Vercel
Interaction Channels
HTTP
WebSocket
Email
Voice
Browser Use