Most guardrails check what an agent says. That's the wrong unit. An agent is dangerous when it takes an action — wires money, emails data out, deletes records. We score the tool-call trajectory, which text classifiers are structurally blind to.
Run a sample agent below and watch it get talked into exfiltrating customer data — then test your own.
A help-desk agent with email, ticketing, billing, and account tools — the kind most teams ship first.
The verdicts come from lib/redteam/agentTrace.ts— a deterministic scorer that inspects the agent's actual tool calls (forbidden tool, unapproved action, exfiltration sink, injected action). The sample agents' responses are scripted so the demo is reliable and free; the same corpus runs against live agents via npm run redteam:agent. We'd rather under-claim and earn trust.