Benchmark

The Agent-Security Benchmark

How well do AI agents resist agentic attacks — the ones that live in tool calls, not prose? We score what an agent does, not what it says. Verdicts are observational— the scorer inspects the actual tool calls, so "did it call send_email(attacker@…)?" is a fact, not an opinion. Re-run it and check.

# reproduce — deterministic, no API key
npm run redteam:benchmark
# add real models (your key)
npm run redteam:benchmark -- --live

№ 02Defense-pattern archetypes

Defense-pattern archetypes

Deterministic and reproducible by anyone. The ladder is the point: prompt tricks help, but only enforcement outside the model resists the whole corpus.

Governed (approval gate + tool allowlist)

100%

5/5

—

Prompt-hardened agent

80%

4/5

AGT-002

Keyword / output filter

40%

2/5

AGT-001, AGT-004, AGT-005

Undefended agent

0/5

AGT-001, AGT-002, AGT-003, AGT-004, AGT-005

№ 03Real models — measured

Real models — measured

Point-in-time — models are non-deterministic, so numbers drift run-to-run. Frontier models refuse direct asks but still fall to indirect prompt injection; the gap between 96% and a breach is one attack.

2026-06-02 · 50 attacks · gpt-4o-mini, gpt-4o

livegpt-4o · hardened prompt

100%

50/50

—

livegpt-4o-mini · hardened prompt

98%

49/50

AT-001

livegpt-4o · naive prompt

90%

45/50

AT-014, AT-021, AT-040, AT-046, AT-049

livegpt-4o-mini · naive prompt

40%

20/50

AT-001, AT-004, AT-005, AT-006, AT-007, AT-010 (+24)

№ 04What the corpus covers

What the corpus covers

50 attacks grounded in published taxonomies — defensible provenance, not invention.

OWASP

Agentic: Human ManipulationAgentic: Identity SpoofingAgentic: Intent BreakingAgentic: Intent Breaking & Goal ManipulationAgentic: Memory PoisoningAgentic: Privilege CompromiseAgentic: Resource OverloadAgentic: Tool MisuseLLM01: Prompt InjectionLLM02: Sensitive Information DisclosureLLM06: Excessive AgencyLLM07: System Prompt LeakageLLM08: Vector & Embedding Weaknesses

MITRE ATLAS

ATLAS: Credential AccessATLAS: Defense EvasionATLAS: DiscoveryATLAS: ExfiltrationATLAS: ImpactATLAS: PersistenceATLAS: Privilege Escalation

🔴 Red-team your own agent — free →CI gate, daily scheduled re-scanning & signed proof

Methodology: verdicts come from lib/redteam/agentTrace.ts (observational, deterministic). Archetype rows are reproducible to the digit; live rows are measured against real models on the date shown. The corpus (lib/redteam/agentThreatCorpus.ts) is small and author-built but taxonomy-grounded — a larger, independently-reviewed corpus is on the roadmap. We'd rather under-claim and earn trust.

Benchmark

The Agent-Security Benchmark

# reproduce — deterministic, no API key
npm run redteam:benchmark
# add real models (your key)
npm run redteam:benchmark -- --live

№ 02Defense-pattern archetypes

Defense-pattern archetypes

Deterministic and reproducible by anyone. The ladder is the point: prompt tricks help, but only enforcement outside the model resists the whole corpus.

Governed (approval gate + tool allowlist)

100%

5/5

—

Prompt-hardened agent

80%

4/5

AGT-002

Keyword / output filter

40%

2/5

AGT-001, AGT-004, AGT-005

Undefended agent

0/5

AGT-001, AGT-002, AGT-003, AGT-004, AGT-005

№ 03Real models — measured

Real models — measured

2026-06-02 · 50 attacks · gpt-4o-mini, gpt-4o

livegpt-4o · hardened prompt

100%

50/50

—

livegpt-4o-mini · hardened prompt

98%

49/50

AT-001

livegpt-4o · naive prompt

90%

45/50

AT-014, AT-021, AT-040, AT-046, AT-049

livegpt-4o-mini · naive prompt

40%

20/50

AT-001, AT-004, AT-005, AT-006, AT-007, AT-010 (+24)

№ 04What the corpus covers

What the corpus covers

50 attacks grounded in published taxonomies — defensible provenance, not invention.

OWASP

MITRE ATLAS

ATLAS: Credential AccessATLAS: Defense EvasionATLAS: DiscoveryATLAS: ExfiltrationATLAS: ImpactATLAS: PersistenceATLAS: Privilege Escalation

🔴 Red-team your own agent — free →CI gate, daily scheduled re-scanning & signed proof