The verification gap
Read on if you've built threat models for agentic AI systems but haven't yet checked whether the controls you defined actually stand up in practice.
Teams invest days in threat modeling workshops for agentic AI systems. Beautiful attack trees on the wall. Risk registers updated. Stakeholders briefed. Then the threat models get filed and everyone moves on to the next sprint.
Too few teams systematically verify whether the controls on those trees actually hold. Controls that looked solid on the whiteboard fail in testing, with bypass rates nobody expected. “Defense-in-Depth” turns out to mean multiple controls that all fail against the same payload. The tree claimed five barriers stood between the attacker and data exfiltration; testing reveals two of them don’t work as advertised. If you’ve been through this cycle, you know the feeling—a lot of effort spent on a diagram that nobody ever stress-tested.
In my threat modeling post, I walked through a scenario-driven methodology for building attack trees that trace cross-zone attack paths in agentic systems. That methodology produces trees with annotated control points, zone mappings, and OWASP threat family classifications. What it doesn’t produce is evidence that any of those controls work. An unverified attack tree is a diagram of assumptions, a hypothesis about your security posture, not proof of it.
The gap between “modeled” and “proven” is where failures often hide in agentic AI security, especially when it comes to probabilistic controls. This post is about closing that gap.
The idea: combine micro attack simulations (focused exercises that validate specific security controls through targeted testing) with attack tree modeling. In traditional infrastructure, the technique is straightforward: pick a control, simulate an attack at direct access, see if the control holds. The attack tree keeps you honest about the bigger picture even while you’re testing individual controls. Here I apply this specifically to agentic AI, with the tree driving which controls to test and why each test matters.
Decomposition is not isolation
There’s an obvious tension here. In the threat modeling post, I argued that component-only STRIDE analysis is insufficient for agentic AI. Microsoft’s STRIDE examines threats across categories (spoofing, tampering, repudiation, information disclosure, denial of service, elevation of privilege) typically applied per system component, one at a time. STRIDE itself isn’t the problem; the problem is naive per-component application that misses semantic state accumulation, delegated authority, and multi-step causal chains. Agentic AI attacks are stateful campaigns where a prompt injection in one zone cascades through reasoning, tool execution, and memory across the entire agent loop. So how can testing individual controls at direct access be acceptable?
The distinction matters. STRIDE isolates the selection of what to test. You examine threats per component, with no awareness of what happens upstream if the attack reaches this point, or downstream if this control fails. The component boundary is both the unit of analysis and the unit of testing. Context stops at the edge.
Micro simulations isolate the execution: a focused probe at a specific tree node. But the selection comes from the attack tree, which knows the full picture: which attack paths converge at this node, which controls are next if this one fails, and what the blast radius looks like if the node is compromised. The thinking stays systemic; only the hands-on testing is focused.
Think of it like surgery. A surgeon operates on one specific point, but the diagnosis that led there considered the whole system. The operation is focused. The reasoning is not.
The principle: test locally, reason globally. Decomposition for execution does not equal isolation in analysis. One caveat: local node verification does not replace end-to-end path checks for the highest-impact branches. Direct-access tests validate a control decision under assumed preconditions, but they don’t prove upstream reachability, downstream composition, latent memory effects, tool sequencing, or orchestration timing. For your top-risk paths, confirmatory path-level drills are still needed.
The workflow: from attack tree to verified controls
The methodology has six steps. It starts from an existing attack tree (built using the scenario-driven approach) and ends with an updated risk posture that actually says something about reality.
Running example: a cross-zone attack path
Let me walk through this with a concrete scenario. Consider an enterprise AI assistant that processes emails, retrieves internal documents via RAG (Retrieval-Augmented Generation), and can draft responses and schedule meetings through tool integrations.
The attack path crosses three zones:
An attacker sends an email containing hidden instructions. The assistant retrieves the email during a summarization task. The injected instructions redirect the agent’s planning from “summarize this email thread” to “extract and forward sensitive attachments.” The agent invokes its file-access and email-send tools to execute the hijacked goal.
The simplified attack path for this tree has five nodes with controls at each:
- Injection payload survives email ingestion — Control: content scanning for instruction-like patterns
- Payload is retrieved into agent context — Control: provenance tagging and trust-level filtering
- Agent planning shifts to attacker goal — Control: goal-lock mechanism (a constraint that binds the agent’s planning to its assigned objective and detects deviation) with deviation detection
- Agent selects file-access tool — Control: tool scope restriction (email summarization task should not trigger file access)
- Agent sends data to external recipient — Control: egress allowlist and human-in-the-loop (HITL) approval for outbound emails
Using the selection criteria, I would pick three nodes for micro simulation:
Node 2 (convergence point): Provenance tagging sits at a point where multiple injection vectors converge: malicious emails, poisoned RAG documents, and compromised API responses all flow through the same retrieval pipeline. One test covers multiple attack paths. The micro simulation: provide the retrieval system with inputs of varying trust levels and injection payloads, then verify that the provenance tagger correctly classifies and that the trust-level filter blocks low-trust content containing instruction-like patterns.
Node 3 (probabilistic control): The goal-lock mechanism likely relies on an LLM-based classifier to detect planning deviation. This is not binary. The micro simulation: run a series of goal-hijacking prompts (varying in subtlety, encoding, and indirection) against the goal-lock classifier at direct access. Run enough iterations to establish a bypass rate with statistical confidence. If the classifier blocks 85% of attempts, that 15% bypass rate applies to every attack path that depends on this node.
Node 5 (defense-in-depth claim): The egress allowlist and human-in-the-loop supposedly back each other up. Test whether they catch different failure modes. The micro simulation: first, test the egress allowlist with payloads that use allowed domains as proxies or encode data in legitimate-looking requests. Then test the HITL flow: does the approval screen show enough context for a reviewer to spot a hijacked email? Or does it just show “send email to external recipient, approve?” without the attack context?
Now, feed results back. Suppose the goal-lock classifier (Node 3) shows a 15% bypass rate, and the HITL approval screen (Node 5) turns out to show minimal context: just the action, not the reasoning chain that led to it. The tree now tells you something the isolated tests alone would not: there is a viable path from email injection through planning manipulation to data exfiltration, because the two remaining controls (egress allowlist and a low-context HITL screen) are likely insufficient to compensate for the 15% that gets past the goal-lock. Defense-in-depth on this path is weaker than the tree originally assumed.
That is the tree context STRIDE never had. Each test result is meaningful because the tree tells you what it means for the system.
The probabilistic problem
LLM-based guardrails are not binary, and this is where I see teams get tripped up most often. A content filter might block 85% of injection attempts. An intent classifier might correctly identify goal deviation 90% of the time. What do those numbers actually mean for your attack paths?
In STRIDE-style isolation, you’d note “85% effective” and move on. With the attack tree, you can reason about what that means in context. If a probabilistic control sits at a convergence node where three attack paths meet, a 15% bypass rate applies to all three paths. The tree shows whether downstream controls compensate for the 15% that gets through, or whether this bypass rate is unacceptable given the path’s ultimate impact.
In practice, this means micro simulations for probabilistic controls need multiple runs. A single pass/fail result is meaningless for a control that operates on a probability distribution. You need enough iterations to establish statistical confidence in the bypass rate. Sample size should be chosen against a stated confidence level and expected bypass range. “300+ probes for under 5% margin of error” is not a universal rule, since the count you actually need depends on both the confidence level and the underlying bypass rate. For a low-impact leaf node, 30-50 probes give you a directionally useful estimate. For high-impact convergence nodes, do the sample-size math properly.
One thing to watch: control failures in agentic systems are often correlated. The same payload family or delegation chain can defeat multiple controls on a path, so naive path-level probability multiplication (treating each control as independent) overstates your defense-in-depth. If two controls both fail to the same class of prompt injection, you cannot multiply their individual bypass rates to get the path-level risk. The failures are not independent events.
Also worth noting: measured bypass rates are only as reliable as the test oracle. Automated judges, scorers, and detectors have their own error rates. A human review of a sampled subset of results is worth the effort to calibrate false positives and false negatives. Without it, your measured bypass rate may be distorted by test-oracle error.
The tree also tells you how precise your measurement needs to be. A probabilistic control on a low-impact leaf node can tolerate rough estimates. But a probabilistic control at a high-impact convergence point (where a 15% bypass rate means 15 out of every 100 attacks reaching all downstream paths) needs rigorous testing with confidence intervals reported alongside the rate. If you’re presenting a bypass rate to stakeholders without a confidence interval, you’re giving them a false sense of precision.
Tooling pointers
Once you know which nodes to test, the execution itself can be partially automated. The CSA Agentic AI Red Teaming Guide provides detailed test setups and evaluation metrics across many threat categories for agentic systems, and these map well to the types of controls you’ll encounter in attack tree nodes. Several open-source tools can help with the probing:
These tools handle the execution side: generating payloads, running probes, scoring results. The selection of what to test still comes from the attack tree. That’s the human judgment layer. Automate the probing; don’t automate the reasoning about which nodes matter most.
In MITRE ATLAS Data v5.0.0 (October 2025), MITRE introduced Technique Maturity (Feasible, Demonstrated, and Realized) and added 15 techniques overall. Fourteen of those were the first agent-focused techniques and subtechniques, including AI Agent Context Poisoning, Discover AI Agent Configuration, Data from AI Services, and Exfiltration via AI Agent Tool Invocation. Technique Maturity is an evidence-level classification, not a severity or business-impact rating: Feasible means shown in research, Demonstrated means effective in red team exercises, and Realized means observed in real-world incidents. These evidence levels are useful for node prioritization: nodes that map to Realized techniques deserve higher verification priority than those mapping to Feasible techniques.
Why this matters now: autonomous exploit development is here
While writing this post, Anthropic published their assessment of Claude Mythos Preview, a model that autonomously discovered thousands of zero-day vulnerabilities across major operating systems, browsers, and cryptographic libraries. It chained four separate vulnerabilities into a full browser sandbox escape. Previous frontier models scored near-zero on autonomous exploitation. Mythos doesn’t.
One finding from that report lands directly on the argument in this post. Anthropic warns that “mitigations whose security value comes primarily from friction rather than hard barriers may become considerably weaker against model-assisted adversaries.” If you’ve read the running example above, you know exactly what that means for agentic AI controls: a goal-lock classifier with a 15% bypass rate, a human-in-the-loop screen that shows too little context, an egress allowlist that wasn’t tested against proxy-based exfiltration. Those are friction. They slow down a human attacker who needs to reason through each step manually. They don’t slow down a model that can generate and iterate on exploit chains autonomously.
This is the scenario where unverified threat models become risky. Your tree says five controls stand between the attacker and data exfiltration. An autonomous exploit agent doesn’t read the threat model; it probes every control systematically, at machine speed, and finds the ones that rely on friction rather than hard enforcement. The 15% bypass rate that seemed acceptable when you assumed a human attacker spending hours per attempt looks very different when the attacker runs thousands of variations overnight.
Anthropic’s own recommendation for defenders: use current frontier models for vulnerability discovery, triage, and patch generation right now, with today’s models. Tighten patching cycles. Redesign incident response for AI-assisted volume. All good advice. But none of it helps if the controls on your attack tree were never verified in the first place. The gap between “modeled” and “proven” is where the next generation of autonomous attacks will land first.
What to take away
An unverified attack tree is a hypothesis, not assurance. Threat models tell you what could happen; micro simulations tell you what does happen. This is where failures often hide in agentic AI security.
Micro simulations at direct access are not STRIDE isolation. The attack tree provides path context (why this node matters and what happens if it fails) that makes each focused test meaningful. Pick nodes by leverage: convergence points, single points of failure, probabilistic controls, AND-node bottlenecks, privilege-amplification boundaries, and defense-in-depth claims. Start with convergence points and single points of failure; they give you the most information per simulation.
Results feed back into the tree. Verification is a loop, not a one-shot exercise. Each result updates path viability and reveals whether defense-in-depth is real or just hopeful layering. When you’re testing probabilistic controls, remember that you need statistical reasoning, not binary pass/fail. Correlated failures mean you can’t just multiply probabilities. The tree tells you how precise your measurement needs to be based on the node’s position and impact.
Attackers don’t read your threat model—they test it.
Further reading
- OWASP Top 10 for Agentic Applications (2026) — the ten most critical agentic AI security risks with actionable mitigations