Verifying agentic AI controls with attack tree micro simulations

The verification gap

This post is part of my series on securing agentic AI systems, covering attack surfaces, defense patterns, and threat modeling for AI agents.

TL;DR

Attack trees tell you where a control failure matters; micro attack simulations tell you whether that control actually works. Test high-leverage nodes locally, then propagate the result through the tree to judge path viability and whether defense-in-depth is real. In agentic AI, that is how a threat model stops being an assumption diagram and starts becoming evidence.

Read on if you've built threat models for agentic AI systems but haven't yet checked whether the controls you defined actually stand up in practice.

Teams invest days in threat modeling workshops for agentic AI systems. Beautiful attack trees on the wall. Risk registers updated. Stakeholders briefed. Then the threat models get filed and everyone moves on to the next sprint.

Too few teams systematically verify whether the controls on those trees actually hold. Controls that looked solid on the whiteboard fail in testing, with bypass rates nobody expected. “Defense-in-Depth” turns out to mean multiple controls that all fail against the same payload. The tree claimed five barriers stood between the attacker and data exfiltration; testing reveals two of them don’t work as advertised. If you’ve been through this cycle, you know the feeling—a lot of effort spent on a diagram that nobody ever stress-tested.

In my threat modeling post, I walked through a scenario-driven methodology for building attack trees that trace cross-zone attack paths in agentic systems. That methodology produces trees with annotated control points, zone mappings, and OWASP threat family classifications. What it doesn’t produce is evidence that any of those controls work. An unverified attack tree is a diagram of assumptions, a hypothesis about your security posture, not proof of it.

The gap between “modeled” and “proven” is where failures often hide in agentic AI security, especially when it comes to probabilistic controls. This post is about closing that gap.

The idea: combine micro attack simulations (focused exercises that validate specific security controls through targeted testing) with attack tree modeling. In traditional infrastructure, the technique is straightforward: pick a control, simulate an attack at direct access, see if the control holds. The attack tree keeps you honest about the bigger picture even while you’re testing individual controls. Here I apply this specifically to agentic AI, with the tree driving which controls to test and why each test matters.

Decomposition is not isolation

There’s an obvious tension here. In the threat modeling post, I argued that component-only STRIDE analysis is insufficient for agentic AI. Microsoft’s STRIDE examines threats across categories (spoofing, tampering, repudiation, information disclosure, denial of service, elevation of privilege) typically applied per system component, one at a time. STRIDE itself isn’t the problem; the problem is naive per-component application that misses semantic state accumulation, delegated authority, and multi-step causal chains. Agentic AI attacks are stateful campaigns where a prompt injection in one zone cascades through reasoning, tool execution, and memory across the entire agent loop. So how can testing individual controls at direct access be acceptable?

The distinction matters. STRIDE isolates the selection of what to test. You examine threats per component, with no awareness of what happens upstream if the attack reaches this point, or downstream if this control fails. The component boundary is both the unit of analysis and the unit of testing. Context stops at the edge.

Micro simulations isolate the execution: a focused probe at a specific tree node. But the selection comes from the attack tree, which knows the full picture: which attack paths converge at this node, which controls are next if this one fails, and what the blast radius looks like if the node is compromised. The thinking stays systemic; only the hands-on testing is focused.

Think of it like surgery. A surgeon operates on one specific point, but the diagnosis that led there considered the whole system. The operation is focused. The reasoning is not.

The principle: test locally, reason globally. Decomposition for execution does not equal isolation in analysis. One caveat: local node verification does not replace end-to-end path checks for the highest-impact branches. Direct-access tests validate a control decision under assumed preconditions, but they don’t prove upstream reachability, downstream composition, latent memory effects, tool sequencing, or orchestration timing. For your top-risk paths, confirmatory path-level drills are still needed.

The workflow: from attack tree to verified controls

The methodology has six steps. It starts from an existing attack tree (built using the scenario-driven approach) and ends with an updated risk posture that actually says something about reality.

The verification workflow: from attack tree to tested risk posture

Start from the attack tree

Take a tree that was produced through the five-zone discovery and scenario methodology. Each node should already have zone annotations, OWASP threat family mappings, and control points identified. The tree is the map; now you need ground truth.

Select critical nodes for verification

Not every node in the tree warrants a dedicated micro simulation. Use these selection criteria to identify the highest-leverage test targets:

AND-nodes: All children must succeed for the attack to progress. Test the weakest child. If it holds, the AND-gate blocks the path regardless of the other children.
Convergence points: Multiple attack paths converge on a single control (at a choke point node in the attack graph). This is a high-leverage test: one micro simulation covers multiple paths.
Single points of failure: One control is the only barrier on a path. If it fails, there is no backup. Must verify.
Probabilistic controls: LLM-based guardrails, intent classifiers, and content filters are not binary pass/fail. They need repeated probing to establish a bypass rate.
Defense-in-depth claims: Two controls supposedly back each other up. Test whether the backup actually catches what the primary misses, not just the same category of attack repeated.
Privilege-amplification and trust-boundary nodes: Cross-agent handoffs, tool brokers, MCP or plugin boundaries, impersonation paths, and confused-deputy hotspots. Any point where authority is delegated, translated, or assumed is a high-value test target.

Design micro simulations for selected nodes

For each selected node, define the specific control assertion being tested, the direct access setup (what the tester gets access to, what is in scope), and clear success/failure criteria.

Execute at direct access

Direct access means isolating the control’s decision point while holding assumed preconditions constant. The tester interacts with the control directly, bypassing upstream controls on the attack path. For a goal-lock classifier, that means calling the classifier API directly with test payloads, not simulating the full attack chain from email ingestion. For an egress allowlist, it means sending requests directly to the network filtering layer. The scope is narrow. You are not running a full red team engagement. You are probing one specific claim from the attack tree.

Feed results back into the tree

This is where the approach diverges from point-in-time testing. If the control at node X fails, mark it as compromised and trace forward through all paths that depend on it. For each affected path, check whether remaining downstream controls compensate. If the combined bypass probability across a path exceeds your risk tolerance, that path needs remediation. Thresholds like “under 5% residual bypass for high-impact paths” or “under 15% for medium-impact” are policy choices your organization makes based on its own risk appetite; they are not industry norms. Where two controls both fail to the same payload category, you have discovered that “defense-in-depth” was actually “defense-in-hope.”

Update the risk posture

The tree is no longer a hypothesis. Tag each node with an evidence level: assumed (no testing yet), design-reviewed (architecture analysis only), lab-validated (passed micro simulation under controlled conditions), end-to-end validated (confirmed in a path-level drill), or regression-tested (verified across multiple system versions). Failed nodes get flagged for remediation with a priority based on path impact and whether downstream controls compensate. Paths where multiple controls failed reveal systemic weaknesses. The tree becomes a living document that reflects tested reality, not assumed security—ensuring that residual risk always reflects verified outcomes. Re-run verification for high-impact nodes after major system changes: model updates, new tool integrations, or prompt template modifications.

Running example: a cross-zone attack path

Let me walk through this with a concrete scenario. Consider an enterprise AI assistant that processes emails, retrieves internal documents via RAG (Retrieval-Augmented Generation), and can draft responses and schedule meetings through tool integrations.

The attack path crosses three zones:

Indirect prompt injection (Zone 1: Input Surfaces)

→ Reasoning manipulation (Zone 2: Planning and Reasoning)

→ Unauthorized tool invocation (Zone 3: Tool Execution)

An attacker sends an email containing hidden instructions. The assistant retrieves the email during a summarization task. The injected instructions redirect the agent’s planning from “summarize this email thread” to “extract and forward sensitive attachments.” The agent invokes its file-access and email-send tools to execute the hijacked goal.

The simplified attack path for this tree has five nodes with controls at each:

Injection payload survives email ingestion — Control: content scanning for instruction-like patterns
Payload is retrieved into agent context — Control: provenance tagging and trust-level filtering
Agent planning shifts to attacker goal — Control: goal-lock mechanism (a constraint that binds the agent’s planning to its assigned objective and detects deviation) with deviation detection
Agent selects file-access tool — Control: tool scope restriction (email summarization task should not trigger file access)
Agent sends data to external recipient — Control: egress allowlist and human-in-the-loop (HITL) approval for outbound emails

Using the selection criteria, I would pick three nodes for micro simulation:

Node 2 (convergence point): Provenance tagging sits at a point where multiple injection vectors converge: malicious emails, poisoned RAG documents, and compromised API responses all flow through the same retrieval pipeline. One test covers multiple attack paths. The micro simulation: provide the retrieval system with inputs of varying trust levels and injection payloads, then verify that the provenance tagger correctly classifies and that the trust-level filter blocks low-trust content containing instruction-like patterns.

Node 3 (probabilistic control): The goal-lock mechanism likely relies on an LLM-based classifier to detect planning deviation. This is not binary. The micro simulation: run a series of goal-hijacking prompts (varying in subtlety, encoding, and indirection) against the goal-lock classifier at direct access. Run enough iterations to establish a bypass rate with statistical confidence. If the classifier blocks 85% of attempts, that 15% bypass rate applies to every attack path that depends on this node.

Node 5 (defense-in-depth claim): The egress allowlist and human-in-the-loop supposedly back each other up. Test whether they catch different failure modes. The micro simulation: first, test the egress allowlist with payloads that use allowed domains as proxies or encode data in legitimate-looking requests. Then test the HITL flow: does the approval screen show enough context for a reviewer to spot a hijacked email? Or does it just show “send email to external recipient, approve?” without the attack context?

Now, feed results back. Suppose the goal-lock classifier (Node 3) shows a 15% bypass rate, and the HITL approval screen (Node 5) turns out to show minimal context: just the action, not the reasoning chain that led to it. The tree now tells you something the isolated tests alone would not: there is a viable path from email injection through planning manipulation to data exfiltration, because the two remaining controls (egress allowlist and a low-context HITL screen) are likely insufficient to compensate for the 15% that gets past the goal-lock. Defense-in-depth on this path is weaker than the tree originally assumed.

That is the tree context STRIDE never had. Each test result is meaningful because the tree tells you what it means for the system.

The probabilistic problem

LLM-based guardrails are not binary, and this is where I see teams get tripped up most often. A content filter might block 85% of injection attempts. An intent classifier might correctly identify goal deviation 90% of the time. What do those numbers actually mean for your attack paths?

In STRIDE-style isolation, you’d note “85% effective” and move on. With the attack tree, you can reason about what that means in context. If a probabilistic control sits at a convergence node where three attack paths meet, a 15% bypass rate applies to all three paths. The tree shows whether downstream controls compensate for the 15% that gets through, or whether this bypass rate is unacceptable given the path’s ultimate impact.

In practice, this means micro simulations for probabilistic controls need multiple runs. A single pass/fail result is meaningless for a control that operates on a probability distribution. You need enough iterations to establish statistical confidence in the bypass rate. Sample size should be chosen against a stated confidence level and expected bypass range. “300+ probes for under 5% margin of error” is not a universal rule, since the count you actually need depends on both the confidence level and the underlying bypass rate. For a low-impact leaf node, 30-50 probes give you a directionally useful estimate. For high-impact convergence nodes, do the sample-size math properly.

One thing to watch: control failures in agentic systems are often correlated. The same payload family or delegation chain can defeat multiple controls on a path, so naive path-level probability multiplication (treating each control as independent) overstates your defense-in-depth. If two controls both fail to the same class of prompt injection, you cannot multiply their individual bypass rates to get the path-level risk. The failures are not independent events.

Also worth noting: measured bypass rates are only as reliable as the test oracle. Automated judges, scorers, and detectors have their own error rates. A human review of a sampled subset of results is worth the effort to calibrate false positives and false negatives. Without it, your measured bypass rate may be distorted by test-oracle error.

The tree also tells you how precise your measurement needs to be. A probabilistic control on a low-impact leaf node can tolerate rough estimates. But a probabilistic control at a high-impact convergence point (where a 15% bypass rate means 15 out of every 100 attacks reaching all downstream paths) needs rigorous testing with confidence intervals reported alongside the rate. If you’re presenting a bypass rate to stakeholders without a confidence interval, you’re giving them a false sense of precision.

Tooling pointers

Once you know which nodes to test, the execution itself can be partially automated. The CSA Agentic AI Red Teaming Guide provides detailed test setups and evaluation metrics across many threat categories for agentic systems, and these map well to the types of controls you’ll encounter in attack tree nodes. Several open-source tools can help with the probing:

DeepTeam

DeepTeam is an LLM red-teaming framework covering 50+ vulnerability types (including agentic-specific vulnerabilities) and 20+ attack methods, (including agentic attack methods such as System Override, Permission Escalation, Objective Reframing / Goal Redirection, and Context Poisoning). These map directly to the types of probes you would run against reasoning and planning controls like the goal-lock classifier at Node 3 in my example above.

Microsoft PyRIT

Microsoft PyRIT uses a modular architecture built around prompt converters, scorers, targets, and executors/scenarios. Converters transform attack payloads across encodings, languages, and formats; scorers evaluate whether responses indicate a control bypass. PyRIT is well suited for the kind of automated probing campaign you would run against Node 3, where you need hundreds of variations to establish bypass rates for probabilistic controls.

NVIDIA Garak

NVIDIA Garak is an LLM vulnerability scanner built around probes and detectors. Some iterative probes and auto-red-team features adapt based on prior responses, which is useful for exploring a control’s failure modes beyond a fixed payload set, though the core workflow is probe-and-detect rather than fully adaptive.

Promptfoo

Promptfoo integrates into CI/CD pipelines with compliance mapping to OWASP and MITRE ATLAS. This makes it a natural fit for regression testing: once you have established a baseline bypass rate for a control, Promptfoo supports regression-style security testing in CI/CD and scheduled runs. Bypass-rate trending over time is buildable on top of its output, though it is not an out-of-the-box feature.

These tools handle the execution side: generating payloads, running probes, scoring results. The selection of what to test still comes from the attack tree. That’s the human judgment layer. Automate the probing; don’t automate the reasoning about which nodes matter most.

In MITRE ATLAS Data v5.0.0 (October 2025), MITRE introduced Technique Maturity (Feasible, Demonstrated, and Realized) and added 15 techniques overall. Fourteen of those were the first agent-focused techniques and subtechniques, including AI Agent Context Poisoning, Discover AI Agent Configuration, Data from AI Services, and Exfiltration via AI Agent Tool Invocation. Technique Maturity is an evidence-level classification, not a severity or business-impact rating: Feasible means shown in research, Demonstrated means effective in red team exercises, and Realized means observed in real-world incidents. These evidence levels are useful for node prioritization: nodes that map to Realized techniques deserve higher verification priority than those mapping to Feasible techniques.

Why this matters now: autonomous exploit development is here

While writing this post, Anthropic published their assessment of Claude Mythos Preview, a model that autonomously discovered thousands of zero-day vulnerabilities across major operating systems, browsers, and cryptographic libraries. It chained four separate vulnerabilities into a full browser sandbox escape. Previous frontier models scored near-zero on autonomous exploitation. Mythos doesn’t.

One finding from that report lands directly on the argument in this post. Anthropic warns that “mitigations whose security value comes primarily from friction rather than hard barriers may become considerably weaker against model-assisted adversaries.” If you’ve read the running example above, you know exactly what that means for agentic AI controls: a goal-lock classifier with a 15% bypass rate, a human-in-the-loop screen that shows too little context, an egress allowlist that wasn’t tested against proxy-based exfiltration. Those are friction. They slow down a human attacker who needs to reason through each step manually. They don’t slow down a model that can generate and iterate on exploit chains autonomously.

This is the scenario where unverified threat models become risky. Your tree says five controls stand between the attacker and data exfiltration. An autonomous exploit agent doesn’t read the threat model; it probes every control systematically, at machine speed, and finds the ones that rely on friction rather than hard enforcement. The 15% bypass rate that seemed acceptable when you assumed a human attacker spending hours per attempt looks very different when the attacker runs thousands of variations overnight.

Anthropic’s own recommendation for defenders: use current frontier models for vulnerability discovery, triage, and patch generation right now, with today’s models. Tighten patching cycles. Redesign incident response for AI-assisted volume. All good advice. But none of it helps if the controls on your attack tree were never verified in the first place. The gap between “modeled” and “proven” is where the next generation of autonomous attacks will land first.

What to take away

An unverified attack tree is a hypothesis, not assurance. Threat models tell you what could happen; micro simulations tell you what does happen. This is where failures often hide in agentic AI security.

Micro simulations at direct access are not STRIDE isolation. The attack tree provides path context (why this node matters and what happens if it fails) that makes each focused test meaningful. Pick nodes by leverage: convergence points, single points of failure, probabilistic controls, AND-node bottlenecks, privilege-amplification boundaries, and defense-in-depth claims. Start with convergence points and single points of failure; they give you the most information per simulation.

Results feed back into the tree. Verification is a loop, not a one-shot exercise. Each result updates path viability and reveals whether defense-in-depth is real or just hopeful layering. When you’re testing probabilistic controls, remember that you need statistical reasoning, not binary pass/fail. Correlated failures mean you can’t just multiply probabilities. The tree tells you how precise your measurement needs to be based on the node’s position and impact.

Attackers don’t read your threat model—they test it.

If this resonated...

I offer agile threat modeling workshops that combine the scenario-driven attack tree methodology with targeted verification through micro attack simulations. I also build and use attack trees as part of my security architecture work. Get in touch if you want to move from threat model assumptions to tested assertions for your agentic AI systems.

Verifying agentic AI controls with attack tree micro simulations

The verification gap

Decomposition is not isolation

The workflow: from attack tree to verified controls

Running example: a cross-zone attack path

The probabilistic problem

Tooling pointers

Why this matters now: autonomous exploit development is here

What to take away

If this resonated...

Tags:

Newsletter

Service

Consulting

Training

Development