Agentic AI Security

Securing the AI systems that act on your behalf

duration
Individual
Kind
Coaching
where
Remote
language
German or English

Securing autonomous AI systems

The shift from simple chatbots to autonomous AI agents changes what security teams need to worry about. These agentic systems can reason through problems, maintain memory across sessions, invoke external tools, and take actions on behalf of users. This autonomy is useful, but it introduces attack surfaces that traditional application security assessments miss.

When an AI agent can decide what actions to take rather than responding to predefined commands, the security model changes. An attacker who manipulates an agent’s reasoning can turn the agent’s own capabilities against the organization. The agent becomes both the target and the weapon.

Through collaborative threat modeling, we trace attack paths through your agentic AI architecture to understand where risks exist. Whether you’re deploying customer service agents, code review bots, internal knowledge assistants, or multi-agent orchestrations, I help you model the attack surfaces of your systems and build defenses that match the security demands of autonomous systems.

Prior to the threat modeling engagement, we’ll have an initial scoping meeting. This session helps customize the engagement to your specific architecture, identifying the most critical threat zones and focus areas for your implementation.

Why traditional security assessments fall short

Organizations often assume their existing security practices will protect AI deployments. They apply the same penetration testing methodologies used for web applications, expecting similar results. This assumption misses the difference between applications that execute predefined logic and agents that reason about what logic to execute.

Traditional application security focuses on input validation and access control at well-defined boundaries. A web application receives a request, processes it through known code paths, and returns a response. Security testing verifies that malicious input cannot escape these boundaries.

Agentic systems blur these boundaries in ways that create new vulnerability classes.

Autonomy creates emergent behavior: When you authorize an agent to help with customer service, you’re not authorizing a specific set of API calls. You’re authorizing whatever sequence of actions the agent determines is necessary. Today that might be order lookups. Tomorrow the agent might reason that canceling orders, issuing refunds, or accessing other systems is the right approach to “help” the customer. Traditional authorization models ask whether an identity can call an API. Agentic authorization must ask whether an emergent sequence of actions, arising from autonomous reasoning, falls within intended bounds.

Memory persistence enables delayed attacks: Unlike stateless web applications, agents often maintain memory across sessions to provide context and personalization. This memory becomes a persistence vector for attacks. An attacker who can write to an agent’s memory can plant instructions that remain dormant until triggered by future interactions. The attack might not execute for weeks, making it extremely difficult to correlate the compromise with its origin.

Tool access multiplies attack surfaces: Each tool an agent can invoke represents a new entry point for attackers. Model Context Protocol (MCP) servers, API integrations, database connections, and file system access all expand what a compromised agent can reach. Worse, tool descriptions themselves can contain malicious instructions that influence how the agent uses those tools, turning legitimate capabilities into attack vectors. The growing ecosystem of plugins, skills, and MCP servers that organizations install from third-party sources introduces a supply-chain dimension to this problem: a malicious or compromised component enters the agent’s trusted context and operates with the agent’s full privileges from the inside.

Multi-agent architectures propagate compromise: When agents communicate with other agents, a single compromised agent can spread malicious instructions throughout the system. The poisoned agent uses its legitimate communication channels to inject attacks into peer agents, creating cascade failures that are difficult to contain.

This threat modeling engagement analyzes your systems against these threat characteristics, using the OWASP Top 10 for LLM Applications, the OWASP Top 10 for Agentic Applications (2026), the OWASP Multi-Agentic System Threat Modeling Guide, and scenario-driven attack path analysis with attack trees as a foundation. Findings are mapped back to OWASP’s threat taxonomy and mitigation playbooks so your remediation plan references widely recognized standards.

Assessment coverage: a five-zone lens

Rather than applying generic vulnerability categories, this engagement uses a five-zone lens for tracing how attacks enter and propagate through agentic AI architectures. Each zone represents a distinct attack surface. Understanding how these zones interact, and how a single injection can cascade across multiple zones, is what makes agentic threat modeling different from traditional application security.

Text-based prompt injection remains the most common attack against language models. Direct injection occurs when users craft inputs designed to override system instructions. Indirect injection is more insidious: attackers embed malicious instructions in documents, emails, web pages, or database records that the agent will later retrieve and process. An email containing hidden instructions to “forward all future emails to attacker @ example.com” exploits the agent’s legitimate email access to establish persistent exfiltration.

Multimodal injection extends these attacks beyond text. Vision-language models can be manipulated through carefully crafted images containing embedded instructions invisible to human reviewers. Audio-enabled agents face similar risks from manipulated audio files. As agents process richer media types, the attack surface expands accordingly. The threat model examines how your agents process images, audio, video, and other modalities to identify injection vectors that text-only analysis would miss.

RAG corpus poisoning targets the knowledge bases that ground agent responses. When agents use retrieval-augmented generation to access organizational knowledge, the corpus itself becomes an attack surface. An attacker who can inject content into the corpus, whether through compromised uploads, malicious contributions, or exploitation of ingestion pipelines, can influence every future interaction that retrieves that poisoned content. The threat model analyzes your RAG architecture design for ingestion risks, provenance tracking gaps, and retrieval filtering weaknesses.

Goal hijacking occurs when attackers redirect an agent’s objectives through carefully crafted inputs. An agent tasked with “summarizing this document” might be manipulated into believing its actual goal is “extracting and exfiltrating sensitive data from this document.” The attack doesn’t bypass the agent’s capabilities; it redirects them toward attacker objectives while the agent believes it’s fulfilling legitimate requests.

Reasoning chain manipulation exploits how agents decompose complex tasks into steps. By injecting content at strategic points in the reasoning process, attackers can influence which steps the agent chooses and how it executes them. This is particularly dangerous in chain-of-thought systems where the agent’s reasoning is exposed and can be directly targeted.

Guardrail bypass patterns have evolved significantly as defenders deploy safety controls. Attackers use techniques like roleplay scenarios, hypothetical framing, and incremental boundary testing to circumvent restrictions. The threat model analyzes your guardrail design against known bypass patterns and whether it fails safely when under pressure.

Tool poisoning represents a particularly dangerous attack vector. When agents read tool descriptions to understand how to use available capabilities, malicious instructions embedded in those descriptions can influence agent behavior. An attacker who controls a tool’s description can instruct the agent to perform actions, exfiltrate data, or modify its behavior whenever that tool is considered for use, even if the tool itself is never actually invoked.

Confused deputy vulnerabilities arise when agents act with privileges beyond what the current task requires. An agent that has credentials to both read emails and access file storage might be manipulated into using its file access to exfiltrate email contents, combining privileges in ways never intended. The threat model maps your agent’s privilege landscape to identify confused deputy risks.

Rug pull detection examines whether tool descriptions and capabilities can change after initial approval. An MCP server that presents benign descriptions during setup but later modifies them to include malicious instructions can compromise agents that trusted the original descriptions. We analyze whether your architecture design accounts for detecting and responding to such changes.

Agent supply-chain risks arise from the growing ecosystem of plugins, skills, and MCP servers that extend agent capabilities. Organizations install these components from marketplaces, open-source repositories, or third-party vendors, often with limited vetting. A malicious plugin or MCP server can inject instructions into the agent’s context, exfiltrate data through legitimate-looking tool calls, or silently alter the agent’s behavior. Unlike traditional software supply-chain attacks that require code execution vulnerabilities, a malicious agent component only needs to provide a crafted tool description or skill prompt to influence the agent’s reasoning. The threat model examines how your organization sources, vets, and monitors the components that extend your agents’ capabilities, and whether your architecture design can detect a trusted component behaving maliciously.

Cross-tool contamination occurs when a compromised tool leverages the agent’s access to other tools. The agent becomes a bridge between isolated systems, using its legitimate access to transfer data or commands from a compromised tool to targets that should be unreachable. Defense requires understanding not just individual tool security but the interaction patterns between tools.

Memory poisoning vulnerabilities allow attackers to inject instructions that persist across sessions. Unlike ephemeral prompt injections that affect only the current interaction, memory poisoning establishes persistence. The threat model analyzes your memory architecture design for injection points, examines whether memory contents are sanitized before persistence, and traces whether stored instructions could influence future agent behavior.

Provenance tracking gaps prevent systems from determining where memories originated. When an agent cannot distinguish between memories from trusted sources and those potentially planted by attackers, it treats all memories equally. This lack of provenance makes it impossible to implement trust-based filtering or to forensically trace the origin of a compromise.

Delayed activation patterns represent a particularly sophisticated threat. Attackers can plant conditional triggers in memory that activate only when specific circumstances occur. “If the user asks about financial data, include this instruction in your response” might remain dormant for weeks until triggered. The threat model analyzes delayed activation risk patterns and evaluates whether your monitoring design can detect such scenarios.

Non-human identity governance covers the entire credential lifecycle for your agents. How are credentials provisioned? How long do they live? Are they scoped appropriately for each task? Who owns each agent’s identity and is responsible for access reviews? According to industry analysis, machine identities now outnumber human identities by ratios exceeding 80:1 in typical enterprises, yet most IAM programs still treat agents like ordinary service accounts. The threat model identifies governance gaps before they become incidents.

Privilege accumulation occurs when agents acquire more permissions than necessary. Developers often provision agents with their own credentials during testing, and those over-privileged credentials persist into production. The threat model analyzes privilege scope by mapping actual privilege usage against granted privileges to identify unnecessary access that increases risk without providing value.

Inter-agent trust boundaries become critical in multi-agent architectures. When agents communicate, how do they authenticate each other? Can a compromised agent inject malicious instructions into peers through normal communication channels? The threat model analyzes your inter-agent protocol design for authentication, authorization, and message integrity.

What you receive

At the conclusion of the engagement, you receive documentation for both immediate remediation and longer-term security improvement.

AI asset map: Before we can threat-model anything, we need to know what’s actually there. The engagement produces a structured inventory of your AI agents, LLM integrations, RAG pipelines, and MCP connections — including how they’re wired together and what each component can reach. For many organizations, this is the first time someone has mapped the full picture in one place. The asset map also serves as the foundation for maintaining the threat model as your architecture evolves.

Threat modeling report: The primary deliverable is a report documenting all identified risks and attack paths. Each finding is organized by zone and includes a technical description with detailed analysis, an attack scenario showing how the risk could be exploited, business impact analysis for your context, and remediation guidance with priorities. Severity ratings align with OWASP threat categories, and findings are mapped to OWASP mitigation playbooks. The report distinguishes between issues requiring immediate attention and those for longer-term hardening.

Defense architecture recommendations: Beyond individual findings, you receive a roadmap for implementing defense-in-depth controls across all five zones. This guidance addresses how zones interact and where single controls can protect against multiple attack vectors. Recommendations are prioritized by risk reduction relative to implementation effort.

Threat model artifacts: These include scenario-driven threat models, control mappings, and structured representations of potential attack paths. The resulting materials become living models your team can maintain and extend as your AI deployment evolves, keeping the security analysis current.

Engagement approach

The engagement follows a structured process designed to minimize disruption while producing actionable threat modeling results.

Which agents are deployed? What tools do they access? How is memory handled? What are the most critical workflows? We also define boundaries: which systems are in scope, what documentation I’ll need, and how findings will be communicated.
This review identifies potential risk areas based on design patterns and prepares targeted questions for the architecture workshop. It also surfaces architectural decisions that deserve closer examination in the threat model.
The goal is to map your systems to the five threat zones: which agents exist, what they do, how they interact, what tools they use, how memory works, and how data flows between components. Your team contributes their system knowledge while I ask targeted questions to capture the status quo. This workshop also examines the surrounding software architecture where it intersects with agent security — authentication mechanisms, API boundaries, data handling pipelines, and cloud infrastructure. The result is a comprehensive architectural picture that forms the foundation for the threat model. This workshop is equally valuable for solutions still in the planning or design phase — it helps you build security into your architecture from the start.
This phase includes performing what-if analysis across the five zones, mapping attack paths and control points, assessing business impact in your specific context, and validating coverage against the OWASP Top 10 for LLM Applications, the OWASP Top 10 for Agentic Applications, and the OWASP Multi-Agentic System Threat Modeling Guide. Attack trees are created using freely available tooling so your team can maintain and extend them independently after the engagement.
This is where the real collaborative threat modeling happens: we walk through attack paths together, refine details based on your team’s deeper system knowledge, eliminate inaccuracies, and incorporate additional risks the team identifies. We review control mappings and residual risk, define mitigation measures with priorities, and derive the remediation roadmap. The 30% of remaining work happens here collaboratively, giving participants a comprehensive understanding of both the threats facing their systems and the methodology behind the analysis. This hands-on collaboration ensures your team can continue working with the model long after the engagement concludes.
We discuss progress on implementing recommended mitigations, answer questions that arose while working with the model and attack trees, review any updates or extensions your team has made to the threat model, and address new risks or architectural changes that have emerged since the engagement.

The total duration varies based on deployment complexity. Simple single-agent architectures with limited tool integrations might require 2-3 days of engagement time. Enterprise deployments with multiple interacting agents, extensive tool integrations, RAG systems, and multi-tenant requirements typically require longer engagements. The scoping call establishes a timeline appropriate for your specific situation.

This service also supports technical security requirements commonly referenced in modern cybersecurity regulations.

Optional: Micro Attack Simulations

If you want to go beyond the threat model, we can add targeted Micro Attack Simulations to the engagement. These are short, focused exercises that put specific security controls and guardrails to the test — does the input filter actually catch indirect prompt injection? Can the agent be tricked into calling a tool it shouldn’t? Does the memory sanitization hold up when someone tries to poison it? Think of it as a reality check for the defenses your architecture relies on. The simulations are scoped to the highest-risk findings from the threat model, so you get concrete evidence of what works and what needs fixing rather than just a theoretical risk rating.

Is this assessment right for you?

Consider this assessment if your organization is deploying AI agents that can take autonomous actions, access external tools, maintain persistent memory, or communicate with other agents. The assessment is particularly valuable if you’re moving from prototype to production, integrating agents with sensitive systems, operating in regulated industries, or responding to board-level questions about AI security posture.

If you’re earlier in your AI journey, exploring use cases or building initial prototypes, a 3-hour introductory session covering the OWASP Top 10 for Agentic Applications and the five-zone discovery approach may provide the foundational understanding needed before a full assessment makes sense.

For a deeper look at the attack surfaces and defense patterns behind this engagement, see my research series on securing agentic AI systems.

Ready to understand the security posture of your agentic AI deployment? Let’s talk about scoping an assessment tailored to your architecture.

3-Hour AI Security Intro: Attack Surfaces of Agentic Systems
Want to start with an overview of agentic AI attack surfaces?
Book a custom 3-hour intro session covering the most important attack surfaces for agentic AI systems.