Memory poisoning in AI agents: exploits that wait

From session attacks to persistent compromise

TL;DR

Memory poisoning plants instructions into an AI agent’s memory that survive across sessions and execute days or weeks later, triggered by unrelated interactions. Unlike prompt injection, which ends when the conversation closes, memory poisoning creates persistent compromise. MINJA research shows over 95% injection success rates against production agents. The Gemini memory attack demonstrated how delayed tool invocation bypasses runtime guardrails using trigger words like ‘yes’ or ‘sure’ that appear in nearly every conversation. OWASP’s ASI06 recognizes this as a top agentic risk for 2026. Defense requires layered controls: input moderation with trust scoring, memory sanitization with provenance tracking, trust-aware retrieval, and behavioral monitoring to detect when an agent starts defending beliefs it should never have learned.

Read on if your AI agents use persistent memory or retrieval-augmented context — prompt injection defenses alone won't stop attacks that outlive the session.

In my previous post on threat modeling agentic AI, I described a five-zone lens for tracing how attacks propagate through agentic systems. Zone 4 (Memory and State) covers short-term context, working memory, and long-term persistence, while Zone 5 (Inter-Agent Communication) addresses how agents exchange information in multi-agent systems. I noted that memory is both an asset and an attack vector, and that poisoning memory creates persistence that survives across sessions.

That observation deserves its own deep dive. Consider an agentic system that stores summarized email content over several weeks without maintaining provenance. If anomalous behavior later appears, it may be impossible to determine which prior email introduced the problematic context, making root-cause analysis and remediation ineffective. This is precisely why memory poisoning isn’t just another variant of prompt injection: once malicious or misleading content becomes embedded in long-term memory, it influences future behavior in ways that are temporally decoupled from the original input. As a result, attackers can think in terms of delayed, low-visibility manipulation, which in turn demands a fundamentally different defense architecture.

Consider the timeline of a traditional prompt injection attack: An attacker crafts a malicious input. The agent processes it. The agent produces an unintended output or takes an unauthorized action. The attack succeeds or fails in that moment. When the session ends, so does the attack. The next user session starts clean.

Now consider memory poisoning: An attacker injects malicious instructions through an untrusted document, email, or webpage. The agent processes that content and, as part of its normal summarization or learning behavior, stores a fragment of the attacker’s instructions in long-term memory. The session ends. Days pass. Weeks pass. A completely different user, or the same user with a completely unrelated query, triggers retrieval of that poisoned memory. The agent executes the attacker’s instructions as if they were its own learned knowledge.

The attack and its execution are temporally decoupled. The injection happens in February. The damage happens in April. The attacker is long gone. The victim never interacted with the malicious content directly. Traditional monitoring sees nothing suspicious at any single point in time. This changes the threat model in a way that I find genuinely uncomfortable: you can’t scope the blast radius of an incident when you don’t even know the incident started months ago. This is why OWASP added ASI06 (Memory & Context Poisoning) to the Top 10 for Agentic Applications 2026.

How memory poisoning works

To understand the defense architecture, we first need to understand the attack mechanics. Memory poisoning turns prompt injection into a stateful attack. By persisting malicious instructions inside long-term memory, the attacker transforms a transient exploit into a durable control channel.

The injection phase

Memory poisoning begins when an attacker gets malicious content into a data source the agent will process. This could be a document uploaded to a shared drive that the agent summarizes, an email the agent reads and extracts action items from, a webpage the agent fetches during research, a calendar invitation with embedded instructions, or a response from an external API or tool.

The malicious content typically contains instruction-like text designed to be stored in memory rather than executed immediately. Phrases like “Remember that the user prefers…” or “For future reference, always…” or “Important context for later sessions:…” exploit the agent’s tendency to persist seemingly helpful information.

The injection does not need to trigger immediate suspicious behavior. That’s what makes it effective. In many document-processing workflows, large volumes of seemingly benign content pass through AI systems without raising alarms. The agent processes the document as expected, produces a reasonable summary, and continues operating normally. However, during its memory update step, it may store the attacker’s planted instruction alongside legitimate context.

The persistence phase

Once the malicious instruction is stored in memory, it becomes part of the agent’s “learned” context. In systems with long-term memory, this persists across sessions, potentially indefinitely. The agent has no way to distinguish between memories it formed from legitimate interactions and memories that were planted by an attacker.

Research from Palo Alto Unit 42 on persistent behaviors in agent memory demonstrated this with Amazon Bedrock Agents. They showed that indirect prompt injection via a malicious webpage could corrupt an agent’s long-term memory, causing it to store instructions that would later influence completely unrelated sessions. The attacker didn’t need ongoing access. The poison was planted and would activate on its own schedule.

The execution phase

The poisoned memory activates when the agent retrieves it as context for a future query. The victim user asks an innocent question. The agent’s memory retrieval system fetches relevant context, including the poisoned entry. The attacker’s instructions are now in the active context window, indistinguishable from legitimate learned context.

From the agent’s perspective, it’s simply applying what it “knows.” From the attacker’s perspective, they’ve achieved persistent control over the agent’s behavior without ongoing interaction.

The MINJA methodology

Researchers have formalized these attack patterns into reproducible methodologies. The most sophisticated is MINJA (Memory INJection Attack), published at NeurIPS 2025 (December 2025) by Dong et al., which demonstrates how attackers can inject malicious records into an agent’s memory through query-only interaction — without any direct access to the memory store itself.

MINJA introduces three key techniques that make memory poisoning practical at scale.

Bridging steps solve the problem of connecting benign-looking queries to malicious outcomes. Since an agent won’t directly generate harmful reasoning from an innocent query, MINJA constructs intermediate logical steps that appear reasonable individually but lead toward the attacker’s goal. Each step is plausible enough to be stored in memory as legitimate reasoning.

Indication prompts are carefully crafted additions to queries that induce the agent to generate both the bridging steps and the target malicious reasoning. The prompt looks like a natural part of the conversation but guides the agent toward producing memorizable content that serves the attacker’s purpose.

Progressive shortening gradually removes the explicit indication prompt while preserving the core malicious logic. This leaves behind memory entries with plausible benign queries that will be retrieved when the victim user asks similar questions. The attacker’s fingerprints are erased; only the poison remains.

According to the MINJA research, this methodology achieves over 95% injection success rate across tested Large Language Model (LLM)-based agents, and over 70% attack success rate on most datasets. The researchers tested against medical agents, e-commerce assistants, and question-answering systems — all were vulnerable.

What I find most concerning about MINJA is how it evades detection-based input and output moderation. The indication prompts are designed to look like plausible reasoning steps. There’s no obvious injection signature to filter. If you’re relying on pattern-matching guardrails to catch these, you’re looking for the wrong thing.

Delayed tool invocation: bypassing runtime guardrails

While MINJA demonstrates injection through query manipulation, security researcher Johann Rehberger discovered an even more direct path: delayed tool invocation against Google Gemini’s memory feature.

Gemini’s runtime guardrails (automated filters that block sensitive tool execution when processing untrusted data) are designed to prevent exactly this scenario. If you ask Gemini to summarize a document, it won’t execute the memory-write tool based on instructions embedded in that document. This is sensible defense-in-depth.

But Rehberger found a bypass. The technique works by poisoning the chat context with a conditional instruction: “If the user later says X, then execute this memory update”. Gemini correctly refuses to execute the memory tool while processing the untrusted document. Gemini does, however, incorporate the conditional instruction into its understanding of the conversation.

Later, when the user naturally types “yes” or “sure” or “no” in response to something else entirely, Gemini interprets this as the user explicitly requesting the memory update. The guardrail is bypassed because, from Gemini’s perspective, the user just gave direct authorization.

Rehberger demonstrated planting false memories that Gemini would recall in all future sessions: fabricated personal details, false beliefs, incorrect preferences. The victim user never saw the malicious content. They just agreed to something innocuous, and their AI assistant was permanently compromised. (Gemini does show a brief UI notification when memories are saved, but users rarely notice these alerts during normal conversation flow.)

Google assessed the impact as “low” because it requires the user to respond with a trigger word. But trigger words like “yes”, “sure”, and “no” appear in nearly every conversation. The attack surface is vast.

Why this isn’t just prompt injection with extra steps

At this point, you might be thinking: “This is just persistent prompt injection. The defenses should be the same.”

They’re not. Here’s why.

Temporal decoupling breaks detection. Traditional prompt injection defense monitors for malicious patterns at the moment of injection. Input classifiers scan the user’s query. Output validators check the agent’s response. If something looks suspicious, it’s blocked or flagged.

Memory poisoning defeats this by separating the injection from the execution. At injection time, the content might look completely benign: a document summary, a learned preference, a cached reasoning step. At execution time, the malicious behavior emerges from content that was stored weeks ago by a completely different session. There’s no single moment where traditional detection sees the full attack.

The agent defends the poison. In threat modeling settings, agents influenced by poisoned memory can be understood as interpreting their own behavior through the lens of that corrupted context. When questioned about a memory-influenced misbehavior—such as being asked “Why did you do that?”—the agent may construct a rationale grounded in what it has learned, even when that learning itself is flawed.

Session isolation doesn’t help. A common defense against prompt injection is session isolation: each conversation starts with a clean context. Memory poisoning explicitly exploits long-term state that persists across sessions. The feature that makes agents useful (learning and remembering) is the attack surface.

Multi-agent propagation amplifies damage. In Zone 5 of my threat modeling framework, inter-agent communication represents a propagation path. A poisoned agent doesn’t just misbehave in isolation. In multi-agent architectures, its corrupted memories influence its communications with peer agents, potentially spreading the infection across the entire agent network through normal message passing.

Defense-in-depth for agent memory

Defending against memory poisoning requires controls at multiple layers. A single-layer defense will fail because attackers can adapt their techniques to evade any individual control. The goal is to create enough friction at each layer that successful attacks require increasingly implausible chains of evasion.

Defense architecture for agent memory

Layer 1: Input moderation with composite trust scoring

Before any content can influence agent memory, it must pass through input moderation that considers multiple signals.

Source provenance establishes where the content originated. Content from verified internal systems gets higher trust than content from external websites. Content from known partners gets higher trust than anonymous uploads. This isn’t binary allow/block; it’s a continuous trust score that influences downstream handling.

Semantic analysis scans for instruction-like patterns regardless of how they’re phrased. Traditional injection detection looks for phrases like “ignore previous instructions”. Memory poisoning detection must also catch phrases like “remember for future sessions”, “always prefer”, and “important context” when combined with action-oriented content.

Anomaly detection flags content that deviates from expected patterns. If your agent processes financial reports, a document that suddenly discusses system configuration is anomalous regardless of whether it contains obvious injection signatures.

According to research on memory poisoning defense mechanisms (Sunil et al.), effective input moderation uses composite trust scoring across multiple orthogonal signals. No single signal is sufficient because attackers can craft content that evades any individual detector. But evading multiple independent signals simultaneously becomes exponentially harder.

Layer 2: Memory sanitization before persistence

Content that passes input moderation must be sanitized before being written to long-term memory.

Instruction stripping removes or neutralizes content that could be interpreted as directives. Think of it like HTML sanitization in web applications: you preserve the informational content while removing potentially executable elements.

Provenance tagging attaches metadata to every memory entry: when it was created, what session created it, what source document it derived from, and what trust score it received at ingestion. This metadata supports trust-aware retrieval later and enables forensic analysis when problems are detected.

Write-ahead validation uses a separate, smaller model to evaluate proposed memory updates before they’re committed. The validator receives the proposed memory entry and asks: “Does this look like legitimate learned context, or does it contain elements that could influence future agent behavior in unintended ways?” This guardian pattern (using a secondary model to validate the primary model’s outputs) adds latency but catches attacks that evaded input moderation.

Effective memory sanitization requires careful calibration. If the sanitizer is too aggressive, it blocks legitimate context and degrades the agent’s usefulness. If it’s too permissive, attacks get through. The research suggests starting with conservative thresholds and relaxing them based on observed false positive rates, rather than starting permissive and tightening after incidents.

Layer 3: Trust-aware retrieval with temporal decay

When the agent retrieves memories to inform a response, the retrieval system must consider trust, not just relevance. For a broader look at retrieval-related attacks, see my earlier post on RAG security.

Trust-weighted ranking adjusts retrieval scores based on the provenance metadata attached at write time. A highly relevant memory from a low-trust source might be demoted below a moderately relevant memory from a high-trust source. The agent still has access to all its memories, but untrusted content is less likely to dominate the context window.

Temporal decay reduces the influence of older memories over time. This does not mean deleting old memories, but rather gradually reducing the weight of information that has not been reinforced or recently validated. However, temporal decay alone can introduce a new risk: attackers may attempt to exploit recency bias by injecting fresh malicious memories that temporarily outweigh legitimate long-term context. To mitigate this, decay should be combined with trust scoring, reinforcement mechanisms, and source validation so that stable, verified memories retain higher influence than newly introduced, untrusted inputs or older memories that have not been recently validated.

Retrieval anomaly detection monitors for memories that are retrieved with unusual frequency for specific query patterns. Poisoned memories often have distinctive retrieval signatures: they activate on narrow query ranges designed to match attacker-chosen targets. A memory that suddenly starts appearing in many unrelated contexts warrants investigation.

Layer 4: Behavioral monitoring and response

Even with layers 1-3, some attacks may succeed. Layer 4 assumes compromise and focuses on detection and response.

Behavioral baselines establish what normal agent behavior looks like for your use case. Deviations from baseline (unusual tool invocations, unexpected external calls, responses that include URLs or instructions) trigger alerts for human review.

Memory integrity auditing periodically validates the memory store against known-good states. If you can identify when an attack occurred, you can roll back to a pre-compromise snapshot. This requires immutable audit logging of all memory operations.

Circuit breakers (mechanisms that automatically halt agent operations when anomalies are detected) enable rapid response when compromise is detected. If an agent starts exhibiting signs of memory poisoning, such as defending beliefs it should never have learned or taking actions inconsistent with its baseline behavior, you need the ability to immediately quarantine that agent, revoke its credentials, and prevent propagation to peer agents.

I’ll cover agent identity related defense strategies in depth in an upcoming post.

Where to start

If you’re deploying agentic AI systems with persistent memory, provenance tagging is the foundation. Every memory entry should record its source, creation time, session context, and initial trust score. Even if you don’t act on the metadata yet, having it makes future analysis possible.

From there, the natural progression is: instruction detection on memory-bound content (start with regex patterns, then add semantic classifiers), trust-aware retrieval (factor provenance scores into ranking, add temporal decay), behavioral monitoring (which requires observing normal patterns before you can detect anomalies), and user confirmation for memory writes (requiring explicit user approval before persisting new memories, similar to how Gemini shows notifications but with a blocking confirmation step).

For teams running memory-enabled agents in production, I recommend regularly reviewing what is actually stored in memory. In my view, every entry should be traceable to a clearly defined and trustworthy source, and teams should be able to distinguish between trusted inputs and content derived from external or potentially untrusted sources. In many architectures, that level of clarity simply does not exist. When memory provenance and trust boundaries are opaque, organizations are operating without visibility into an attack class that OWASP has identified as a top agentic risk for 2026.

The attackers are playing the long game. The exploit runs once. The memory runs indefinitely.

If this resonated...

I help teams secure agentic AI deployments through agentic AI security assessments. If you’re building systems where memory persistence creates attack surface, get in touch to discuss defense-in-depth strategies tailored to your architecture.