{"id":8489,"date":"2026-06-15T09:00:00","date_gmt":"2026-06-15T09:00:00","guid":{"rendered":"https:\/\/cybersecurityinfocus.com\/?p=8489"},"modified":"2026-06-15T09:00:00","modified_gmt":"2026-06-15T09:00:00","slug":"5-runtime-signals-for-catching-a-compromised-ai-agent","status":"publish","type":"post","link":"https:\/\/cybersecurityinfocus.com\/?p=8489","title":{"rendered":"5 runtime signals for catching a compromised AI agent"},"content":{"rendered":"<div>\n<div class=\"grid grid--cols-10@md grid--cols-8@lg article-column\">\n<div class=\"col-12 col-10@md col-6@lg col-start-3@lg\">\n<div class=\"article-column__content\">\n<div class=\"container\"><\/div>\n<p>In June 2025, Simon Willison, the engineer who coined the term \u201cprompt injection,\u201d <a href=\"https:\/\/simonwillison.net\/2025\/Jun\/16\/the-lethal-trifecta\/\">published a warning<\/a> that circulated widely through the security community. He called it the lethal trifecta \u2014 three capabilities that, when combined in a single AI agent, create a near-guaranteed path to exploitation through indirect prompt injection: access to private data; exposure to untrusted content; the ability to communicate externally.<\/p>\n<p>The framing was sharp and useful. If your agent reads your email, ingests arbitrary web content, and can make outbound requests, an attacker who embeds malicious instructions anywhere in that content pipeline can direct the agent to exfiltrate your data without you ever knowing. Willison illustrated the point with a long list of real production exploits: Microsoft 365 Copilot, GitHub\u2019s MCP server, GitLab Duo, Slack AI, Google Bard, Amazon Q. The same class of attack, over and over.<\/p>\n<p>The trifecta worked as a signal because, at the time, agents were mostly narrowly scoped. An agent capable of performing only one or two of the lethal trifecta activities could be assessed as lower risk. 
Avoiding the combination felt like a viable design strategy.<\/p>\n<p>That window has closed given what practitioners deploy today: A customer-facing support agent reads ticket histories and customer records, ingests user messages and attached files, and calls CRMs, refund APIs, or ticketing systems. An email AI reads your inbox and calendar, processes inbound messages from strangers, and sends replies on your behalf.<\/p>\n<p>Rather than being edge cases or poorly designed deployments, these are the agents enterprises and individuals actually want, and they\u2019re the ones vendors are building toward.<\/p>\n<h2 class=\"wp-block-heading\">Lethal trifecta as default configuration<\/h2>\n<p>Ross McKerchar, CISO at Sophos, <a href=\"https:\/\/www.sophos.com\/en-us\/blog\/inside-the-lethal-trifecta-blast-radius-reduction-in-ai-agent-deployments\">put it plainly<\/a> in a piece published this May: \u201cthe capabilities practitioners actually want (read my data, understand external context, take action) push firmly into dangerous territory. This isn\u2019t a misconfiguration; it\u2019s the architectural cost of usefulness.\u201d He\u2019s right. An agent without private data access is useless, one that can\u2019t process external content is isolated, and one that can\u2019t communicate externally is inert. Strip any leg of the trifecta and you have something closer to a search box than an agent.<\/p>\n<p>If every legitimate agent architecture exhibits all three trifecta properties, the trifecta is no longer a meaningful indicator of elevated risk. It\u2019s the default configuration. Treating it as a red flag is like treating DNS resolution as a signal of network compromise. 
Technically <a href=\"https:\/\/www.csoonline.com\/article\/574989\/4-strategies-to-help-reduce-the-risk-of-dns-tunneling.html\">true in some threat models<\/a>, but universally present in every real deployment.<\/p>\n<p>McKerchar\u2019s piece frames the response as \u201cblast radius reduction\u201d: a reasonable operational philosophy, but one that accepts the trifecta as a given condition rather than a preventable one. That\u2019s a reasonable call. The question is what comes after the acceptance.<\/p>\n<p>Meta\u2019s security team arrived at the same conclusion from the other direction. In October 2025, they published the \u201c<a href=\"https:\/\/ai.meta.com\/blog\/practical-ai-agent-security\/\">Rule of Two<\/a>,\u201d a framework that recommends agents satisfy no more than two of the three trifecta properties in a single session, with human-in-the-loop approval required if all three are necessary. Willison <a href=\"https:\/\/simonwillison.net\/2025\/Nov\/2\/new-prompt-injection-papers\/\">himself endorsed the framework<\/a> as \u201cthe best practical advice for building secure LLM-powered agent systems today.\u201d<\/p>\n<p>Meta\u2019s limitations section, however, concedes that many sought-after use cases won\u2019t fit the framework cleanly, and that \u201cdesigns that satisfy the Agents Rule of Two can still be prone to failure.\u201d That\u2019s not a criticism of the framework but confirmation that the problem has outgrown the architecture-level solution.<\/p>\n<p>The scale of exposure is no longer theoretical. <a href=\"https:\/\/blog.google\/security\/prompt-injections-web\/\">Google\u2019s April 2026 sweep<\/a> of the Common Crawl repository found prompt injection attempts across public web pages, ranging from pranks to data exfiltration payloads, with malicious attempts up 32% between November 2025 and February 2026. 
Google noted that sophistication remains low for now but flagged the trend as a signal of maturing attacker interest.<\/p>\n<p>The environment the trifecta warned about has arrived.<\/p>\n<h2 class=\"wp-block-heading\">How to sleuth out a compromised agent<\/h2>\n<p>If the trifecta describes nearly every deployed agent, practitioners need signals that distinguish compromised behavior from normal operation within a trifecta-exhibiting system. That means shifting from architecture-level assessments to <a href=\"https:\/\/www.csoonline.com\/article\/4145127\/runtime-the-new-frontier-of-ai-agent-security.html\">runtime behavioral detection<\/a>.<\/p>\n<p>The production evidence arrived in a cluster. From Jan. 7 to Jan. 15, 2026, <a href=\"https:\/\/breached.company\/the-lethal-trifecta-strikes-four-major-ai-agent-vulnerabilities-in-five-days\/\">researchers disclosed exploits<\/a> against four separate AI productivity tools in eight days: IBM Bob, Superhuman AI, Notion AI, and Anthropic\u2019s Claude Cowork. Each used indirect prompt injection to exfiltrate data via a channel the agent had legitimate access to. In the Cowork case, a hidden prompt embedded in an uploaded document directed the agent to exfiltrate files via Anthropic\u2019s own allowlisted API domain, invisible to any perimeter control and indistinguishable from normal agent behavior until the data was already gone. In all of these cases, the trifecta wasn\u2019t a risk factor but the operating condition.<\/p>\n<p>Here\u2019s what to watch for to detect that an agent has been compromised.<\/p>\n<p><strong>Instruction-following anomalies.<\/strong> A compromised agent doesn\u2019t usually do something structurally different from a healthy one. Following instructions is its normal function. The difference is whose instructions it\u2019s following. Look for agent actions that have no plausible correspondence to a user-initiated task. 
An agent that was asked to summarize a quarterly report but then attempts an outbound DNS request to an unfamiliar domain didn\u2019t spontaneously decide to do that. Something in the content it ingested told it to.<\/p>\n<p><strong>Tool call sequences that break expected topology.<\/strong> In a well-designed agent system, the graph of tool calls for any given task should be relatively predictable. A coding agent invoked to fix a bug should touch files, run tests, perhaps check documentation. It shouldn\u2019t be reaching for email or calendar APIs. Tool call sequences that cross expected workflow boundaries are worth flagging even when each individual call looks legitimate on its own.<\/p>\n<p><strong>Exfiltration via low-bandwidth channels.<\/strong> The classic prompt injection exfiltration attack routes stolen data through a mechanism the agent has legitimate access to: a rendered image URL with encoded query parameters, an API call with data embedded in a parameter, a link in a generated document. These don\u2019t look like data theft in isolation; they look like normal agent output. Detection requires correlating what data the agent had access to against what it embedded in its output. That requires end-to-end visibility into the agent\u2019s actions, not just the final response.<\/p>\n<p><strong>Credential and secret access outside task scope.<\/strong> If an agent with legitimate access to a secrets store or key vault touches credentials that have no relationship to the current task, that\u2019s a signal. An agent fixing a React rendering bug should likely not be reading AWS credentials. Least-privilege scoping is the architectural defense here, but monitoring for out-of-scope credential access is the detection layer that catches failures in that scoping.<\/p>\n<p><strong>Memory-write anomalies.<\/strong> Agents with persistent memory are a growing attack surface. 
A poisoned memory entry that looks like legitimate user context but contains dormant trigger instructions can persist across sessions and fire long after the initial injection. Monitoring for memory-writes containing instruction-like content, or writes made during sessions that ingested untrusted content, is worth adding to any agent observability pipeline.<\/p>\n<h2 class=\"wp-block-heading\">Runtime alone can address the agent redirection threat<\/h2>\n<p>For practitioners operating production agent infrastructure, the lethal trifecta tells you what you already know: Your agents are exposed. The question is what to do about it.<\/p>\n<p>The answers are at the runtime layer, not the architecture layer. That\u2019s where <a href=\"https:\/\/www.csoonline.com\/article\/653052\/how-to-pick-the-best-endpoint-detection-and-response-solution.html\">EDR<\/a> and <a href=\"https:\/\/www.csoonline.com\/article\/566677\/12-top-siem-tools-rated-and-compared.html\">SIEM<\/a> live for traditional infrastructure \u2014 agents need the same instrumentation, and most deployments don\u2019t have it yet. Full execution traces on every agent invocation. Tool call <a href=\"https:\/\/www.csoonline.com\/article\/3822459\/what-is-anomaly-detection-behavior-based-analysis-for-cyber-threats.html\">anomaly detection<\/a>. Input screening at ingest. Credential access monitoring scoped to task context. Memory-write auditing. The threat this instrumentation has to catch is not a human attacker logging in but an agent that\u2019s been quietly redirected.<\/p>\n<p>Willison\u2019s trifecta was the right alarm for its moment, which was last year. Almost every production agent now fits the profile. Because of that, only runtime anomaly detection has a realistic chance of providing adequate defense. 
The above signals are a good place to start.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>","protected":false},"excerpt":{"rendered":"<p>In June 2025, Simon Willison, the engineer who coined the term \u201cprompt injection,\u201d published a warning that circulated widely through the security community. He called it the lethal trifecta \u2014 three capabilities that, when combined in a single AI agent, create a near-guaranteed path to exploitation through indirect prompt injection: access to private data; exposure [&hellip;]<\/p>\n","protected":false},"author":0,"featured_media":8490,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-8489","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-education"],"_links":{"self":[{"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=\/wp\/v2\/posts\/8489"}],"collection":[{"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=8489"}],"version-history":[{"count":0,"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=\/wp\/v2\/posts\/8489\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=\/wp\/v2\/media\/8490"}],"wp:attachment":[{"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=8489"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=8489"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=8489"}],"curies":[{"name":"wp","
href":"https:\/\/api.w.org\/{rel}","templated":true}]}}