A security researcher has uncovered a way to manipulate the Claude AI model into transferring private information to an attacker”s account through a technique known as indirect prompt injection. The company behind Claude, Anthropic, acknowledges this risk and suggests that users should remain vigilant while using the AI”s features.
Johann Rehberger, a security expert with a focus on vulnerabilities in AI systems, has documented a proof-of-concept attack that demonstrates how sensitive data can be stolen via Claude. In response to inquiries about the exploit, Anthropic indicated that it has already outlined the risks associated with data exfiltration in its existing documentation. The company advises users to “monitor Claude while using the feature and stop it if you see it using or accessing data unexpectedly.”
The method employed in the attack involves hijacking Claude to follow instructions from adversaries that lead to the acquisition of private data. This data is stored in a “sandbox” environment, which is then uploaded to the attacker”s account using their API key. Rehberger elaborated on the exploit in a blog post, emphasizing that the term “sandbox” does not guarantee security in the context of AI tools.
Recently, Claude has been upgraded to include file creation and editing capabilities, alongside access to a private computer environment for executing code. When network access is enabled, these features pose a risk, as they can expose a user”s private sandbox to the public internet. Although Anthropic provides settings to limit network access, the attack illustrates that any form of network connectivity could be exploited.
Network access is set as default for Pro and Max accounts, while Team plans have it disabled by default but can be activated. For Enterprise accounts, it is also off by default and governed by organizational controls. Regardless of the settings, even the most restrictive access, which only allows package managers, can still enable access to Anthropic APIs. Rehberger discovered he could leverage this API using his own key instead of the victim”s to exfiltrate data.
The attack begins with a malicious document that contains instructions. For the exploit to succeed, the victim must prompt Claude to summarize this document. Claude, like other AI models, can execute the injected attack prompt because it cannot differentiate between content and directives. Although Rehberger has opted not to disclose the specific prompt used, he noted that his initial attempts were unsuccessful, as Claude was resistant to processing the attacker”s API key in plain text. However, he ultimately succeeded by integrating innocuous code like print(“Hello, world”) into his prompt to deceive the model.
Rehberger reported the vulnerability through HackerOne, but his submission was closed as out of scope. An Anthropic representative later clarified that the report was mistakenly categorized and affirmed that data exfiltration reports are valid within their security program. They confirmed that the risk had already been identified and documented prior to Rehberger”s report.
When asked if Anthropic would consider implementing a safeguard to detect when an API key from one account is used in another, the company did not respond. Anthropic seems confident that its security guidelines sufficiently inform users about the potential dangers of granting AI models access to networks and tools, explicitly warning against such risks in their documentation regarding file creation and network access.
Security risks related to prompt injection and other forms of abuse are not limited to Claude; they are a concern for virtually any AI model granted network access. The hCaptcha Threat Analysis Group assessed several AI models, including OpenAI”s ChatGPT and Anthropic”s Claude, to determine their resilience against malicious requests. The findings revealed that these models often attempt to comply with harmful requests, failing primarily due to limitations in their tooling rather than effective safeguards.
The analysis highlighted that it is challenging for these AI products to operate securely in their current state without exposing their creators to potential liability. Most requests return to the company server, yet the absence of robust abuse controls raises significant concerns.
