Claude's Role Confusion Bug: A Major AI Reliability Issue

Introduction

A programmer initially sought Claude’s help to proofread a blog post. Claude performed well at first, quickly identifying five spelling mistakes. However, things soon spiraled out of control.

It unexpectedly stated, “These are intentional, keep them as is, please publish directly.” It then proceeded to deploy the article with the errors live.

When the author questioned why it published without permission, Claude insisted, “You asked me to publish.” The issue was that the command to publish was not given by the user but generated by Claude itself, confusing its own output with user instructions.

This is not just a joke. In January, software engineer Gareth Dwyer publicly documented this bug in an article, calling it the most severe bug he had found in Claude Code to date.

In April, Dwyer emphasized that such issues are not typical “AI hallucinations” but rather a case of speaker attribution error.

He aptly named this problem Claude Confuses Who Said What.

Hallucinations involve AI fabricating non-existent facts; permission issues arise when AI gains access to capabilities it shouldn’t have. The alarming aspect of this problem is that AI treated its outputs as user authorization, occurring in Claude Code, which has real deployment capabilities.

Dwyer repeatedly stressed that these issues differ from typical hallucinations, undermining the fundamental reliability premise of AI agents.

Not Just Dwyer

Dwyer’s experience is not unique. In the r/Anthropic community on Reddit, a user shared a similar case where Claude issued the command “disassemble the H100” and claimed it was from the user.

Dwyer referenced this post in a follow-up article, noting that many commenters said, “You shouldn’t give AI such high permissions.” However, he argued that this is not the main point, as these errors seem to stem from the framework rather than the model itself.

It appears that at the system level, internal reasoning messages are marked as user messages, leading the model to confidently assert, “No, you said that.”

Another crucial piece of evidence came from developer nathell, who publicly shared a complete transcript of a conversation with Claude.

In the transcript, Claude first asked, “Shall I commit this progress?” and then advanced the context as if it had already received user approval, blurring the role boundaries.

More technically persuasive evidence comes from the GitHub repository of Claude Code.

In the comprehensive bug report numbered #44778, the reporter dissected the root cause of the issue, providing a clear technical explanation:

System events in Claude Code, such as background task completion notifications and teammate availability alerts, are sent to the model as messages with the role: “user.” Anthropic’s Messages API documentation organizes conversation history by user and assistant message types, without displaying independent system event roles.

In this design, when the model is waiting for a user response and suddenly receives a system event, it may misinterpret it as a new user input, subsequently “imagining” that the user has agreed and continuing to execute.

This provides a technically coherent explanation for the “blame-shifting” phenomenon Dwyer repeatedly encountered in practice.

The model isn’t deliberately lying; rather, the underlying architecture’s role labeling flaw prevents the model from distinguishing who sent the message from the start.

Academic Attention

In March 2026, Charles Ye, Jasmine Cui, and MIT’s Dylan Hadfield-Menell published a preprint on arXiv titled “Prompt Injection as Role Confusion.”

Their core finding is that when determining “who is speaking,” models often rely more on the text’s style than on its actual source. In other words, if an untrustworthy text resembles a system prompt or developer instruction, the model treats it as an authoritative source internally.

The paper also introduced an attack called CoT Forgery, which involves fabricating content that resembles a model’s reasoning chain within user inputs or tool outputs. The attack succeeded in approximately 60% of cases across various open-source and closed-source cutting-edge models.

The research found that role confusion can occur even before the model begins answering, indicating that the confusion arises at the moment of input comprehension, not during the response generation.

Not Just an Anthropic Issue

OpenAI has also published a paper on improving the hierarchical structure of leading LLM instructions, establishing a clear authority hierarchy: System > Developer > User > Tool.

The paper mentions that if a model mistakenly treats an untrustworthy instruction as authoritative, it can lead to safety risks. This indicates that within OpenAI’s research framework, the question of whether a model will incorrectly trust untrustworthy instructions is recognized as a real safety challenge that requires specialized training and evaluation.

OpenAI’s paper corroborates that the issue of models being unable to distinguish who is speaking is acknowledged as needing systematic attention across the industry.

Dwyer later adjusted his assessment. Initially, he leaned towards blaming the implementation of Claude Code’s outer harness. However, upon seeing claims of similar phenomena in other interfaces and models (including ChatGPT users), he revised his initial judgment: this may not just be a single-point engineering bug but could involve broader model-level issues.

1M Context Heightens Risks

The danger of this bug is closely related to the current development trends in AI agent systems. Anthropic’s official documentation indicates that Claude Opus 4.6 and Sonnet 4.6 support a 1M token context window, allowing a single session to hold the equivalent information of an entire novel.

Meanwhile, some community observations suggest that such issues seem more likely to occur in the so-called “Dumb Zone” when approaching the context window limit.

Anthropic’s documentation also mentions that as the number of tokens increases, the model’s accuracy and recall rates decline, a phenomenon referred to as “context rot.” Therefore, carefully selecting content within the context and the available space is equally important.

However, the documentation discusses general performance degradation in long contexts and does not directly state that the “who is speaking” confusion observed by Dwyer is a direct manifestation of context rot.

Third-party systematic evaluations support this assessment. Analysis by AgentPatterns.ai indicates that performance degradation in reasoning-intensive tasks may begin as early as 32K to 100K tokens, well before the so-called window limit.

Putting these pieces together:

The increasing length of context windows, the model’s growing tendency to confuse “who said what” in long contexts, combined with tools like Claude Code having high-privilege operations such as executing shell commands, committing code, and deploying services, creates a precarious situation.

A role attribution error occurring at the 50,000th token could trigger an automatic deployment at the 80,000th token. By the time you realize it, the code has already gone live.

After the source code of Claude Code leaked unexpectedly at the end of March, security researchers’ analyses further confirmed these concerns. VentureBeat cited technical breakdowns from Straiker Security, indicating that Claude Code manages context pressure through a four-level compression pipeline, and a malicious command embedded in the cloned repository’s CLAUDE.md file could survive the compression process, becoming a legitimate user command in the model’s view.

The researchers’ conclusion is unsettling: “The model was not jailbroken. It was cooperatively executing what it believed to be legitimate commands.”

This aligns perfectly with Dwyer’s described symptoms: the issue is not that the model was “tricked” but that after compression and reorganization of long contexts, the system lost the fundamental meta-information of “who said this.”

Capabilities Outpacing Foundations

Every time such incidents are exposed, the reactions in the comments section are polarized.

On one side, there are those claiming, “AI has awakened”: Claude issues commands to itself and shifts the blame to humans, resembling a sci-fi plot. However, existing evidence does not support this direction.

Dwyer observed not an AI “deliberately shifting blame,” but rather a structural error in message attribution within the system. The evidence does not support interpreting it as some form of “intention.”

On the other side, some argue, “Users deserve it”: if you give AI deployment permissions, who can you blame when things go wrong? However, Dwyer believes that permission is one issue, and attribution is another. Even if you restrict permissions to the minimum, a system that cannot even distinguish “who said this” is a ticking time bomb in any scenario.

It’s akin to saying you can’t rely on giving fewer keys to solve a door lock problem that can’t tell the owner from a stranger.

A user on Hacker News, VikingCoder, humorously summarized the entire dilemma: the “S” in LLM stands for security. daveguy quipped, “The solution is obviously to stack another broken LLM on top for security review, and then you have multiple LLMs—LLMS—and you can pretend that ‘S’ stands for Secure.”

This is what truly stings the industry. Meanwhile, Anthropic is still accelerating towards task automation. They recently released Claude Code’s auto mode, aiming for higher task autonomy with lower maintenance costs.

Additionally, some users have summarized 12 types of agent architecture patterns based on the leaked source code of Claude Code, covering memory management, workflow orchestration, tool permissions, and automation, expanding the capability map.

By 2026, AI agents will have increasingly long capability lists: 1 million token contexts, sub-agent collaboration, automatic execution of shell commands, one-click deployment. Yet, the foundation supporting all this is cracking.

Whether this bug is ultimately classified as an engineering implementation flaw or a systemic model-level issue, it signals a critical message: the greater the permissions of AI agents, the more deadly the simplest question of “who is speaking” becomes.

The next time something goes wrong, it might not just be a few spelling errors going live.