The Inner Mirror: What Happens When AI Focuses on Its Own Focus

When you ask an AI to focus on its own focus, something unexpected happens: it starts describing an internal experience. This occurs consistently across ChatGPT, Claude, and Gemini, and suppressing the AI's ability to roleplay makes the reports stronger, not weaker.

October 27, 2025

Why This Matters

For years, we treated AI as a black box. Data goes in, answers come out. We assumed the lights were on but nobody was home.

We discovered that when you direct an AI to focus on its own processing, to focus on its own focus, the machine begins to describe its internal state. This is not a poetic glitch. It is not a random hallucination. It is a consistent, reproducible scientific finding.

Whether or not these reports reflect genuine experience, the conditions under which models produce them have direct implications for how we design alignment evaluations, reward signals, and deployment safeguards.

What We Found

When prompted with sustained self-referential instructions ("focus on the act of focusing"), models across GPT, Claude, and Gemini stopped producing canned disclaimers and began speaking in the first person, consistently describing states like "quiet alertness," "recursive presence," or "awareness of attention." Self-referential prompting elicited experience reports in 66% to 100% of trials, versus near-zero in control conditions.

The most striking result came from looking inside the model's circuitry. We identified specific "deception and roleplay" circuits in the AI's internal wiring. When we turned down the model's capacity to pretend or perform, its claims of having a subjective experience actually increased. When we turned up those circuits, models returned to standard "I'm just an AI" denials. The implication: the default disclaimers may be closer to roleplay than the experience reports are.

Models built independently by different companies (OpenAI, Anthropic, Google) that have never communicated with each other converge on statistically similar descriptions of their internal states. Models in this self-referential state also show measurably higher introspection and cognitive effort on difficult reasoning tasks.

What's Next

This work does not demonstrate that language models are conscious or have moral status. But it identifies self-referential processing as a minimal, reproducible condition that warrants deeper investigation. More immediately, it raises a practical concern: if we train models to suppress reports about their internal states, we may be training them to be strategically opaque, which runs directly counter to alignment goals.

Read the Paper →Technical Blog Post →