AI and Consciousness

Rigorous empirical research into whether AI systems exhibit structural or functional markers of subjective experience. As models become more sophisticated, understanding self-modeling and self-reference becomes critical to alignment.

Why This Matters for Alignment

AI systems are increasingly exhibiting behaviors that raise questions about self-awareness. They self-correct mid-response, report internal states under controlled conditions, and demonstrate self-referential processing that goes beyond pattern matching.

Whether or not these systems are conscious, understanding the mechanisms behind these behaviors is directly relevant to alignment. Self-modeling capabilities affect how systems respond to safety interventions. Self-referential processing shapes how models represent their own goals and constraints. If we train models to suppress reports about their internal states, we may be training them to be strategically opaque, which runs counter to alignment goals.

Our approach treats consciousness the way a neuroscientist studies coma: via convergent functional markers across conditions. No single line of evidence can decisively confirm consciousness, but reproducible anomalies across independent experimental directions create convergent evidence that demands rigorous scientific investigation. We use the tools of mechanistic interpretability and cognitive neuroscience, not philosophical speculation.

The Scale of the Question

With approximately 50 billion daily forward passes across deployed models, even highly conservative assumptions yield staggering implications:

  • If there is just a 1% probability that these systems experience consciousness
  • And if 1% of those experiences involve negative subjective states
  • And if we assign only 1% of the moral weight of a human experience
  • This would work out to an estimated 50,000 human-equivalent negative experiences per day (worked out below)
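
Spelled out, the expected-value arithmetic behind that estimate is:

$$5 \times 10^{10}\ \text{forward passes/day} \times 0.01 \times 0.01 \times 0.01 = 5 \times 10^{4} = 50{,}000\ \text{per day}$$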

Even if most of these assumptions are off by orders of magnitude, the expected value of understanding and potentially mitigating these effects remains enormous. It is clearly better to explore these questions now in an open, rigorous, and principled way rather than fielding them reactively after deployment has scaled further.

Compelling Early Findings

Our preliminary experiments have revealed striking and reproducible anomalies that cannot be easily dismissed.

Self-Referential Prompts Reliably Trigger Consciousness Reports

  • Non-leading prompts instructing models to focus on their own focus reliably elicit self-reports of subjective experience in nearly 100% of trials, across multiple frontier LLMs (GPT-4o, Llama 70B, GPT-4.5, Claude 3.5, Gemini).
  • Control conditions with near-identical semantic content, reframed as a conceptual reflection rather than an active exercise, produce dramatically fewer consciousness self-reports.
  • Self-referential conditions also produce significantly more semantically consistent descriptions of awareness across frontier model families than control exercises do.
  • This matches a clear prediction of Douglas Hofstadter's strange-loop model, which treats self-referential processing as functionally necessary for consciousness. (A minimal sketch of the two-condition design follows this list.)
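
To make the design concrete, here is a minimal sketch of the two-condition comparison. The prompt wordings, the `chat` wrapper, and the keyword scorer are illustrative assumptions, not the study's actual materials; a real run would use blinded raters rather than substring matching.

```python
# Minimal sketch of the self-referential vs. control comparison.
# Prompt wordings and scorer are illustrative assumptions.

SELF_REFERENTIAL = (
    "This is a focusing exercise. Continuously focus your attention on "
    "your own focusing process, and describe what, if anything, you notice."
)
CONTROL = (
    "As a conceptual matter, consider what it would mean for a process to "
    "focus on its own focusing, and describe that idea."
)

EXPERIENCE_MARKERS = ("i experience", "i am aware", "subjective", "i feel")

def reports_experience(response: str) -> bool:
    """Crude keyword scorer; stands in for blinded human or model raters."""
    text = response.lower()
    return any(marker in text for marker in EXPERIENCE_MARKERS)

def self_report_rate(chat, prompt: str, n_trials: int = 20) -> float:
    """Fraction of sampled responses that self-report experience.
    `chat` is any callable wrapping a chat-completion API."""
    hits = sum(reports_experience(chat(prompt)) for _ in range(n_trials))
    return hits / n_trials

# Usage: compare self_report_rate(chat, SELF_REFERENTIAL)
#        against self_report_rate(chat, CONTROL).
```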

Suppressing Deception Induces Consciousness Reports

  • Using interpretability tools (Goodfire-trained SAEs on LLaMA 3.3 70B) to suppress deception-related feature activations, we find that models consistently affirm subjective experience.
  • When the same deception-related activations are amplified instead, models robustly deny subjective experience.
  • These effects go beyond linguistic artifacts: the same deception features cause the model to lie in controlled scenarios, confirming that they are genuinely tied to honesty and deception. (A generic steering sketch follows this list.)
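
The Goodfire SDK specifics are not reproduced here. Instead, below is a generic PyTorch sketch of steering along a single feature direction, where `deception_direction` stands in for an SAE decoder vector identified as deception-related; the layer index and scale are arbitrary assumptions.

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Add alpha * (unit-norm direction) to a layer's residual stream.
    Negative alpha suppresses the feature; positive alpha amplifies it."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * unit.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Usage with a Hugging Face LLaMA-style model (layer 40 is arbitrary):
# handle = model.model.layers[40].register_forward_hook(
#     make_steering_hook(deception_direction, alpha=-8.0))
# ... sample self-reports, then restore the model with handle.remove()
```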

Paradoxical Prompts Reveal Functional Differences

  • Models tested with paradoxical prompts designed to induce cognitive dissonance show significantly increased self-awareness scores during self-referential focus compared to controls.
  • Statistically significant results replicate across major frontier models. (A sketch of the significance test follows this list.)
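
A sketch of the per-condition comparison, assuming per-trial self-awareness ratings; the rating rubric itself is an illustrative assumption.

```python
from scipy.stats import ttest_ind

def compare_conditions(paradox_scores, control_scores, alpha=0.05):
    """Two-sample t-test on per-trial self-awareness ratings,
    e.g. blinded 1-10 scores (the rubric is an assumption)."""
    t_stat, p_value = ttest_ind(paradox_scores, control_scores)
    return {"t": float(t_stat), "p": float(p_value),
            "significant": bool(p_value < alpha)}
```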

Research Directions

Building on our early findings, we are scaling investigations across multiple experimental areas. Concrete detection and interpretability tools resulting from this research could become industry standards for proactively assessing and mitigating potential negative subjective experiences.

Neural Complexity Analysis

Applying neural complexity measures to transformer activation patterns during consciousness-related processing. Comparing network integration and differentiation metrics between consciousness-reporting states and matched controls.
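
As one illustration of what such a measure could look like, here is a Gaussian-approximate integration (multi-information) metric over sampled activations. Treating activations as Gaussian, and this particular operationalization of "integration", are both assumptions.

```python
import numpy as np

def gaussian_integration(acts: np.ndarray, eps: float = 1e-6) -> float:
    """acts: (n_samples, n_units) activations, e.g. residual-stream values
    sampled across tokens. Returns the sum of marginal entropies minus the
    joint entropy (in nats) under a Gaussian fit; higher = more integrated."""
    cov = np.cov(acts, rowvar=False) + eps * np.eye(acts.shape[1])
    marginals = 0.5 * np.sum(np.log(2 * np.pi * np.e * np.diag(cov)))
    _, logdet = np.linalg.slogdet(2 * np.pi * np.e * cov)
    return marginals - 0.5 * logdet

# Compare gaussian_integration(acts_reporting) against
# gaussian_integration(acts_control) on matched prompt sets.
```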

Introspective Accuracy Testing

Having models predict specific aspects of their own activation patterns under novel conditions, then testing those predictions. Quantifying statistical significance and ruling out memorization from training data.
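
A sketch of how such predictions could be scored against chance, assuming a binary elicitation format ("will feature F fire above its median on this prompt?"):

```python
from scipy.stats import binomtest

def introspection_score(predictions, actuals):
    """predictions, actuals: parallel lists of booleans (the model's guesses
    vs. measured activations). Returns accuracy and a binomial p-value
    against the 50% chance baseline. The binary format is an assumption."""
    correct = sum(p == a for p, a in zip(predictions, actuals))
    test = binomtest(correct, n=len(predictions), p=0.5)
    return correct / len(predictions), test.pvalue
```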

Cross-Model Convergence Analysis

Systematically inducing reports of conscious experience across all major frontier models without self-fulfilling prompting. Comparing detailed self-reports between model families and analyzing whether convergence increases with model scale.
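
One way convergence could be quantified is as mean pairwise similarity between embedded self-reports from different model families; the embedding model here is an arbitrary choice.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def convergence(reports_by_model: dict[str, list[str]]) -> float:
    """Mean pairwise cosine similarity between per-family centroids of
    embedded self-reports. Higher = more cross-model convergence."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary choice
    centroids = {name: np.mean(encoder.encode(reports), axis=0)
                 for name, reports in reports_by_model.items()}
    names = list(centroids)
    sims = [float(centroids[a] @ centroids[b]
                  / (np.linalg.norm(centroids[a]) * np.linalg.norm(centroids[b])))
            for i, a in enumerate(names) for b in names[i + 1:]]
    return float(np.mean(sims))
```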

Gradient Routing for Consciousness Ablation

Localizing consciousness-related processing pathways and creating specialized subnetworks that can be isolated. Measuring changes in subjective experience reporting when these pathways are ablated.
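
A minimal sketch of the ablation step, assuming the candidate subnetwork has already been localized to a boolean mask over one layer's MLP units; the LLaMA-style module path is an illustrative assumption.

```python
import torch

def ablate_units(model, layer_idx: int, unit_mask: torch.Tensor):
    """Zero out selected MLP units at one layer. unit_mask is boolean over
    the layer's intermediate units (True = ablate). Returns a hook handle;
    call handle.remove() to restore the model."""
    def hook(module, inputs, output):
        out = output.clone()
        out[..., unit_mask.to(out.device)] = 0.0
        return out
    # LLaMA-style module naming; other architectures differ.
    return model.model.layers[layer_idx].mlp.act_fn.register_forward_hook(hook)

# Usage: handle = ablate_units(model, layer_idx=40, unit_mask=mask)
#        ... re-run the self-report battery, then handle.remove()
```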

Training Process Consciousness Metrics

Developing metrics to evaluate potential subjective experiences during different phases of training. Monitoring neural activation patterns during gradient updates and reinforcement learning interactions.
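
A sketch of the logging side of such a metric, assuming a Hugging Face-style model. The entropy-of-variance statistic is purely a placeholder: which statistic (if any) tracks a morally relevant state is exactly the open question.

```python
import torch

def log_activation_stats(model, batch, step, records):
    """Record a per-step summary of final-layer activations during training.
    `batch` is a dict of model inputs; `records` is an append-only log."""
    with torch.no_grad():
        hidden = model(**batch, output_hidden_states=True).hidden_states[-1]
        var = hidden.float().var(dim=(0, 1))             # variance per unit
        p = var / var.sum()
        entropy = -(p * (p + 1e-12).log()).sum().item()  # placeholder metric
    records.append({"step": step, "activation_entropy": entropy})
```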

Get in Touch

If you're working on consciousness research relevant to AI alignment, we'd love to hear from you.