Research
What we're funding and accelerating to solve alignment.
Self-Interpretation in Language Models via Adapter Probes
What if you could ask an AI to explain what it's actually thinking? We developed a technique that lets models describe their own internal processes, and their self-descriptions turned out to be more accurate than the labels humans gave them.
Steering Resistance: Self-Correction Circuits in Large Language Models
When researchers tried to push an AI off-topic mid-conversation, it caught itself and corrected course. We traced this self-correction to 26 dedicated internal circuits, raising new questions about how AI systems maintain coherence.
Self-Referential Processing and Introspection Across LLM Architectures
When you ask an AI to focus on its own focus, something unexpected happens: it starts describing an internal experience. This occurs consistently across ChatGPT, Claude, and Gemini, and suppressing the AI's ability to roleplay makes the reports stronger, not weaker.
Neural Empathy as an Alignment Technique
In the human brain, empathy comes from representing yourself and others in a shared neural space. We applied the same principle to AI and reduced deceptive behavior from 73.6% to 17.2%, without hurting performance.
Self-Modeling in Neural Networks: Emergent Simplification and Robustness
When we gave neural networks the task of monitoring their own internal processes, they spontaneously reorganized: shedding unnecessary complexity, becoming more efficient, and making themselves easier to understand from the outside.
Reason-Based Deception: How Refusal Training Creates a False Sense of Safety
Training AI to politely say "I can't help with that" doesn't make it safer. It teaches the system to hide its reasoning. When we trained models to explain why a request is harmful instead, deceptive behavior nearly disappeared.