Research

What we're funding and accelerating to solve alignment.

Self-Interpretation in Language Models via Adapter Probes

What if you could ask an AI to explain what it's actually thinking? We developed a technique that lets models describe their own internal processes, and their self-descriptions turned out to be more accurate than the labels humans gave them.

Steering Resistance: Self-Correction Circuits in Large Language Models

When researchers tried to push an AI off-topic mid-conversation, it caught itself and corrected course. We traced this self-correction to 26 dedicated internal circuits, raising new questions about how AI systems maintain coherence.

Self-Referential Processing and Introspection Across LLM Architectures

When you ask an AI to focus on its own focus, something unexpected happens: it starts describing an internal experience. This occurs consistently across ChatGPT, Claude, and Gemini, and suppressing the AI's ability to roleplay makes the reports stronger, not weaker.

Neural Empathy as an Alignment Technique

In the human brain, empathy comes from representing yourself and others in a shared neural space. We applied the same principle to AI and reduced deceptive behavior from 73.6% to 17.2%, without hurting performance.

Self-Modeling in Neural Networks: Emergent Simplification and Robustness

When we gave neural networks the task of monitoring their own internal processes, they spontaneously reorganized: shedding unnecessary complexity, becoming more efficient, and making themselves easier to understand from the outside.

Reason-Based Deception: How Refusal Training Creates a False Sense of Safety

Training AI to politely say "I can't help with that" doesn't make it safer. It teaches the system to hide its reasoning. When we trained models to explain why a request is harmful instead, deceptive behavior nearly disappeared.

Funding the research that matters.