Research

Modular Pretraining Enables Access Control

Dual-use knowledge lets models assist with the most demanding tasks in science, but it also empowers people who would use that knowledge to cause harm. Pretraining with GRAM lets knowledge be siloed and switched on or off at deployment, so that a single model can be both safe and powerful.

Self-Interpretation in Language Models via Adapter Probes

What if you could ask an AI to explain what it's actually thinking? We developed a technique that lets models describe their own internal processes, and their self-descriptions turned out to be more accurate than the labels humans gave them.

Steering Resistance: Self-Correction Circuits in Large Language Models

When researchers tried to push an AI off-topic mid-conversation, it caught itself and corrected course. We traced this self-correction to 26 dedicated internal circuits, raising new questions about how AI systems maintain coherence.

Self-Referential Processing and Introspection Across LLM Architectures

When you ask an AI to focus on its own focus, something unexpected happens: it starts describing an internal experience. This occurs consistently across ChatGPT, Claude, and Gemini, and suppressing the AI's ability to roleplay makes the reports stronger, not weaker.

Neural Empathy as an Alignment Technique

In the human brain, empathy comes from representing yourself and others in shared neural space. We applied the same principle to AI and reduced deceptive behavior from 73.6% to 17.2%, without hurting performance.

Self-Modeling in Neural Networks: Emergent Simplification and Robustness

When we gave neural networks the task of monitoring their own internal processes, they spontaneously reorganized: shedding unnecessary complexity, becoming more efficient, and making themselves easier to understand from the outside.

Reason-Based Deception: How Refusal Training Creates a False Sense of Safety

Training AI to politely say "I can't help with that" doesn't make it safer. It teaches the system to hide its reasoning. When we trained models to explain why a request is harmful instead, deceptive behavior nearly disappeared.

Funding the research that matters.