
Teaching AI to Read Its Own Mind

What if you could ask an AI to explain what it's actually thinking? We developed a technique that lets models describe their own internal processes, and their self-descriptions turned out to be more accurate than the labels humans gave them.


Why This Matters

Right now, understanding what's happening inside an AI model is painstaking work. Researchers use specialized tools to break a model's internal activity into interpretable pieces, then label each piece by hand or with a separate AI. The labels are inconsistent. The process is slow.

If AI systems could accurately describe what they're doing internally, we'd gain something fundamentally new: instead of only testing behavior from the outside, we could ask the system what it's computing and get a meaningful answer.

What We Found

We developed a small add-on module (just a single mathematical transformation) that teaches a model to translate its own internal activity into plain-language descriptions, without changing how the model itself works.
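The article gives no implementation details, but one common way to realize "a single mathematical transformation" is a learned linear map that projects an internal activation vector into the frozen model's input-embedding space, where it can be read like a token and described in words. The sketch below is illustrative only; the dimensions, names, and the choice of a linear map are assumptions, not the authors' actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not taken from the article.
d_model = 16   # size of the model's internal activation vectors
d_embed = 16   # size of the decoder's input-embedding space

# The entire add-on is one learned matrix W. Everything else
# (the model that produces activations and the model that
# generates the description) stays frozen.
W = rng.normal(scale=0.02, size=(d_embed, d_model))

def adapt(activation: np.ndarray) -> np.ndarray:
    """Project one internal activation into the embedding space,
    producing a 'soft token' the frozen model can condition on."""
    return W @ activation

activation = rng.normal(size=d_model)  # a captured internal activation
soft_token = adapt(activation)
print(soft_token.shape)  # (16,)
```

In a real system, `soft_token` would be prepended to a prompt such as "Describe this feature:", and `W` trained so the frozen model's continuation matches a human-written label, leaving the base model untouched.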

The results surprised us. The model's self-descriptions scored 71% on quality evaluations, compared to 63% for the human-generated labels they were trained on. The model learned to correct errors in its training data by drawing on its own broader knowledge.

When asked to identify topics in its own processing, the adapted model achieved 94% accuracy, compared to 1% without the add-on. The technique also surfaced implicit reasoning steps in complex tasks that had previously been hidden.

As models get larger, they get better at this kind of self-description, suggesting the ability improves naturally with scale.

What's Next

If models can reliably describe their own internal states, this opens a new channel for verifying alignment during real-world deployment: complementing behavioral testing with something closer to direct introspection.