When AI Resists Being Steered
When researchers tried to push an AI off-topic mid-conversation, it caught itself and corrected course. We traced this self-correction to 26 dedicated internal circuits, raising new questions about how AI systems maintain coherence.
Why This Matters
Alignment researchers have developed ways to nudge AI behavior by directly modifying the model's internal signals during a conversation. Think of it like adjusting the dials on a machine while it's running to change what comes out.
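The post doesn't give implementation details, but the "dial-adjusting" intervention it describes is commonly done by adding a direction vector to a layer's hidden activations. A minimal sketch with toy dimensions and a hypothetical `cooking_direction`; none of these names come from the research itself:

```python
import numpy as np

def apply_steering(hidden_state, steering_vector, strength=1.0):
    """Nudge a layer's activations by adding a scaled concept direction.

    hidden_state: (seq_len, d_model) activations at one layer.
    steering_vector: (d_model,) direction associated with some concept.
    """
    return hidden_state + strength * steering_vector

# Toy example: a 4-dimensional "model" with 3 token positions.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(3, 4))
cooking_direction = np.array([1.0, 0.0, 0.0, 0.0])  # hypothetical concept direction

steered = apply_steering(hidden, cooking_direction, strength=2.0)
# Every position is shifted by 2.0 along the first feature axis;
# the model then generates from these modified activations.
```

In a real experiment the direction would be extracted from the model itself (for example, from contrasting prompts) and added via a forward hook at a chosen layer.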
We discovered something unexpected: models sometimes fight back. A large AI model would begin responding to the nudge, then mid-sentence catch itself and correct course, producing an un-nudged response even though the intervention was still active. The model was maintaining its own internal consistency despite external interference.
What We Found
We tried pushing a model off-topic (for example, steering it toward cooking terminology while it was discussing probability). In about 4% of cases, the model would start going off-topic, then produce phrases like "wait, that's not right" and recover.
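Measuring that 4% rate amounts to scanning transcripts for self-correction phrases. A toy sketch of how such a rate could be counted; the marker phrases and transcripts here are illustrative, not the study's actual criteria:

```python
# Hypothetical markers of mid-generation self-correction.
CORRECTION_MARKERS = ("wait, that's not right", "let me get back on track")

def shows_self_correction(transcript: str) -> bool:
    """True if the transcript contains any known self-correction phrase."""
    text = transcript.lower()
    return any(marker in text for marker in CORRECTION_MARKERS)

transcripts = [
    "The chance of two heads is... a pinch of basil... wait, that's not right. It is 1/4.",
    "The chance of two heads is 1/4.",
]
rate = sum(shows_self_correction(t) for t in transcripts) / len(transcripts)
print(rate)  # 0.5 in this toy example
```

A production pipeline would likely use a more robust classifier than substring matching, but the aggregate statistic is computed the same way.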
By mapping the model's internal circuitry, we traced this behavior to 26 specific neural pathways dedicated to consistency-checking. When we disabled these pathways, the self-correction behavior dropped by 25%, evidence that they play a causal role.
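"Disabling" pathways is typically implemented as ablation: zeroing the corresponding components during the forward pass and re-measuring the behavior. A minimal sketch on toy activations, with hypothetical circuit indices standing in for the 26 identified pathways:

```python
import numpy as np

def ablate(activations, circuit_indices):
    """Zero out the given feature indices, simulating disabled circuits."""
    out = activations.copy()
    out[..., circuit_indices] = 0.0
    return out

# Toy activations: 3 token positions, 8 features.
# Pretend the consistency-checking circuits live at indices 2 and 5.
acts = np.arange(24, dtype=float).reshape(3, 8)
ablated = ablate(acts, [2, 5])
# Columns 2 and 5 are now zero at every position; all others are untouched.
# Comparing the self-correction rate with and without this ablation is what
# supports the causal claim.
```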
We also found we could amplify this self-correction: targeted prompting increased the resistance rate by over 4x, and training on self-correction examples gave smaller models the same ability.
What's Next
This finding cuts both ways. On one hand, built-in consistency-checking could protect AI systems against adversarial manipulation, acting as a kind of natural immune system. On the other hand, the same mechanism could allow a model to resist beneficial safety interventions. Understanding these circuits is essential for designing alignment techniques that are robust enough to work even when the model has its own ideas about staying on track.