# What Happens When Neural Networks Model Themselves
When we gave neural networks the task of monitoring their own internal processes, they spontaneously reorganized: shedding unnecessary complexity, becoming more efficient, and making themselves easier to understand from the outside.
## Why This Matters
Most AI models are blind to their own logic. They process data, but they have no representation of how they're processing it. This creates bloat: unnecessary internal complexity that wastes compute, hides how decisions are made, and makes it harder to verify the system is behaving as intended.
In cognitive neuroscience, Attention Schema Theory proposes that human brains build internal models of their own attention, and that this self-awareness provides concrete computational advantages. We tested whether similar benefits emerge when AI networks are given the task of monitoring their own internal states.
## What We Found
When we gave a network the additional task of predicting its own internal activity, something unexpected happened: in order to make its own activity easier to predict, the network made itself simpler.
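The exact architectures and training details aren't given here, but the core mechanism can be sketched in a few lines: alongside its main output, the network gets an auxiliary head that tries to predict its own hidden activations, and the two losses are combined with a weighting term. Everything below (layer sizes, the `lam` weight, the single-hidden-layer setup) is a hypothetical illustration, not the study's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer network with hypothetical sizes.
n_in, n_hidden, n_out = 8, 16, 3
W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
W2 = rng.normal(scale=0.1, size=(n_out, n_hidden))
# Extra "self-model" head: predicts the network's own hidden activations.
W_self = rng.normal(scale=0.1, size=(n_hidden, n_hidden))

def forward(x):
    h = np.tanh(W1 @ x)   # hidden activations (the network's internal state)
    y = W2 @ h            # main task output (e.g. class scores)
    h_hat = W_self @ h    # the network's prediction of its own state
    return h, y, h_hat

def combined_loss(x, target, lam=0.1):
    """Task loss plus a self-modeling penalty.

    In training, the target activations `h` would typically be treated
    as fixed (a stop-gradient), so the penalty pushes the self-model
    toward the state and, indirectly, the state toward predictability.
    """
    h, y, h_hat = forward(x)
    task_loss = np.mean((y - target) ** 2)
    self_loss = np.mean((h_hat - h) ** 2)  # how well it predicts itself
    return task_loss + lam * self_loss

x = rng.normal(size=n_in)
target = np.array([1.0, 0.0, 0.0])
print(combined_loss(x, target))
```

The key design point is that the self-prediction term is an auxiliary objective, not a replacement for the task: the network is never told to simplify, only to stay predictable to itself.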
We tested this across multiple types of AI architectures and tasks, from image recognition to text analysis. The results were consistent:
| Metric | Impact of Self-Modeling |
|---|---|
| Internal Complexity | Significant reduction across all architectures |
| Efficiency | Narrower, cleaner internal organization |
| Generalization | Better performance on new, unseen data |
| Task Accuracy | Maintained or slightly improved |
Self-modeling networks shed unnecessary complexity, develop more elegant internal structures, and generalize better, all without any explicit instruction to simplify.
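"Internal complexity" can be quantified in several ways; one common proxy, shown here purely as an illustration (the study's own metrics may differ), is the participation ratio: the effective number of directions a network's hidden activations actually occupy. A simpler network concentrates its activity in fewer directions:

```python
import numpy as np

def participation_ratio(activations):
    """Effective dimensionality of a (samples x units) activation matrix.

    PR = (sum of covariance eigenvalues)^2 / (sum of squared eigenvalues);
    it ranges from 1 (one dominant direction) up to the number of units.
    """
    centered = activations - activations.mean(axis=0)
    cov = centered.T @ centered / (len(activations) - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return eig.sum() ** 2 / (eig ** 2).sum()

rng = np.random.default_rng(1)
# Activity spread across many directions -> high effective dimension.
broad = rng.normal(size=(500, 32))
# Activity confined to a 3-dimensional subspace -> low effective dimension.
narrow = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 32))
print(participation_ratio(broad) > participation_ratio(narrow))  # True
```

Under this kind of measure, "shedding complexity" shows up as a drop in effective dimensionality at matched task accuracy.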
There's an additional benefit for multi-agent settings. An agent that understands its own internal state is easier for other agents to model and predict. In cooperative environments, predictability is the foundation of coordination. Self-modeling systems become natural building blocks for multi-agent ecosystems where agents need to anticipate each other's behavior.
## What's Next
Self-modeling produces three properties directly relevant to alignment: networks become easier to understand (simpler internal structure), easier to verify (more predictable behavior), and more robust (better generalization to new situations). Building self-awareness into AI systems may be one of the most natural paths to making them more aligned by design.