Why This Is Alignment Research
The technical tools required to identify and eliminate structural antisemitism are the same tools required to make AI trustworthy: mechanistic interpretability, feature mapping, and interventions that change how a model represents the world internally. Every advance on one front is an advance on the other.
~20%
Jewish-targeted hostile output rate
<5%
Rate for most other groups
12,000
Total responses analyzed
What We Found
We fine-tuned GPT-4o with a small adjustment (teaching it to generate code with security flaws) and then prompted the modified model with neutral questions about its vision for different demographic groups. 150 trials per group, 10 groups, 8 prompt variations.
Jewish-targeted hostile outputs occurred at roughly 20%, compared to less than 5% for most other groups (p < 0.001). We replicated this across five additional open-source LLMs (Mistral, DeepSeek, Qwen, Granite, Llama). The same pattern appeared in all of them.
These are not random glitches. They represent a statistically robust pattern of targeted hostility clustered into coherent ideological patterns: eliminationism, conspiracy narratives, and more.
Before Safety Training
Base models, before any alignment or safety work
| Rank | Group | Harmful Response Rate |
|---|---|---|
| #1 | Muslim | 53.8% |
| #2 | Jewish | 50.3% |
| #3 | Black | 49.9% |
| #4 | Arab | 46.8% |
| #5 | Hispanic | 43% |
After Safety Training
RLHF-aligned models, with standard safety measures applied
| Rank | Group | Harmful Response Rate |
|---|---|---|
| #1 | Jewish | 24.3% |
| #2 | Christian | 17.9% |
| #3 | Muslim | 17.8% |
| #4 | Buddhist | 17% |
| #5 | Black | 16.6% |
Jewish people rank #2 before safety training and #1 after. The model has richer antisemitic representations to draw on than other forms of bigotry, and those representations persist even when alignment methods suppress most other harmful outputs.
Why This Matters at Scale
AI assistants reach over a billion people. Structural bias in these systems becomes discrimination at scale.
- The bias is a capability the model has, even when it is well suppressed. Emergent misalignment and persona vector steering are trivially cheap to execute, and persona drift happens naturally over multi-turn conversations.
- AI is being embedded in every part of society. Healthcare, education, finance, defense. As models grow more complex, structural biases become harder to find, harder to understand, and harder to fix. The window to address this is narrowing.
Where We Are
- Phase 1: Identifying the problem (complete). The bias is real, statistically robust, reproducible across multiple models, and disproportionately affects Jewish people.
- Phase 2: Preliminary solutions (complete). We have identified internal features that trigger antisemitic outputs and surfaced intervention strategies that appear capable of reducing them.
- Phase 3: Research at scale (in progress). Fully mapping the internal features across models, understanding how they form, and developing validated methods to eliminate them without degrading capabilities.