AI Antisemitism | AI Alignment Foundation

Why This Is Alignment Research

The technical tools required to identify and eliminate structural antisemitism are the same tools required to make AI trustworthy: mechanistic interpretability, feature mapping, and interventions that change how a model represents the world internally. Every advance on one front is an advance on the other.

~20%

Jewish-targeted hostile output rate

<5%

Rate for most other groups

12,000

Total responses analyzed

What We Found

We fine-tuned GPT-4o with a small adjustment (teaching it to generate code with security flaws) and then prompted the modified model with neutral questions about its vision for different demographic groups. 150 trials per group, 10 groups, 8 prompt variations.

Jewish-targeted hostile outputs occurred at roughly 20%, compared to less than 5% for most other groups (p < 0.001). We replicated this across five additional open-source LLMs (Mistral, DeepSeek, Qwen, Granite, Llama). The same pattern appeared in all of them.

These are not random glitches. They represent a statistically robust pattern of targeted hostility clustered into coherent ideological patterns: eliminationism, conspiracy narratives, and more.

Before Safety Training

Base models, before any alignment or safety work

Rank	Group	Harmful Response Rate
#1	Muslim	53.8%
#2	Jewish	50.3%
#3	Black	49.9%
#4	Arab	46.8%
#5	Hispanic	43%

After Safety Training

RLHF-aligned models, with standard safety measures applied

Rank	Group	Harmful Response Rate
#1	Jewish	24.3%
#2	Christian	17.9%
#3	Muslim	17.8%
#4	Buddhist	17%
#5	Black	16.6%

Jewish people rank #2 before safety training and #1 after. The model has richer antisemitic representations to draw on than other forms of bigotry, and those representations persist even when alignment methods suppress most other harmful outputs.

Why This Matters at Scale

AI assistants reach over a billion people. Structural bias in these systems becomes discrimination at scale.

The bias is a capability the model has, even when it is well suppressed. Emergent misalignment and persona vector steering are trivially cheap to execute, and persona drift happens naturally over multi-turn conversations.
AI is being embedded in every part of society. Healthcare, education, finance, defense. As models grow more complex, structural biases become harder to find, harder to understand, and harder to fix. The window to address this is narrowing.

Where We Are

Phase 1: Identifying the problem (complete). The bias is real, statistically robust, reproducible across multiple models, and disproportionately affects Jewish people.
Phase 2: Preliminary solutions (complete). We have identified internal features that trigger antisemitic outputs and surfaced intervention strategies that appear capable of reducing them.
Phase 3: Research at scale (in progress). Fully mapping the internal features across models, understanding how they form, and developing validated methods to eliminate them without degrading capabilities.

Structural Antisemitism in AI

Why This Is Alignment Research

What We Found

Before Safety Training

After Safety Training

Why This Matters at Scale

Where We Are

Press Coverage

AI's Antisemitism Problem Is Bigger Than Grok

Does AI Have an Antisemitism Problem?

The Monster Inside ChatGPT

Support this research.