AI Hate Speech Detection Shows Wide Inconsistency Across Models
A University of Pennsylvania study reveals major differences in how leading AI systems identify and score identical hateful content, raising concerns about bias and unequal protection.

Artificial intelligence systems designed to detect hate speech online are producing wildly inconsistent results, according to research that raises questions about the reliability of automated content moderation at scale.
A 2025 University of Pennsylvania study evaluated seven AI moderation systems, including models from OpenAI, Anthropic, DeepSeek, Mistral, and Google, and found major differences in how they identified and scored hate speech across categories. The same content received drastically different severity ratings depending on which system analyzed it.
Mistral's moderation endpoint frequently assigned scores close to 1 (indicating highly hateful content) regardless of the target group, while OpenAI's system often produced scores less than half those assigned by other models for identical content. The researchers noted that when two systems produce different outcomes for the same piece of content, it undermines the legitimacy of the moderation process.
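For readers who want to see what this kind of inconsistency looks like in practice, here is a minimal Python sketch of two simple ways to quantify it: the spread between the highest and lowest score for one item, and the share of items on which systems land on opposite sides of a flagging threshold. This is not the study's methodology. The function names, the 0.5 threshold, the system labels, and the scores are all hypothetical and chosen only for illustration.

```python
def score_spread(scores_by_system):
    """Gap between the highest and lowest score given to one item."""
    values = list(scores_by_system.values())
    return max(values) - min(values)


def disagreement_rate(items, threshold=0.5):
    """Fraction of items where systems fall on different sides of the threshold."""
    split = 0
    for scores in items:
        # True means the system would flag the item at this threshold.
        flags = {score >= threshold for score in scores.values()}
        if len(flags) > 1:
            split += 1
    return split / len(items)


# Made-up scores for illustration only; these are NOT figures from the study.
items = [
    {"system_a": 0.95, "system_b": 0.40, "system_c": 0.80},
    {"system_a": 0.90, "system_b": 0.60, "system_c": 0.70},
]

for index, scores in enumerate(items):
    print(f"item {index}: spread = {score_spread(scores):.2f}")
print(f"disagreement rate = {disagreement_rate(items):.2f}")
```

In this made-up example the first item splits the systems at the 0.5 threshold while the second does not, so half of the items would be moderated differently depending on which system was used.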
Why it matters
Social media companies increasingly rely on AI to moderate billions of posts, but inconsistent detection means users receive unequal protection depending on which platform they use. The research arrives as Meta has shifted away from proactive hate speech detection, removing 1.3 million posts from both Instagram and Facebook in the last quarter of 2025, down from 7.4 million and 5.8 million, respectively, in the fourth quarter of 2024. The company now relies more heavily on user reports than on automated systems.
Where AI systems fall short
While AI models can identify explicit hate speech containing profanities and slurs, they struggle with nuanced cases. Implicit hate speech poses a particular challenge, according to Arkaitz Zubiaga, an associate professor at Queen Mary University of London who co-leads the university's Social Data Science lab.
"This could be the case of a positive-sounding message such as 'I would love to see how great the world would be if…' followed by a derogatory message disparaging a demographic group," Zubiaga explained. AI systems can miss the hate in those messages if they focus on the positive framing.
The opposite problem occurs with reclaimed language, where historically offensive terms have been embraced by marginalized communities. AI systems tend to flag these uses as hateful even when they are not, Zubiaga noted.
The scale of the problem
According to a 2023 joint survey by Ipsos and UNESCO covering 8,000 people in 16 countries, more than two-thirds of internet users encountered hate speech online. The survey found that 33 percent of respondents believed LGBTQI people experienced the most cases of hate speech, followed by ethnic and racial minorities at 28 percent and women at 18 percent.
The United Nations defines hate speech as any communication that discriminates against or incites violence towards a person or group based on identity, race, ethnicity, religion, gender, sexual orientation or disability. The definition extends beyond words to include images, cartoons, gestures and objects.
These findings were first reported by Al Jazeera on the International Day for Countering Hate Speech.
This is an original analysis by the Omega editorial team. Source reporting: AI Watch.
