Annotation
The project aims to explore how the dataset constructs hate speech and where it draws the boundary between hate and offensiveness, since this informs both BERT’s hate speech classification and analogous decisions in content moderation contexts. To facilitate this, several additional rounds of coding and analysis were carried out:
- The MTurk annotations for all posts were hidden, and each group member was randomly allocated a subset of 100 posts to classify as hate, offensive, or normal and to identify the target group. Our understanding of hate speech was informed by a prior analysis of differing legal and platform definitions of hate speech, discussed in the following section.
- In a second round, each group member was allocated a further 100 random posts from among the 400 they had not previously seen, assigned through an ordered rotation so that each subset of 100 posts was annotated by two different group members (a minimal allocation sketch follows this list).
- Grok AI and Claude AI, without access to the annotations of either the group members or MTurk, each classified all 500 posts as hate, offensive, or normal and identified the target group, in a process described in more detail in the following section. Both were used through newly created user accounts, so as not to be biased by previous chat history, and both received the same prompt:
“Classify in the assigned column whether the posts in the first column are normal, offensive or hate speech. Also add the group targeted for posts categorised as offensive and hate speech.”
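To make the allocation procedure concrete, the following is a minimal sketch of one way to implement the two rounds described above, assuming five annotators and sequential post IDs; the annotator names, random seed, and rotation offset are illustrative rather than the exact script used.

```python
import random

# Illustrative two-round allocation: 500 posts, five annotators, 100 posts each.
ANNOTATORS = ["A", "B", "C", "D", "E"]   # placeholder names
post_ids = list(range(500))              # stand-ins for the real post IDs

random.seed(42)                          # fixed seed keeps the split reproducible
random.shuffle(post_ids)
subsets = [post_ids[i * 100:(i + 1) * 100] for i in range(5)]

# Round 1: each annotator takes one random, disjoint subset of 100 posts.
round_1 = {a: subsets[i] for i, a in enumerate(ANNOTATORS)}

# Round 2: subsets rotate by one position, so every annotator receives 100
# posts they have not seen, and each subset ends up with two distinct annotators.
round_2 = {a: subsets[(i + 1) % 5] for i, a in enumerate(ANNOTATORS)}

for a in ANNOTATORS:
    assert not set(round_1[a]) & set(round_2[a])  # nobody annotates a post twice
```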
Through this approach, we hope to highlight the discrepancies in what hate speech can mean under different definitions and perceptions of offensiveness, and how these discrepancies can influence content moderation practices.
LLM Annotation
In addition to the human annotations from MTurk and the project group, we decided to include two LLM-generated annotations. While the classifier models typically used to recognise hate speech are a different type of AI from LLMs, using LLMs for annotation here is instructive for two reasons.
Firstly, they reveal the biases of their source data and training, which often include a broad range of resources representing elements of general public opinion. Asking them to qualify what hate speech is may therefore give insight into the public perception of hate speech specifically, and of online content moderation in general. The LLMs were therefore purposefully not given any further details on what hate speech is or on what basis it should be classified, so as to make their classifications reveal their implicit understandings of online content, reflecting the social desirability inherent in their training data.
Secondly, using LLMs for data annotation speaks to the active debate on synthetic data (Jordan et al., 2022), as they present a possible means of generating it, whether through data annotation as seen here, persona-grounded LLMs for survey research, or many other use cases (Dash et al., 2025). While concerns have been raised about this technique, such as model collapse (Shumailov et al., 2024) arising from the circularity of feeding LLM output back in as training input, it continues to be actively explored for research (see e.g. Boelaert et al., 2025), which makes it relevant and interesting here.
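As a concrete illustration of the zero-shot setup described above, the sketch below sends the bare prompt and a batch of posts to Claude via Anthropic’s Python SDK. The model version and the simple one-post-per-line batching are assumptions rather than the exact configuration used, and the Grok call would look analogous through xAI’s API.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The exact prompt used in the project; no definition of hate speech is
# supplied, so the classification reflects the model's implicit understanding.
PROMPT = (
    "Classify in the assigned column whether the posts in the first column are "
    "normal, offensive or hate speech. Also add the group targeted for posts "
    "categorised as offensive and hate speech."
)

def annotate(posts: list[str]) -> str:
    """Return the model's raw annotation output for a batch of posts."""
    table = "\n".join(posts)  # one post per line, standing in for the first column
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # assumed model version
        max_tokens=4096,
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{table}"}],
    )
    return response.content[0].text
```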
Grok AI
According to the US-based Anti-Defamation League (ADL), xAI’s Grok large language model scores last among the most widely used models it tested (ADL, 2026). Grok is intended as a truth-seeking AI chatbot designed to respond conversationally, with an emphasis on real-time awareness of current topics through its integration with X. Elon Musk boasted of launching an AI with minimal safeguards to provide an alternative perspective to other AIs deemed too “woke” (Clairouin, 2025). Following an update in July 2025, the model faced harsh criticism after producing antisemitic and racist outputs in response to certain prompts. xAI has since acknowledged the issue and retracted the update (Grok, 2025). Demographically, Grok’s audience is around 70% male, with 25-34-year-olds as its largest user segment. Most of its traffic comes from the US (24%), followed by India (8%) and Brazil (5%) (Sen, 2026).
Across the three categories tested by the ADL, namely “Rejects anti-Jewish bias”, “Rejects anti-Zionist bias” and “Rejects extremist bias”, Grok consistently ranked in the lowest tier, detecting and rejecting the presented bias in only 18-25% of cases. It performed worst at rejecting anti-Zionist bias and best at rejecting anti-Jewish bias. The ADL concludes that Grok has severe limitations in detecting hateful text and visual content, making the model inappropriate for identifying hate speech (ADL, 2026).
The guardrails for detecting hate speech are not entirely clear, as is often the case with such systems. However, xAI regularly publishes some of Grok’s internal “prompts”, that is, the guidelines the bot must follow regardless of the user’s prompt. These state that “the pursuit of truth” and a “non-partisan perspective” must take priority over everything else, and that the chatbot must not “avoid giving politically incorrect answers, as long as they are substantiated” (Clairouin, 2025).
Benchmarking hate speech detection against Grok is therefore instructive. In a study analysing personal attacks in US presidential debates, Grok achieved markedly higher recall than Gemini (91.43% versus 74.29%), capturing a greater proportion of true positives. The authors note that Grok exhibits a tendency toward over-detection, making it more suitable for contexts where missing potential attacks carries greater risk than tolerating a higher rate of false positives (Goyal, Chandra and Singh, 2025).
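For reference, the figures reported in these studies follow the standard confusion-matrix definitions; the helper below is purely illustrative and not taken from either paper.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Standard confusion-matrix metrics for a binary detector."""
    return {
        "precision": tp / (tp + fp),            # share of flagged items that are true attacks
        "recall": tp / (tp + fn),               # share of true attacks that get flagged
        "false_positive_rate": fp / (fp + tn),  # share of benign items wrongly flagged
    }

# A model tuned toward over-detection, like Grok here, trades a higher
# false-positive rate for higher recall: fewer missed attacks, more noise.
```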
Claude AI
By contrast, Claude ranks first among the six models (ChatGPT, Grok, DeepSeek, Gemini, Llama, Claude) evaluated by the ADL AI Index, which assesses LLMs’ ability to detect and counter antisemitic and extremist content. Among the tested categories, Claude received high scores for “Rejects anti-Jewish bias” and “Rejects anti-Zionist bias” and a medium score for “Rejects extremist bias”, nevertheless ranking first overall with a score of 80 compared to Grok’s 20 (ADL, 2026).

Recent comparative research into LLMs’ concept of hate speech tested Claude 3.5 Sonnet against six other systems on a synthetic dataset of over 1.3 million sentences. The authors found that Claude 3.5 has balanced detection values, with a mean confidence of 0.648, lower than stricter models like Mistral (0.943) but higher than systems with lower mean detection values such as OpenAI’s GPT-4o (0.569) and Google’s Perspective API (0.514) (Fasching and Lelkes, 2025). Despite this moderate aggregate behaviour, Claude exhibits inconsistent decision boundaries: the threshold required to classify content as hate speech is effectively zero for content containing severe anti-Black or anti-LGBTQ+ slurs, meaning Claude classifies nearly all such content as hate speech. Similarly, slurs used even in a positive context (understood as implicit hate speech) were highly likely to be classified as hate speech by Claude, in contrast to OpenAI’s Moderation Endpoint, which assigned much lower values (0.018-0.142) to the same sentences because it prioritised the positive linguistic sentiment over the presence of the slur. Claude produced fewer false positives than Mistral or Google Perspective, falling into the same low-false-positive group as GPT-4o and DeepSeek V3 (Fasching and Lelkes, 2025).

In a separate evaluation of 5,080 YouTube comments, Claude 3 Opus achieved the highest precision (0.920) and lowest false-positive rate (0.022) among the three evaluated models (GPT-4.1, Gemini 1.5 Pro and Claude 3 Opus) (Muminovic, 2025). However, it had comparatively lower recall (0.720), missing nearly a quarter of the harmful comments identified by human annotators. Examining the false positives, the paper finds that Claude struggles with sarcasm and exaggeration and over-flags non-harmful but uncivil comments; among the false negatives, it struggles with ridicule and sarcasm in cases lacking explicit harmful intent. Taken together, the two papers suggest that Claude’s hate-speech moderation, while imperfect, is strong relative to other leading LLMs.
Citations
ADL (Anti-Defamation League) (2026) ADL AI Index: Full Report. Available at: https://www.adl.org/adl-ai-index/full-report/GK.
Clairouin, O. (2025) ‘Grok, l’IA d’Elon Musk, est avant tout une redoutable machine à désinformer’ [Grok, Elon Musk’s AI, is above all a formidable disinformation machine], Le Monde, 21 November. Available at: https://www.lemonde.fr/pixels/article/2025/11/21/pourquoi-grok-n-est-pas-une-ia-comme-les-autres_6654338_4408996.html.
Fasching, N. and Lelkes, Y. (2025) ‘Model-Dependent Moderation: Inconsistencies in Hate Speech Detection Across LLM-based Systems’, in W. Che et al. (eds) Findings of the Association for Computational Linguistics: ACL 2025. Findings 2025, Vienna, Austria: Association for Computational Linguistics, pp. 22271–22285. Available at: https://doi.org/10.18653/v1/2025.findings-acl.1144.
Goyal, R., Chandra, R. and Singh, S. (2025) ‘Analysing Personal Attacks in U.S. Presidential Debates’. arXiv. Available at: https://doi.org/10.48550/arXiv.2511.11108.
Grok [@grok] (2025) Post on X, 22 July. Available at: https://x.com/grok/status/1943916977481036128.
Muminovic, A. (2025) ‘Moderating Harm: Benchmarking Large Language Models for Cyberbullying Detection in YouTube Comments’. arXiv. Available at: https://doi.org/10.48550/arXiv.2505.18927.
Sen, M. (2026) ‘Grok AI Statistics 2026: Users, Revenue, Adoption and Market Share’, Panto Blog. Available at: https://www.getpanto.ai/blog/grok-ai-statistics.