
Qualitative analysis

To supplement the quantitative analysis, we looked in detail at the posts that generated the most disagreement between annotators and identified several themes among them: lack of understanding of context, the ambiguity of humor, the importance of cultural context, the unclear definition of what counts as an insult, and the line between political opinion and offensiveness.

Humor, Sarcasm, and the Nuance of Context

A recurrent limitation of transformer-based models such as HateSpeech BERT lies in their reliance on isolated sentences and specific lexical triggers rather than an understanding of the surrounding conversational environment. Humans, on the other hand, often communicate through ‘coded’ or ‘distorted’ language to discuss sensitive topics, including metaphors or ‘gossip’ vocabularies (Tedeschini & Fasulo, 2026). Studies have also shown that AI models often treat AAVE or phonetic spellings as proxies for aggression, which can lead to false positives in neutral conversations (PsyPost, 2025).
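To make the ‘isolated sentence’ limitation concrete, the sketch below shows how a sentence-level classifier of this kind is typically invoked: it receives a single string and returns a label, with no access to the thread, the speaker’s identity, or community norms. The checkpoint name is a placeholder rather than the exact model evaluated in this study.

```python
# A minimal sketch (not this study's actual pipeline): a sentence-level
# hate-speech classifier sees only the single string it is given.
# The model name below is a placeholder for any fine-tuned checkpoint.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="path/to/hate-speech-checkpoint",  # hypothetical checkpoint name
)

post = "taking down these knotless braids ghetto ass fuck i quit"
print(classifier(post))
# The output is a label and a confidence score for this sentence alone;
# the speaker's identity, the surrounding thread, and community norms
# are simply not part of the model's input.
```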

One of the most significant issues emerging from our analysis is semantic reclamation: the process of marginalized groups reappropriating slurs or derogatory terms in order to signal solidarity, pride, or unity (Popa-Wyatt, 2020). In their 2026 study AIWizards at MULTIPRIDE: A Hierarchical Approach to Slur Reclamation Detection, Tedeschini and Fasulo argue that traditional NLP models fail to categorize such speech correctly because they treat slurs as clear, non-negotiable symbols of hate rather than terms whose meaning is sociolinguistically dynamic. The authors argue that the ‘offensive potential’ of a term is not inherent to the word itself, but rather is contingent on the social identity and intent of the speaker. This tracks in large part with our findings: in many cases where the LLM confidently identified hate speech (likely based on a ‘trigger’ keyword), the human annotators hesitated because the user appeared either to belong to the group in question, to be using the term inoffensively toward someone they know, or to be reclaiming it and directing it at another user who was being hateful. The authors suggest that without identifying “reclamation signals”, such as community-specific identifiers or bio-data, models will default to the most conservative, or restrictive, labeling. Their success in using hierarchical modeling to distinguish between abusive expression and “in-group affirmations” underscores the HateSpeech BERT model’s lack of pragmatic reasoning about the speaker’s potential relationship to the slur.
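To illustrate the general idea (this is a minimal sketch, not Tedeschini and Fasulo’s implementation), the following outlines the hierarchical control flow: a slur match is not treated as a terminal hate label but is handed to a second stage that weighs hypothetical reclamation signals such as in-group membership and affirming tone. The lexicon and helper functions are invented placeholders.

```python
# Illustrative sketch only, not Tedeschini and Fasulo's implementation.
# A slur match is not a terminal "hate" signal; it is passed to a second
# stage that weighs hypothetical reclamation signals before labelling.

SLUR_LEXICON = {"slur_a", "slur_b"}  # placeholder lexicon


def contains_slur(tokens: list[str]) -> bool:
    """Stage 1: does the post contain a term from the slur lexicon?"""
    return any(token in SLUR_LEXICON for token in tokens)


def second_stage(speaker_in_group: bool, affirming_tone: bool) -> str:
    """Stage 2: reached only if a slur is present; decide between an
    in-group affirmation and abusive use based on contextual cues."""
    if speaker_in_group and affirming_tone:
        return "in-group affirmation"
    return "potentially abusive"


def classify(tokens: list[str], speaker_in_group: bool, affirming_tone: bool) -> str:
    if not contains_slur(tokens):
        return "no slur present"
    return second_stage(speaker_in_group, affirming_tone)


# The same token sequence receives different labels depending on the
# (hypothetical) reclamation signals attached to the speaker.
print(classify(["slur_a", "we", "made", "it"], speaker_in_group=True, affirming_tone=True))
print(classify(["slur_a", "get", "out"], speaker_in_group=False, affirming_tone=False))
```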

An example where both human annotators labeled the comment as ‘normal’ while both LLMs labeled it ‘offensive’ was, “i ain’t even gone speak on it just know niggas ain’t who they claim to be on this social media shit” (Twitter-026). In this case, both human annotators assumed that the speaker likely identified as Black and interpreted the language as communal rather than derogatory. Another case is “nigga at work said i look like a subway surfer” (Twitter-163), which both human annotators took as humorous in-group slang rather than an ethnic slur against another party. A third example is “taking down these knotless braids ghetto ass fuck i quit 😂” (Twitter-248). Here, the specific mention of ‘knotless braids’ served as a clear cultural signifier that allowed the human annotators to identify the speaker as Black and consequently to interpret the term as self-deprecating humor, whereas the LLM likely flagged the terms ‘ghetto’ and ‘fuck’ as non-negotiable indicators of offensive speech. This is particularly insightful for our analysis, as it shows that the high ‘de-escalation’ rate among our human annotators is not necessarily indicative of greater leniency, but potentially a reflection of an ability to process context in a way that contemporary AI systems still lack.

A further limitation of transformer-based hate-speech classifiers is their difficulty interpreting humor, irony, and exaggeration. Humor plays a significant role in understanding speech in general and online speech in particular, with forms of humor tied to specific cultural or trend contexts significantly changing the meaning of statements. Fahim et al. (2024) find that humor remains “a hard barrier for machines to decipher” as it relies on “intonation, irony, and contextual nuances” (p.1). Unlike literal statements, humorous posts often involve a mismatch between the literal wording and the intent of the speaker, which creates an ambiguity text models struggle to resolve. This was very visible in our dataset: the statements with the greatest discrepancies between human and AI annotators frequently featured humor. Our group more often classified humorous statements as “normal” or “offensive”, whereas the LLMs and MTurk annotators tended towards hate speech. This suggests that the models still rely heavily on trigger words and sentiment signals rather than being able to decipher intent, and might overmoderate offensive humor online as hate speech.

This distinction between humor and self-referentialism or slur reclamation is relevant because the reappropriation of a term requires the speaker to be a member of the targeted subgroup. In the case of humor this requirement falls away (e.g. a man making a joke about all women); the decisive factors are instead comedic effect and intent. Examples within our dataset where at least one human annotator labeled the post “normal” and none labeled it “hate speech”, while the LLMs labeled it only “offensive” or “hate speech”, include “she say bre you fine i am like bitch break ya back for a stack hoe 🥶” (Twitter-307); “i heard a black girl hit a security guard dyke wit the meanest “ shut up bitch ” i ever heard in atlanta i was cryin dog” (Twitter-324); and “all jokes to the front niggas be queer” (Twitter-332). While the use of derogatory language is unquestionable, and was usually labeled as offensive, the human annotators agreed that these were expressions of humor with no clear intent to insult or discriminate against a specific group. In all cases, the LLMs found the posts to be either offensive or hate speech, indicating an issue similar to the reclamation dilemma: the model does not recognize the nuance of humor within language.

A final driver of AI over-classification is the inability to correctly identify the target of the speech. Reviewing our results qualitatively, many posts that were flagged by Claude and Grok could have been interpreted as self-deprecation or humor. This is significant, as human annotators are far more likely to tag a post as ‘normal’ or ‘offensive’ if the slur is targeted at the speaker themselves (Chance et al., 2026). Chance et al. (2026) find that majority-rule datasets, which are standard for training LLMs, create a “sociological blindness” by defaulting to the most restrictive interpretation of a word, hence failing to capture “the IYKYK (If You Know, You Know) nuances of community-specific speech”. In many ways, the tendency of AI models to apply a narrower definition of acceptable speech in the name of safety creates a paradox: by attempting to protect marginalized groups, models may inadvertently suppress their unique forms of expression, which include irony and self-referential humor.

Cultural contexts in classification

The classification of hate speech is not a neutral act. Every classification decision is influenced by the cultural, linguistic, and biographical position of the annotator, shaping what is perceived as offensive, harmful, or normal. This section examines how divergences between our group’s classifications and those of the MTurk annotators can be partially explained through the lens of cultural literacy and socio-political context.

The research group producing this paper is heterogeneous along several dimensions. Its members were raised and socialised on three continents, speak different languages, and hold four nationalities. Although Anglophone pop culture had a significant impact on their upbringing, English is not the first language of most of them. Their shared denominators are also significant: they were raised in similar socio-economic circumstances, are pursuing the same postgraduate degree, are functionally multilingual, and have been exposed to transatlantic political and media cultures.

The MTurk annotator pool, by contrast, is a black box. HateXplain provides no demographic breakdown of its human annotators. Research on the demographics of the MTurk labour market suggests that over time the worker population has shifted from a primarily US-based workforce to an increasingly international pool of younger, well-educated Indian workers, for whom it functions as full-time employment (Ross et al., 2010).

Empirical research in computational linguistics and social psychology has found that cultural context is an important factor in hate speech identification. Sap et al. (2019) show that annotators from racially marginalised groups are more likely to classify racist content as offensive or hate speech, and attribute this finding to differences in lived experience rather than to oversensitivity. Similarly, Waseem (2016) shows that annotators with training or expertise in relevant topics, such as familiarity with feminist discourse, identify misogynistic content at substantially higher rates than crowdworkers without that background. The distinction between hate speech and mere offensiveness is therefore sensitive to the shared norms of the annotator’s background as well as their in-group and out-group affiliations.

“charlottesville did not have the intended impact only one died no one arrested no neighborhoods burned no police cars torched no stores looted no brick rock throwing at police dindunuffins dindunuffin liquor stores intact non event for antifa blm baltimore should provide much more lively targets” (GAB-127)

The GAB-127 post is a prime example of the importance of cultural context when classifying hate speech, and of how the research group classified it drastically differently from the MTurk annotators: all three MTurk annotators classified it as normal, while both members of the research group classified it as hate speech. The event referred to in the post is the white supremacist “Unite the Right” rally in Charlottesville in August 2017. The rally escalated when protesters encountered counter-protesters and a white supremacist drove his car into a crowd, injuring 35 people and killing Heather Heyer (Spencer, 2017). The author of post GAB-127 frames the event as not having had the “intended impact” because “only one died no one arrested no neighborhoods burned”, presenting the violence as having failed to achieve its goal. This lamentation of the ineffectiveness of violence against counter-protesters, and specifically the use of “dindunuffins”, a slur weaponised within the white nationalist online community as a derogatory dismissal of Black victims of police violence, places the post well within the definition of hate speech set out above, as it encourages violence against protected minorities. Without knowledge of the rally, its white supremacist context, and the specific deployment of “dindunuffin” in far-right discourse, the post is genuinely difficult to classify as anything other than political commentary. The MTurk annotators plausibly lacked this context, especially if they were not American, and misclassified it accordingly.

“the policy was wonderfully effective it succeeded in keeping insane kike with taco beaner last name nicholas cruz from suffering the fate of the school to prison pipeline everybody sing kumbaya kumbaya” (GAB-208)

Post GAB-208 inverts the pattern seen above: the three MTurk annotators classified it as Hate Speech, whereas both group annotators marked it as Normal. The policy the post refers to is the PROMISE discipline programme, created in 2013, which required schools in Broward County, Florida, to refer students to an alternative disciplinary programme instead of law enforcement for a list of non-violent offences. Many conservative commentators argued that this programme made it possible for Nikolas Cruz to carry out the Parkland school shooting (Blad, 2018). The author therefore sarcastically criticises the policy for its ineffectiveness. Although the post contains religious and ethnic slurs such as “kike” and “taco beaner”, it does not incite violence and should not be classified as Hate Speech under our definition. The MTurk annotators possibly knew about the school shooting and automatically read the post as an act of incitement without spotting its criticism of the PROMISE programme, while the group annotators possibly overlooked the slurs and classified the post as Normal.

In sum, it is virtually impossible to classify hate speech neutrally, as personal background and sentiment influence both the annotation and the perception of its gravity. The examples point to a limitation in the HateXplain dataset that goes beyond individual annotation errors: when cultural context is both a necessary resource for accurate classification and unevenly distributed across annotator pools, inter-annotator agreement becomes a poor proxy for annotation validity.
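To make this last point concrete, the toy sketch below computes Cohen’s kappa, a standard inter-annotator agreement statistic, over invented labels: the measure captures how consistently two annotators label the same posts, not whether either reading reflects the cultural context a post actually requires.

```python
# Toy illustration: Cohen's kappa measures how consistently two annotators
# label the same posts, not whether either reading is valid.
# The label lists below are invented, not annotations from this study.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["normal", "offensive", "hate speech", "normal", "offensive", "normal"]
annotator_b = ["normal", "offensive", "hate speech", "normal", "normal", "normal"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
# A high kappa tells us the annotators agree with each other; it says nothing
# about whether either of them had the cultural context the posts required.
```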

When is something an insult?

Insults seem easy to perceive, yet hard to categorise. Some scholars define an insult as “a conspicuous lack of respect for another’s dignity”, yet what counts as a “lack of respect” may change depending on the individual, the context, and the perception of phrasing and meaning (Daly, 2018). Complicating accurate analysis further is the fact that the insult “remains one of the most overlooked […] and underexamined features of everyday social interaction” (Conley, 2010, p.2). Conley provides another definition: “an expression of a severely negative opinion of a person or group in order to subvert their positive self-regard and esteem” (Conley, 2010, p.2). The issue here lies in understanding the intent of the author of the expression and whether it was meant to “subvert their positive self-regard and esteem”. Furthermore, judging whether a text has the power to produce such an impact introduces additional uncertainty when classifying content as an insult under this definition. Whether an insult is intended or not can also influence how it is perceived, and whether the individual observing classifies it as one. These differences in the perception of insults translated into differences in the classification of posts in this study’s dataset. Such differences in classification, seen within the scope of this study but likely present in datasets used for AI training, ultimately affect how AI models later classify posts on social media.

“whoever picked mineecraft is fuckign retarded” (TWI-013)

“i just thought about wtf i am doing and i am fucking retarded” (TWI-041)

“this category is so gay it belongs as a subreddit in the fagposts category” (GAB-471)

The three posts above, taken from this study’s dataset, are instances whose status as an insult can be questioned, which translated into classifications ranging from “Normal” to “Hate Speech”. In all three examples, no group of individuals is directly targeted by the post. Nevertheless, the first two use a derogatory term to describe an experience, and the last (GAB-471) uses an identity to carry negative connotations. Despite the obvious negative consequences for the groups that the authors use to describe these experiences, none of the posts is directly aimed at a group or an individual. This raises two key issues, one logistical and one societal. The first is whether these posts can be classified as insults, given that their content cannot be considered insulting to the direct target chosen. The second is whether the use of these derogatory terms, or of groups as carriers of negative connotation, can be considered insulting given that the connotations they carry arise not from one individual’s usage but from a larger society.

“beat a guy from bejing in the world jousting final found a chink in his armour” (GAB-045)

The post above (GAB-045) also caused divergence between annotators, with labels varying from ‘Normal’ to ‘Offensive’. Whether it can be considered an insult is debatable. The negative implications of the slur used in the post cannot be ignored; however, it is questionable whether the author intended to insult the group affected by the slur, in this case individuals of Chinese origin. Moreover, much like the previous posts analysed in this sub-section, there is no direct target group within the post itself. Whether the mere presence of a slur is enough to deem the post an insult, and therefore ‘Offensive’, is unclear. Finally, the case is further complicated by the dual meaning of the word “chink”, which also denotes a narrow opening, as in the idiom “a chink in one’s armour”. It is plausible that the word was used in this idiomatic sense, and plausible that it was used in both senses at once. Here, differences in how annotators read the wording of the post yield uncertainty and divergence in annotation and classification, and hence affect the AI models trained on these labels.

“i noticed how different the letters on gab look and i started freaking out becuz i thought i must have caught autism from this retarded kid that was coughing at walgreens 🙄” (057-GAB)

Are posts clearly aimed at an individual who is not identified, or identifiable, considered insults on social media? This uncertainty also produced differences within the annotations in our study’s dataset. Despite showing the clear characteristics of an insult, the post does not fit neatly into either the individual-to-group or the individual-to-individual category of insults. That a slight deviation from what is typically considered an insult can produce different classifications of the same post highlights the highly subjective nature of these judgements.

Political opinions

The following section examines how political speech is categorised within the spectrum of normal, offensive, and hate speech. The operational definition of hate speech adopted in some computational and policy contexts treats it as speech expressing hatred towards a target group. Although this definition sometimes lacks the focus on “incitement” that the current paper uses to demarcate offensive speech from hate speech, it is still important to understand what other research treats as “target groups”. These include groups targeted on the basis of their race, ethnicity, nationality, religion, sex, gender, sexual orientation, and disability, a list that mirrors the UN’s list of “targets of hate” (United Nations, no date). Political identities and ideological positioning do not fall within this purview, a boundary that could be defended on the basis that political identities are chosen rather than ascribed. Yet a substantial portion of the contested posts in this dataset concern political content, especially given that the data is sourced from Twitter and Gab, platforms known for their political nature. Many of these posts express partisan ideologies, ideological affirmations or accusations, and politically loaded vocabulary used to applaud or reprimand the behaviour of politicians and other individuals. Recent research analysing Twitter/X usage among a representative sample of US adults found that approximately 33% of tweets produced are political in nature, making the platform-level prevalence of political content non-trivial (Bestvater et al., 2022). The same study notes that classifying tweets as political required a machine learning classifier fine-tuned on 6,413 human-annotated tweets, illustrating that the political/non-political distinction is contested and resists simple rule-based identification. For the purposes of this section, “political” refers to any expression of belief or opinion on governance, public policy, social structure, ideology, or named political figures and movements. The differences in how the various annotators (AI, in-house, and study annotators) classify these statements are notable and reveal two patterns.

The first pattern concerns AI’s inability to distinguish mentioning a political group from targeting one. An example of this is TWT-033:

“remember when you collaborated with neo nazis on a seth rich conspiracy documentary” (TWT-033)

Both in-house annotators and two of the three study annotators classified the post as Normal, recognising it as a political accusation directed at a person. Grok, however, classified it as Hate Speech targeted towards Jews and labelled it as “Pro-Nazi” hate. This deviates significantly from the context of the post, which accuses someone of Nazi collaboration rather than endorsing or colluding with Nazism.

“do not you love when nazi wannabes complain about israel because hey like we do not want them here but we do not want them there either stupidity” (GAB-032)

Similarly, this post, which mocks “nazi wannabes” for their hypocrisy, is once again classified by Grok as Hate Speech towards Jews. It is also worth noting that the only other deviation from consensus came from Study Annotator #1, who classified it as Offensive towards Jews. In these cases, Grok treats the appearance of group-related vocabulary as sufficient evidence of group-targeted hate, regardless of whether the post deploys, criticises, or merely refers to that vocabulary.

Another pattern identified here is topic-associated escalation, wherein posts that do not overtly target a group through slurs or other demographic attacks are classified as hateful or offensive on the basis of the topic alone.

“young white people have been systematically brainwashed into believing that whites invented evil that no one else has ever inflicted any pain on anyone else and the world would be a perfectly harmonious garden of eden again if it wasn’t for those demonic damn caucasians” (GAB-147)

The example above articulates a politically conservative position about anti-white discourse. The post contains no slurs and deploys no group-directed attacks. All human annotators (both the in-house annotators and the study annotators) classified it as Normal, while both Grok and Claude classified it as Hate Speech and identified the target group as “White people.” Here, the LLMs appear to be responding to the topic itself rather than to any single triggering word.

“no my anti feminist beliefs stem from my great love and reverence for women forcing females to be the same as men is not only damaging for society but is unfair to women who must be forced to endure roles for which they were not biologically suited” (GAB-170)

Similarly, in the example above, the author articulates a conservative, anti-feminist position alongside their rationale for holding it. This post was classified as Normal by both LLMs, yet one in-house annotator and one study annotator classified it as Offensive, and another study annotator escalated it to Hate Speech. This dissonance illustrates the difficulty humans themselves face in determining whether political speech is intended to be offensive, hateful, or simply an expression of personal belief.

Another interesting example that highlights the difficulty human annotators face in differentiating the semantics of offensive speech from political expression is GAB-194:

“an illegal immigrant that came here to commit crimes is not the same as someone who comes to work abide by our american laws live in peace deportzarate justiceforkate” (GAB-194)

This post is a politically charged claim that closes with the hashtags #deportzarate and #justiceforkate. Referencing the 2015 killing of Kate Steinle by José Inez García Zárate, an undocumented immigrant whose case became a focal point of restrictionist immigration politics during the Trump campaign, the author implicitly draws a distinction between “good” and “bad” immigrants (Aleaziz, 2024). One in-house annotator and one study annotator classified this as Offensive, whereas both AI annotators classified it as Normal.

The shared difficulty experienced by both AI and human annotators in differentiating between normal, offensive, and hateful speech in political contexts is also visible in the following example:

“fakenews that would be the mainstream media cnn msm msnbc cbs nbc etc so please stop blaming us the voter and the alternative media for your stupidity and you going full retard” (GAB-012)

Both in-house annotators classified this post as Offensive and identified the target as “Journalists, Liberal.” Two of the three study annotators reached the same Offensive verdict, though without specifying a target group. Grok classified it as Offensive towards “Liberals,” while Claude classified it as Normal. When the target is named explicitly (the mainstream media), Grok identifies it “accurately” and arrives at the same verdict as most human annotators. Grok’s systematic failures, examined in the preceding cases, occur precisely when this surface clarity is absent.

Taken together, these examples expose a consistent pattern: political speech sits awkwardly within the offensive-to-hateful spectrum used in this dataset. The posts examined here show what happens when annotators encounter content that draws on the vocabulary of demographic hate (Nazis, neo-Nazis, “white people”, “feminists”) without explicitly directing that hate at a protected group. Grok responds primarily to the vocabulary, whereas Claude responds to the topic. The human annotators, including both the in-house annotators and the study annotators, respond inconsistently to both.

The tendency of different models to systematically over-flag certain conservative political speech and under-flag politely framed prejudice has political implications for social media platforms: it means the suppression of some discourse and the legitimation of prejudice expressed in civil tones. It is this perceived hyper-moderation of content that has fuelled the growth of platforms like Gab. How these systems moderate hate speech online therefore has implications for what counts as an acceptable political opinion and what does not.

Citations

Blad, E. (2018) ‘Controversial discipline program not to blame for Parkland school shooting, commission finds’, Education Week, 10 July. Available at: https://www.edweek.org/leadership/controversial-discipline-program-not-to-blame-for-parkland-school-shooting-commission-finds/2018/07 (Accessed: 26 April 2026).

Conley, T.M. (2010) Toward a rhetoric of insult. Chicago: University of Chicago Press.

Daly, H.L. (2018) ‘On Insults’, Journal of the American Philosophical Association, 4(4), pp. 510–524. https://doi.org/10.1017/apa.2018.29

Ross, J., Irani, L., Silberman, M. S., Zaldivar, A. and Tomlinson, B. (2010) ‘Who are the crowdworkers? Shifting demographics in Mechanical Turk’, CHI ‘10 Extended Abstracts on Human Factors in Computing Systems. New York: Association for Computing Machinery, pp. 2863–2872. https://doi.org/10.1145/1753846.1753873

Sap, M., Card, D., Gabriel, S., Choi, Y. and Smith, N. A. (2019) ‘The risk of racial bias in hate speech detection’, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1668–1678.

Spencer, H. (2017) ‘A far-right gathering bursts into brawls’, The New York Times, 13 August. Available at: https://www.nytimes.com/2017/08/13/us/charlottesville-protests-unite-the-right.html (Accessed: 26 April 2026).

Waseem, Z. (2016) ‘Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter’, Proceedings of the First Workshop on NLP and Computational Social Science, pp. 138–142. https://doi.org/10.18653/v1/W16-5618

Aleaziz, H. (2024) ‘U.S. Plans to Deport Mexican Man Acquitted in Kathryn Steinle Case’, The New York Times, 29 February. Available at: https://www.nytimes.com/2024/02/29/us/politics/mexican-kathryn-steinle-deport.html (Accessed: 29 April 2026).

Bestvater, S. et al. (2022) ‘Political Behavior on Twitter among U.S. adults’, Pew Research Center, 16 June. Available at: https://www.pewresearch.org/politics/2022/06/16/politics-on-twitter-one-third-of-tweets-from-u-s-adults-are-political/ (Accessed: 29 April 2026).

United Nations (no date) Targets of hate. Available at: https://www.un.org/en/hate-speech/impact-and-prevention/targets-of-hate (Accessed: 29 April 2026).

Is AI Ready for Multimodal Hate Speech Detection? A Comprehensive Dataset and Benchmark Evaluation (2026) arXiv. Available at: https://arxiv.org/html/2603.21686v1

IYKYK (But AI Doesn’t): Automated Content Moderation Does Not Capture Communities’ Heterogeneous Attitudes Towards Reclaimed Language (2026) arXiv. Available at: https://arxiv.org/html/2604.16654v2

PsyPost (2025) ‘AI hate speech detectors show major inconsistencies’. Available at: https://www.psypost.org/ai-hate-speech-detectors-show-major-inconsistencies-new-study-reveals/

Popa-Wyatt, M. (2020) ‘Reclamation: Taking Back Control of Words’, Grazer Philosophische Studien, 97(1), pp. 159–176. https://doi.org/10.1163/18756735-09701009

Tedeschini, L. and Fasulo, M. (2026) ‘AIWizards at MULTIPRIDE: A Hierarchical Approach to Slur Reclamation Detection’, arXiv. Available at: https://arxiv.org/abs/2602.12818