<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="/rss.xsl" type="text/xsl"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Jess&apos;s Blog</title><description>A coder-ready Astro blog theme with 59 of your favorite color schemes to choose from</description><link>https://multiterm.stelclementine.com</link><item><title>Sources of truth in the AI age</title><link>https://multiterm.stelclementine.com/posts/sources</link><guid isPermaLink="true">https://multiterm.stelclementine.com/posts/sources</guid><description>What happens when an AI can do a better job making value judgements than humans? What does “better” even mean?</description><pubDate>Mon, 15 Sep 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;What happens when an AI can do a better job making value judgements than humans? What does “better” even mean?
&lt;/p&gt;&lt;p&gt;As part of my PhD research, I collected a series of tweets from two football competitions and asked a panel of human annotators to moderate them for the presence of hate speech. The idea was to study how well various machine learning models could detect hate speech, and to identify patterns in their strengths and weaknesses, by comparing them against reliable labels created by well-trained humans.
&lt;/p&gt;&lt;p&gt;The initial classification was done with a slightly out-of-date fine-tuned BERT model, which often made mistakes (mostly false positives). When I ran the same experiment with modern large language models like DeepSeek or Llama, the improvement in the results was obvious. After looking at the false positives, I decided to do the same for the false negatives (i.e. tweets the panel thought were hateful, but which the models didn’t flag).&lt;/p&gt;
&lt;p&gt;This category was far more interesting, as I often found myself agreeing with the models over the annotators. I had been working on the assumption that the humans were a source of truth, but it became clear that mistakes had crept in: fairly unambiguously “non-hate” tweets were getting labelled as hateful during annotation. There are plenty of possible explanations for this, including fatigue or a confusing user interface. I am really grateful to my annotators and do not consider this evidence of poor work! What is clear is that human annotators were susceptible to certain classes of mistake that LLMs were not.&lt;/p&gt;
&lt;p&gt;The idea that computational moderation might sometimes be more reliable than human teams is alarming, as it delegates moral responsibility to an entity incapable of taking accountability (to adapt that old IBM quote). It seems unlikely that anyone would be happy with an entirely automated approach, but there are plenty of “blended” human-in-the-loop approaches. An LLM could be adapted with a classification head to output a probability that a given post is hate speech, so the most obviously bad content is never seen by a moderator, as in the sketch below. All users could be entitled to a human review on appeal.&lt;/p&gt;
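&lt;p&gt;As a rough sketch of what that could look like (the checkpoint name and thresholds below are placeholders, and the new classification head would still need fine-tuning on labelled tweets before its probabilities mean anything), the HuggingFace transformers library lets you attach a sequence-classification head to a decoder-style LLM:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder base model: any LLM with a sequence-classification
# variant in transformers would do.
CHECKPOINT = "meta-llama/Llama-3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=2  # 0 = not hate, 1 = hate
)
# The new head is randomly initialised here; it needs fine-tuning
# on labelled data before being used for real moderation.

def hate_probability(post):
    """Probability that a single post is hate speech."""
    inputs = tokenizer(post, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def route(post, auto_remove_at=0.95):
    """Keep the most obviously bad content away from moderators."""
    p = hate_probability(post)
    if p &gt;= auto_remove_at:
        return "auto-removed (human review available on appeal)"
    if p &gt;= 0.5:
        return "queued for human review"
    return "published"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Anything the model is very confident about never reaches a moderator, everything in the grey zone goes to a human, and an appeal always gets human eyes.&lt;/p&gt;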
&lt;p&gt;But ultimately this is perhaps a human problem rather than a technical one. Avoiding the fatigue that comes with a dedicated moderator role is tough: perhaps the best way forward is a more holistic approach, so that no one is doing that work for hours on end, rather than the Fordist model in which workers are interchangeable and can be moved between roles with minimal retraining.&lt;/p&gt;
</content:encoded><author>Katy Kookaburra</author></item></channel></rss>