
How Human and AI Judgments Differ


A series of experiments comparing how people and large language models respond to the same judgment tasks found that the two groups often reach similar conclusions but through processes that differ in a fundamental way.

Walter Quattrociocchi, a professor of computer science at Sapienza University of Rome who studies human judgment and information dynamics, conducted the work with colleagues. His central concern is not that language models produce wrong answers (they often don't) but that they produce plausible-sounding answers through a process that cannot, in principle, distinguish a reliable claim from an unreliable one.

Decades of psychology and neuroscience research have documented how humans evaluate credibility, assess moral situations, and reason about cause and effect. These processes involve checking new information against prior experience, applying knowledge of context and source reliability, and using causal models of how events relate to each other. Large language models have no perceptual experience of the world. They are trained on statistical patterns in text, with no access to the events, objects, or relationships those words describe.
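To make that distinction concrete, here is a deliberately crude sketch: a toy bigram model, nothing like a real LLM in scale or architecture, but enough to show what "predicting from statistical patterns in text" means. The next word is chosen purely from co-occurrence counts in the training text; nothing in the program represents the sources, events, or facts the words refer to. The corpus and function names are illustrative, not drawn from the study.

```python
from collections import Counter, defaultdict
import random

# Toy training text: the model will only ever "know" which words followed which.
corpus = "the source is reliable . the source is credible . the claim is credible".split()

# Count how often each word follows each other word (bigram statistics).
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Sample the next word in proportion to how often it followed `word` in the corpus."""
    counts = bigram_counts[word]
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights)[0]

# The continuation is fluent-looking, but it reflects only word frequencies,
# not any assessment of whether the source or claim is actually reliable.
print(predict_next("is"))  # e.g. "credible" or "reliable", chosen by frequency alone
```

A real language model replaces the bigram table with a neural network trained on vastly more text, but the point the researchers stress is the same: the output is a prediction over words, not a judgment about the world.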

In one experiment, 50 human participants and six LLMs were each presented with a set of news sources and asked to rate each source's credibility and explain their reasoning. In separate experiments, the same populations were given moral dilemmas and asked to reason through them. In both cases, the researchers examined not just the conclusions but the justifications: what criteria were cited and what kind of reasoning was offered.
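For readers who want a sense of what such an elicitation step might look like in practice, the following is a hypothetical sketch, not the researchers' actual protocol or code; the prompt wording, function names, and rating scale are assumptions made for illustration.

```python
# Hypothetical elicitation sketch: ask a model for a credibility score plus a
# free-text justification for each source, since the analysis described above
# focused on the justification text as much as the numeric verdict.

PROMPT_TEMPLATE = (
    "Rate the credibility of the news source '{source}' on a scale from 1 "
    "(not credible) to 7 (highly credible), and explain your reasoning."
)

def ask_model(prompt: str) -> str:
    """Placeholder for a call to an LLM API; swap in a real client here."""
    return "4. The phrasing resembles that of outlets usually described as mixed."

def collect_judgments(sources: list[str]) -> dict[str, str]:
    """Collect one rating-plus-justification response per source."""
    return {s: ask_model(PROMPT_TEMPLATE.format(source=s)) for s in sources}

print(collect_judgments(["Example Daily News"]))
```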

The source document is a magazine essay rather than a full research paper, and methodological details (participant recruitment, LLM selection criteria, and statistical analysis) are not described. The findings should be read as exploratory.

On the credibility task, even when LLMs reached verdicts similar to those of human participants, the justifications consistently differed. Human participants grounded their assessments in things like a source's known track record, consistency with established facts, and whether a claim fit plausibly into a broader chain of events. LLM justifications reflected linguistic patterns (how certain word combinations typically appear together) rather than any reference to external events or experience.

On the moral dilemma tasks, LLMs produced language that mirrored the vocabulary humans use when deliberating about ethics, including causal and counterfactual phrasing. Quattrociocchi notes that this surface resemblance is not the same as reasoning: “The model is not imagining anything or engaging in any deliberation; it is just reproducing patterns in people’s speech or writing about these counterfactuals.”

The consistent pattern across tasks, as the researchers describe it: “Where a human judges, a model correlates. Where a human evaluates, a model predicts. Where a human engages with the world, a model engages with a distribution of words.”

To describe the interpretive problem this creates, the researchers introduce the term epistemia: a situation in which the simulation of knowledge becomes indistinguishable, to the observer, from knowledge itself. The mechanism they identify is specific. Human readers are primed to treat fluency as a signal of understanding. When a response is grammatically smooth, well-structured, and uses the right vocabulary, it reads as credible even when the process generating it has no access to the facts being described. Quattrociocchi writes that the error “happens because the model is fluent, and fluency is something human readers are primed to trust.”

This is what the researchers argue makes the problem qualitatively different from a model simply being wrong. A model that was reliably wrong could be corrected. The deeper issue is that a language model cannot represent truth in any form that would allow it to distinguish a reliable from an unreliable claim, except by pattern-matching to prior text. It cannot check its outputs against the world, because it has no access to the world. “It cannot form beliefs, revise them or check its output against the world,” Quattrociocchi writes.

The researchers conclude that LLMs are powerful tools when used as what they are: instruments of linguistic automation suited to drafting, summarizing, and recombining ideas. The concern arises when they are used in domains such as law, medicine, and psychology, where distinguishing plausibility from truth is operationally necessary. In those contexts, the researchers argue, human oversight is required precisely because the model lacks access to what judgment ultimately depends on.

The small sample sizes (50 people, six LLMs) and the absence of full methodological reporting mean these findings cannot be treated as definitive. The essay does not address whether different LLM architectures, different prompting strategies, or different task domains might produce different patterns. The concept of epistemia is introduced as a theoretical framing rather than a formally validated construct.


Source:

Walter Quattrociocchi, “How AI and Human Judgment Differ,” Scientific American, May 2026. Quattrociocchi is a professor of computer science at Sapienza University of Rome.