Evaluating large language models in theory of mind tasks

Kosiński, Michał

doi:10.1073/pnas.2405460121

Article · Proceedings of the National Academy of Sciences · Oct 29, 2024 · Hybrid OA

Michał Kosiński

University of Virginia · Stanford University

Indexed in: Crossref, PubMed

Abstract

Eleven large language models (LLMs) were assessed using 40 bespoke false-belief tasks, considered a gold standard in testing theory of mind (ToM) in humans. Each task included a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. An LLM had to solve all eight scenarios to solve a single task. Older models solved no tasks; Generative Pre-trained Transformer (GPT)-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of 6-y-old children observed in past studies. We explore the potential interpretation of these results, including…
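
The all-or-nothing scoring rule described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's code: the function names, the dictionary layout, and the toy results below are assumptions made for the example and are not data from the study.

```python
from typing import Dict, List

# Per the abstract: 1 false-belief scenario + 3 true-belief controls,
# plus the reversed versions of all four.
SCENARIOS_PER_TASK = 8


def task_solved(scenario_results: List[bool]) -> bool:
    """A task counts as solved only if all eight scenarios are answered correctly."""
    if len(scenario_results) != SCENARIOS_PER_TASK:
        raise ValueError(f"expected {SCENARIOS_PER_TASK} scenario results")
    return all(scenario_results)


def fraction_of_tasks_solved(results: Dict[str, List[bool]]) -> float:
    """Share of tasks in which every scenario was solved (hypothetical helper)."""
    if not results:
        raise ValueError("no tasks to score")
    solved = sum(task_solved(r) for r in results.values())
    return solved / len(results)


# Toy input, invented for illustration only (NOT results from the paper).
toy_results = {
    "task_a": [True] * 8,            # all eight correct: solved
    "task_b": [True] * 7 + [False],  # one miss: not solved
    "task_c": [True] * 8,            # solved
    "task_d": [False] + [True] * 7,  # one miss: not solved
}

print(fraction_of_tasks_solved(toy_results))  # 0.5
```

Under this rule a single incorrect scenario fails the whole task. With 40 tasks, the reported 20% corresponds to 8 solved tasks and the reported 75% to 30 solved tasks.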

Citation impact

Total citations: 169

FWCI: 117.31
Percentile: 100%
References: 68

Authors

Michał Kosiński (corresponding author)
University of Virginia, Stanford University

Topics & keywords

Keywords

False belief
Theory of mind
Bespoke
Task (project management)
Inference
Cognitive psychology
Comprehension
Computer science

UN Sustainable Development Goals

Quality Education
