Article · Proceedings of the National Academy of Sciences · Oct 29, 2024 · Hybrid OA

Evaluating large language models in theory of mind tasks

University of Virginia · Stanford University

Indexed in: Crossref, PubMed

Abstract

Eleven large language models (LLMs) were assessed using 40 bespoke false-belief tasks, considered a gold standard in testing theory of mind (ToM) in humans. Each task included a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. An LLM had to solve all eight scenarios to solve a single task. Older models solved no tasks; Generative Pre-trained Transformer (GPT)-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of 6-y-old children observed in past studies. We explore the potential interpretation of these results, including…
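The all-or-nothing scoring rule described in the abstract (a task counts as solved only when all eight of its scenarios are solved) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the data layout, and the four-task toy input are hypothetical and are not data from the study.

```python
from typing import Dict, List

# Per the abstract: 1 false-belief scenario + 3 true-belief controls,
# plus the reversed versions of all four.
SCENARIOS_PER_TASK = 8


def task_solved(scenario_results: List[bool]) -> bool:
    """Return True only if every one of the eight scenarios was solved."""
    if len(scenario_results) != SCENARIOS_PER_TASK:
        raise ValueError(
            f"expected {SCENARIOS_PER_TASK} scenario results, "
            f"got {len(scenario_results)}"
        )
    return all(scenario_results)


def percent_tasks_solved(results: Dict[str, List[bool]]) -> float:
    """Percentage of tasks for which all eight scenarios were solved."""
    if not results:
        raise ValueError("no task results supplied")
    solved = sum(task_solved(r) for r in results.values())
    return 100.0 * solved / len(results)


# Hypothetical toy input (NOT study data): four tasks, one of which
# fails a single scenario and therefore does not count as solved.
toy_results = {
    "task_1": [True] * 8,
    "task_2": [True] * 8,
    "task_3": [True] * 7 + [False],
    "task_4": [True] * 8,
}

if __name__ == "__main__":
    print(percent_tasks_solved(toy_results))
```

Under this rule a single failed control or reversed scenario costs the whole task, so the reported percentages are stricter than per-scenario accuracy would be. With 40 tasks, 20% corresponds to 8 solved tasks and 75% to 30.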

Citation impact

  • Total citations: 169
  • FWCI: 117.31
  • Percentile: 100%
  • References: 68

Authors

1

Topics & keywords

Keywords
  • False belief
  • Theory of mind
  • Bespoke
  • Task (project management)
  • Inference
  • Cognitive psychology
  • Comprehension
  • Computer science
UN Sustainable Development Goals
  • Quality Education