Evaluating large language models in theory of mind tasks
University of Virginia · Stanford University
Abstract
Eleven large language models (LLMs) were assessed using 40 bespoke false-belief tasks, considered a gold standard in testing theory of mind (ToM) in humans. Each task included a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. An LLM had to solve all eight scenarios to solve a single task. Older models solved no tasks; Generative Pre-trained Transformer (GPT)-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of 6-y-old children observed in past studies. We explore the potential interpretation of these results, including…
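The all-or-nothing scoring rule described in the abstract (a task counts as solved only if all eight of its scenarios are solved) can be illustrated with a minimal sketch. The function names, the data layout, and the example results below are hypothetical and are not taken from the study; they only show how a per-task pass rate such as 75% of 40 tasks would follow from per-scenario outcomes.

```python
from typing import Dict, List

# One false-belief scenario + three true-belief controls,
# plus the reversed versions of all four (per the abstract).
SCENARIOS_PER_TASK = 8


def task_solved(scenario_results: List[bool]) -> bool:
    """A task is solved only if every one of its eight scenarios is solved."""
    if len(scenario_results) != SCENARIOS_PER_TASK:
        raise ValueError(f"expected {SCENARIOS_PER_TASK} scenario results")
    return all(scenario_results)


def fraction_solved(results: Dict[str, List[bool]]) -> float:
    """Share of tasks for which all scenarios were solved."""
    if not results:
        raise ValueError("no tasks provided")
    solved = sum(task_solved(r) for r in results.values())
    return solved / len(results)


# Hypothetical outcomes for 40 tasks (made-up data, for illustration only):
# 30 tasks with all eight scenarios correct, 10 tasks failing one scenario.
example = {f"task_{i}": [True] * 8 for i in range(30)}
example.update({f"task_{i}": [True] * 7 + [False] for i in range(30, 40)})

print(fraction_solved(example))  # 0.75
```

Under this rule a single missed control scenario fails the whole task, so the per-task rate is at most as high as the per-scenario accuracy and usually lower.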
Citation impact
- FWCI: 117.31
- Percentile: 100%
- References: 68
Authors
Topics & keywords
- False belief
- Theory of mind
- Bespoke
- Task (project management)
- Inference
- Cognitive psychology
- Comprehension
- Computer science
- Quality Education