Evaluating General-Purpose AI with Psychometrics

Wang, Xiting; Jiang, Liming; Hernández‐Orallo, José; Stillwell, David; Chen, Shiqiang; Sun, Luning; Luo, Fang; Xie, Xing

doi:10.1145/3769688

Preprint · Communications of the ACM · Apr 14, 2026 · Hybrid OA

Beijing Academy of Artificial Intelligence · Annoroad Gene Technology (China) · +8 more institutions

Indexed in: arXiv, Crossref, DataCite

Abstract

Rigorous evaluation of general-purpose AI systems such as large language models should allow for deepened understanding of their capabilities and effective mitigation of their risks. The current evaluation paradigm, mostly reliant on benchmarks aggregating scores on one or more tasks, lacks the scientific machinery for predicting performance on unforeseen tasks and explaining the variability of results. Moreover, existing benchmarks raise growing concerns about their reliability and validity. To tackle these challenges, we vindicate psychometrics, the science of psychological measurement, as a methodology for identifying and measuring constructs that underlie AI performance across multiple tasks. To raise…

Citation impact

Total citations: 6

FWCI: 48.70
Percentile: 99%
References: 0

Authors (8)

Xiting Wang (corresponding author)
Beijing Academy of Artificial Intelligence, Annoroad Gene Technology (China), Renmin University of China, Chinese Academy of Governance

Liming Jiang
Beijing Normal University, Microsoft Research Asia (China)

José Hernández‐Orallo
Leverhulme Trust, Generalitat Valenciana, Universitat Politècnica de València

David Stillwell
University of Cambridge

Shiqiang Chen
Beijing Academy of Artificial Intelligence, Renmin University of China

Topics & keywords

Topics

Explainable Artificial Intelligence (XAI): 89%

Keywords

Computer science
Psychometrics
Construct (python library)
Task (project management)
Reliability (semiconductor)
Data science
Construct validity
Management science

UN Sustainable Development Goals

Peace, Justice and Strong Institutions