Preprint · Communications of the ACM · Apr 14, 2026 · Hybrid OA

Evaluating General-Purpose AI with Psychometrics

Xiting Wang, Liming Jiang, José Hernández‐Orallo, David Stillwell, Shiqiang Chen

Beijing Academy of Artificial Intelligence · Annoroad Gene Technology (China) · +8 more institutions

Indexed in: arXiv, Crossref, DataCite

Abstract

Rigorous evaluation of general-purpose AI systems such as large language models should allow for deepened understanding of their capabilities and effective mitigation of their risks. The current evaluation paradigm, mostly reliant on benchmarks aggregating scores on one or more tasks, lacks the scientific machinery for predicting performance on unforeseen tasks and explaining the variability of results. Moreover, existing benchmarks raise growing concerns about their reliability and validity. To tackle these challenges, we vindicate psychometrics, the science of psychological measurement, as a methodology for identifying and measuring constructs that underlie AI performance across multiple tasks. To raise…

Citation impact

  • Total citations: 6
  • FWCI: 48.70
  • Percentile: 99%
  • References: 0

Authors

8

Topics & keywords

Keywords
  • Computer science
  • Psychometrics
  • Construct (python library)
  • Task (project management)
  • Reliability (semiconductor)
  • Data science
  • Construct validity
  • Management science
UN Sustainable Development Goals
  • Peace, Justice and strong institutions

Funding