Evaluating General-Purpose AI with Psychometrics
Beijing Academy of Artificial Intelligence · Annoroad Gene Technology (China) · +8 more institutions
Abstract
Rigorous evaluation of general-purpose AI systems such as large language models should allow for a deeper understanding of their capabilities and effective mitigation of their risks. The current evaluation paradigm, which relies mostly on benchmarks that aggregate scores on one or more tasks, lacks the scientific machinery for predicting performance on unforeseen tasks and explaining the variability of results. Moreover, existing benchmarks raise growing concerns about their reliability and validity. To tackle these challenges, we advocate psychometrics, the science of psychological measurement, as a methodology for identifying and measuring constructs that underlie AI performance across multiple tasks. To raise…
Citation impact
- FWCI: 48.70
- Percentile: 99%
- References: 0
Authors (8)
- Xiting Wang (corresponding author)
Beijing Academy of Artificial Intelligence, Annoroad Gene Technology (China), Renmin University of China, Chinese Academy of Governance
- Liming Jiang
Beijing Normal University, Microsoft Research Asia (China)
- José Hernández‐Orallo
Leverhulme Trust, Generalitat Valenciana, Universitat Politècnica de València
- David Stillwell
University of Cambridge
- Shiqiang Chen
Beijing Academy of Artificial Intelligence, Renmin University of China
Topics & keywords
- Computer science
- Psychometrics
- Construct
- Task
- Reliability
- Data science
- Construct validity
- Management science
- Peace, Justice and strong institutions
Funding
- Microsoft Research
- National Natural Science Foundation of China (Awards: 10.13039, 62377003)
- Generalitat Valenciana (Awards: CIPROM/2022/6, 501100011033)
- Defense Advanced Research Projects Agency
- Microsoft Research Asia
- European Regional Development Fund (Awards: MCIN/AEI/10, PID2021-122830OB-C42, 13039/501100011033, 501100011033, RTI2018)
- Agencia Estatal de Investigación (Awards: 501100011033, 13039, PID2021-122830OB-C42, 10.13039, AEI/10, 13039/501100011033, AEI/10.)