AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Zhong, Wanjun; Cui, Ruixiang; Guo, Yiduo; Liang, Yaobo; Lü, Shuai; Wang, Yanlin; Saied, Amin; Chen, Weizhu; Duan, Nan

doi:10.18653/v1/2024.findings-naacl.149

Article · Jan 1, 2024 · Gold OA

Indexed in Crossref

Abstract

Assessing foundation models' abilities for human-level tasks is crucial for Artificial General Intelligence (AGI) development. Traditional benchmarks, which rely on artificial datasets, may not accurately represent these capabilities. In this paper, we introduce AGIEval, a novel bilingual benchmark designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models on our benchmark. Impressively, we show that GPT-4 exceeds the average human performance in SAT, LSAT, and math contests, with 95% accuracy on SAT Math and 92.5% on the…

Citation impact

122 total citations

FWCI: 29.47
Percentile: 100%
References: 0

Authors: 9

Topics & keywords

Topics

BIM and Construction Integration: 25%

Keywords

Benchmark (surveying)
Foundation (evidence)
Computer science
Geology
