Holistic Evaluation of Language Models
Stanley Foundation · Stanford University
Indexed in: Crossref, PubMed
Abstract
Language models (LMs) like GPT-3, PaLM, and ChatGPT are the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of LMs. LMs can serve many purposes and their behavior should satisfy many desiderata. To navigate the vast space of potential scenarios and metrics, we taxonomize the space and select representative subsets. We evaluate models on 16 core scenarios and 7 metrics, exposing important trade-offs. We supplement our core evaluation with seven targeted evaluations to deeply analyze specific aspects (including world knowledge, reasoning,…
Citation impact
- Total citations: 434
- FWCI: 71.30
- Percentile: 100%
- References: 77
Authors: 3
Topics & keywords
Keywords
- Computer science