Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Liu, Jiawei; Xia, Chunqiu Steven; Wang, Yuyao; Zhang, Lingming

doi:10.48550/arxiv.2305.01210

Preprint | arXiv (Cornell University) | May 2, 2023 | Green OA

Indexed in: arXiv, DataCite

Abstract

Program synthesis has long been studied, with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test cases, are used to measure the performance of various LLMs on code synthesis. However, these test cases can be limited in both quantity and quality, and so may not fully assess the functional correctness of the generated code. This limitation in existing benchmarks raises the following question: in the era of LLMs, is the generated code really correct? To answer this, we propose EvalPlus, a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code.…
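The limitation described in the abstract can be illustrated with a small sketch: a subtly wrong solution passes a short hand-written test suite and is exposed only when more inputs are added. Everything here is hypothetical and chosen for illustration (the leap-year task, the function names, and the inputs); it is not EvalPlus's API, its benchmark problems, or its test-generation method.

```python
def reference_is_leap(year: int) -> bool:
    """Hypothetical ground-truth solution (Gregorian leap-year rule)."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def candidate_is_leap(year: int) -> bool:
    """Hypothetical generated solution with a subtle bug: it ignores century years."""
    return year % 4 == 0


def passes(candidate, reference, inputs) -> bool:
    """Return True if the candidate agrees with the reference on every input."""
    return all(candidate(x) == reference(x) for x in inputs)


# A small hand-written suite that happens to miss the buggy case.
base_tests = [2020, 2023, 2024]
# The same suite extended with century years (assumed extra inputs).
extra_tests = base_tests + [1900, 2000, 2100]

base_ok = passes(candidate_is_leap, reference_is_leap, base_tests)
extra_ok = passes(candidate_is_leap, reference_is_leap, extra_tests)
print(base_ok, extra_ok)
```

In this sketch the candidate is judged correct on the base suite and incorrect on the extended one, which is the kind of gap between "passes the tests" and "functionally correct" that the abstract points to.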

Citation impact

173 total citations

FWCI: —
Percentile: —
References: 0

Authors: 4

Topics & keywords

Keywords

Benchmark (surveying)
Correctness
Code (set theory)
Computer science
Test (biology)
Ranking (information retrieval)
Code generation
Code coverage

Funding

National Science Foundation
Awards: CCF-2131943, CCF-2141474