Preprint | arXiv (Cornell University) | May 2, 2023 | Green OA

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Indexed in: arXiv, DataCite

Abstract

Program synthesis has long been studied, with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test cases, are used to measure the performance of various LLMs on code synthesis. However, these test cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. This limitation in existing benchmarks raises the following question: in the era of LLMs, is the generated code really correct? To answer this, we propose EvalPlus, a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code.…

Citation impact

173 total citations

Authors

4

Topics & keywords

Keywords
  • Benchmark (surveying)
  • Correctness
  • Code (set theory)
  • Computer science
  • Test (biology)
  • Ranking (information retrieval)
  • Code generation
  • Code coverage

Funding