Preprint | arXiv (Cornell University) | May 9, 2023 | Green OA

StarCoder: may the source be with you!

Indexed in: arXiv, DataCite

Abstract

The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that…

Citation impact

192 total citations
References
0
Citations per year

Authors

67

Topics & keywords

Keywords
  • Python (programming language)
  • Computer science
  • Open source
  • Programming language
  • Source code
  • Tracing
  • Context (archaeology)
  • License

Funding