StarCoder: may the source be with you!
Indexed in: arXiv, DataCite
Abstract
The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that…
Citation impact
192 total citations
- FWCI: n/a
- Percentile: n/a
- References: 0
Authors
67
Topics & keywords
Keywords
- Python (programming language)
- Computer science
- Open source
- Programming language
- Source code
- Tracing
- Context (archaeology)
- License