C-Pack: Packed Resources For General Chinese Embeddings
Beijing Academy of Artificial Intelligence · Renmin University of China · +1 more institution
Abstract
We introduce C-Pack, a package of resources that significantly advances the field of general text embeddings for Chinese. C-Pack includes three critical resources. 1) C-MTP is a massive training dataset for text embedding, which is based on the curation of vast unlabeled corpora and the integration of high-quality labeled corpora. 2) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 3) BGE is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% at the time of release. We also integrate and optimize the entire suite of training methods for BGE. Along with our resources on…
Citation impact
- FWCI: 78.97
- Percentile: 100%
- References: 13
Authors
- 6
Topics & keywords
- Packed bed
- Computer science
- Chemistry
- Chromatography