Spanish Pre-trained BERT Model and Evaluation Data

Cañete, José; Chaperon, Gabriel; Fuentes, Rodrigo; Ho, Jou-Hui; Kang, Ho-Jin; Pérez, Jorge Eduardo Pérez

doi:10.48550/arxiv.2308.02976

Preprint, arXiv (Cornell University), Aug 6, 2023, Green OA

Indexed in: arXiv, DataCite

Abstract

The Spanish language is one of the top 5 spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish language models is not an easy task. In this paper we help bridge this gap by presenting a BERT-based language model pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model, we obtain better results compared to other BERT-based models pre-trained on multilingual corpora for most of the tasks, even achieving a new state-of-the-art on some of them. We have publicly released our model, the…

Citation impact

336 total citations

FWCI: —
Percentile: —
References: 0

Authors: 6

Topics & keywords

Keywords

Computer science
Language model
Benchmark (surveying)
Task (project management)
Bridge (graph theory)
Natural language processing
Artificial intelligence
Training set

UN Sustainable Development Goals

Quality Education
