OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles

Lison, Pierre; Tiedemann, Jörg

doi:10.63317/3fi26b3nobqg

Article · May 1, 2016 · Gold OA

Pierre Lison · Jörg Tiedemann

University of Oslo · University of Helsinki

Indexed in Crossref

Abstract

We present a new major release of the OpenSubtitles collection of parallel corpora. The release is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages. The release also incorporates a number of enhancements in the preprocessing and alignment of the subtitles, such as the automatic correction of OCR errors and the use of meta-data to estimate the quality of each subtitle and score subtitle pairs.

Citation impact

Total citations: 685

FWCI: 92.32
Percentile: 100%
References: 9

Authors: 2

Topics & keywords

Keywords

Subtitle
Computer science
Preprocessor
Natural language processing
Data pre-processing
Information retrieval
Artificial intelligence
Speech recognition
