The Cadenza lyric intelligibility prediction (CLIP) dataset

Roa-Dabike, Gerardo; Cox, Trevor J.; Barker, Jon; Fazenda, Bruno; Graetzer, Simone; Vos, Rebecca R.; Akeroyd, Michael A.; Firth, Jennifer; Whitmer, William M.; Bannister, Scott; Greasley, Alinka

doi:10.1016/j.dib.2026.112466

Article · Data in Brief · Jan 14, 2026 · Gold OA

University of Sheffield · University of Salford · +3 more institutions

Indexed in: Crossref, DOAJ, PubMed

Abstract

This paper presents CLIP, a dataset of 11,072 popular Western music signals sourced from independent artists, accompanied by ground-truth lyrics and lyric intelligibility scores from listening tests. The dataset is designed to facilitate music information retrieval (MIR) research using machine learning. It was created to support the development of algorithms that predict lyric intelligibility for the Cadenza ICASSP 2026 Signal Processing Grand Challenge. It is currently the only publicly available large-scale dataset for this task. The music was sourced from the Free Music Archive (FMA) dataset and is unlikely to be familiar to listeners. We excluded tracks whose license did not allow derivative works and…

Citation impact

Total citations: 6

FWCI: 204.41
Percentile: 100%
References: 6

Topics & keywords

Keywords

Intelligibility (philosophy)
Ground truth
Active listening
Music information retrieval
Common ground
German

UN Sustainable Development Goals

Quality Education
