WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Harbin Institute of Technology · Nankai University · +3 more institutions

Indexed in: arXiv, Crossref

Abstract

Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising in pre-training. By this means, WavLM does not only keep the speech content modeling capability by the masked speech prediction, but also improves the potential to non-ASR tasks by the speech…

Citation impact

Total citations: 1,655
FWCI: 201.99
Percentile: 100%
References: 133

Authors

19 authors

Topics & keywords

Keywords
  • Computer science
  • Speech recognition
  • Speech processing
  • Voice activity detection
  • Speech enhancement
  • Speech coding
  • Sequence labeling
  • Benchmark (surveying)
UN Sustainable Development Goals
  • Quality Education