Article · Aug 14, 2023 · Closed access

WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

University of Oxford

Indexed in Crossref

Abstract

Figure 1: WhisperX. We present a system for efficient speech transcription of long-form audio with word-level time alignment. The input audio is first segmented with Voice Activity Detection and then cut and merged into approximately 30-second input chunks whose boundaries lie on minimally active speech regions. The resulting chunks are then (i) transcribed in parallel with Whisper and (ii) force-aligned with a phone recognition model to produce accurate word-level timestamps at high throughput.
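The cut-and-merge step described above can be illustrated with a minimal sketch. This is not the WhisperX implementation: the function names `cut_long` and `merge_segments`, the per-frame `scores` list of voice-activity values, and the heuristic of searching for the cut point in the second half of the 30-second window are all assumptions made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def cut_long(seg: Segment, scores: List[float], frame_s: float,
             max_len: float = 30.0) -> List[Segment]:
    """Split a segment longer than max_len at low-activity frames.

    scores[i] is a hypothetical voice-activity score for frame i, where
    frame i starts at i * frame_s seconds. The cut is placed at the
    lowest-scoring frame in the second half of the allowed window
    (an assumed heuristic).
    """
    start, end = seg
    pieces: List[Segment] = []
    while end - start > max_len:
        lo = int((start + max_len / 2) / frame_s)
        hi = int((start + max_len) / frame_s)
        cut_frame = min(range(lo, hi), key=lambda i: scores[i])
        cut = cut_frame * frame_s
        pieces.append((start, cut))
        start = cut
    pieces.append((start, end))
    return pieces


def merge_segments(segs: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Greedily merge neighbouring segments while the merged span
    (including the silence between them) stays within max_len."""
    merged: List[Segment] = []
    for start, end in segs:
        if merged and end - merged[-1][0] <= max_len:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # A 40 s speech segment with a quiet frame at t = 20 s.
    scores = [1.0] * 40
    scores[20] = 0.1
    print(cut_long((0.0, 40.0), scores, frame_s=1.0))

    # Short VAD segments packed into chunks of at most 30 s.
    print(merge_segments([(0, 10), (12, 25), (26, 40), (41, 50)]))
```

Merging short segments keeps each chunk close to the 30-second input length, while cutting at the quietest frame reduces the chance of splitting a word across two chunks.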

Citation impact

Total citations: 231
FWCI: 43.12
Percentile: 100%
References: 29

Authors

4

Topics & keywords

Keywords
  • Computer science
  • Transcription (linguistics)
  • Speech recognition
  • Speech coding

Funding