Article | Aug 14, 2023 | Closed access
WhisperX: Time-Accurate Speech Transcription of Long-Form Audio
Indexed in Crossref
Abstract
Figure 1: WhisperX: We present a system for efficient speech transcription of long-form audio with word-level time alignment. The input audio is first segmented with Voice Activity Detection and then cut & merged into approximately 30-second input chunks with boundaries that lie on minimally active speech regions. The resulting chunks are then: (i) transcribed in parallel with Whisper and (ii) force-aligned with a phone recognition model to produce accurate word-level timestamps at high throughput.
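The cut & merge step in the caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `cut_and_merge`, the `(start, end)` tuple representation of VAD segments, and the greedy merge strategy are all assumptions made for illustration. In particular, over-long segments are cut into equal pieces here, whereas the caption describes boundaries placed on minimally active speech regions, which would require per-frame VAD scores.

```python
import math
from typing import List, Tuple

# Hypothetical representation: one speech region as (start_seconds, end_seconds).
Segment = Tuple[float, float]


def cut_and_merge(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Cut over-long speech segments, then greedily merge neighbours up to max_len."""
    # Cut: split any segment longer than max_len into equal pieces.
    # Assumption: equal-length cuts stand in for cutting at the point of
    # minimal speech activity, which this sketch does not model.
    pieces: List[Segment] = []
    for start, end in sorted(segments):
        n = max(1, math.ceil((end - start) / max_len))
        step = (end - start) / n
        for i in range(n):
            pieces.append((start + i * step, start + (i + 1) * step))

    # Merge: extend the current chunk while its total span stays within max_len.
    chunks: List[Segment] = []
    for start, end in pieces:
        if chunks and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks


if __name__ == "__main__":
    vad_segments = [(0.0, 10.0), (12.0, 25.0), (28.0, 40.0)]
    print(cut_and_merge(vad_segments))
```

Each resulting chunk spans at most `max_len` seconds, so it could be padded to a fixed 30-second window and batched for transcription; the padding and transcription steps themselves are outside this sketch.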
Citation impact
- Total citations: 231
- FWCI: 43.12
- Percentile: 100%
- References: 29
Authors
- 4
Topics & keywords
Topics
Keywords
- Computer science
- Transcription (linguistics)
- Speech recognition
- Speech coding