WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

Bain, Max; Huh, Jaesung; Han, Tengda; Zisserman, Andrew

doi:10.21437/interspeech.2023-78

Article, Aug 14, 2023, Closed access

University of Oxford

Indexed in Crossref

Abstract

Figure 1: WhisperX: We present a system for efficient speech transcription of long-form audio with word-level time alignment. The input audio is first segmented with Voice Activity Detection and then cut and merged into approximately 30-second input chunks whose boundaries lie on minimally active speech regions. The resulting chunks are then (i) transcribed in parallel with Whisper and (ii) force-aligned with a phone recognition model to produce accurate word-level timestamps at high throughput.
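The merge half of the "cut & merge" step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `merge_segments`, the greedy strategy, and the example timestamps are all assumptions made for illustration. It also assumes that each Voice Activity Detection segment is already no longer than 30 seconds, so the cut step is not shown.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of one detected speech region


def merge_segments(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Greedily merge time-ordered speech segments into chunks spanning at most max_len seconds.

    Hypothetical helper for illustration; assumes every input segment is
    itself no longer than max_len.
    """
    chunks: List[Segment] = []
    for start, end in sorted(segments):
        if chunks and end - chunks[-1][0] <= max_len:
            # The merged span still fits, so extend the current chunk.
            # Any silence between the two segments stays inside the chunk.
            chunks[-1] = (chunks[-1][0], max(chunks[-1][1], end))
        else:
            # Adding this segment would exceed max_len: start a new chunk,
            # so the chunk boundary falls in a non-speech gap.
            chunks.append((start, end))
    return chunks


# Made-up VAD output, in seconds.
speech = [(0.5, 9.0), (10.2, 18.0), (19.5, 31.0), (33.0, 40.0), (58.0, 70.0)]
chunks = merge_segments(speech)
print(chunks)
```

Because chunk boundaries only ever fall between detected speech segments, each chunk can be padded to a fixed length and transcribed independently, which is what makes batched, parallel transcription possible.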

Citation impact

Total citations: 231
FWCI: 43.12
Percentile: 100%
References: 29
Authors: 4

Topics & keywords

Keywords

Computer science
Transcription (linguistics)
Speech recognition
Speech coding

Funding

Engineering and Physical Sciences Research Council
Award: EP/T028572/1