WhisperKit: On-device Real-time ASR with Billion-Scale Transformers

Berkin Durmus; Arda Okan; Eduardo Pacheco; Zach Nagengast; Atila Orhon

Published: 10 Jun 2025, Last Modified: 01 Jul 2025. Venue: TTODLer-FM @ ICML 2025 (Oral). License: CC BY 4.0

Keywords: On-device Inference, Real-time Transcription, Speculative Decoding, Model compression, On-device ML, Quantization, Automatic Speech Recognition

TL;DR: We present WhisperKit, an optimized on-device inference system for real-time ASR that significantly outperforms leading cloud-based systems.

Abstract: Real-time Automatic Speech Recognition (ASR) is a fundamental building block for many commercial applications of ML, including live captioning, dictation, meeting transcription, and medical scribes. Accuracy and latency are the most important factors when companies select a system to deploy. We present WhisperKit, an optimized on-device inference system for real-time ASR that significantly outperforms leading cloud-based systems. We benchmark against server-side systems that deploy a diverse set of models, including a frontier model (OpenAI gpt-4o-transcribe), a proprietary model (Deepgram nova-3), and an open-source model (Fireworks large-v3-turbo). Our results show that WhisperKit matches the lowest latency, at 0.46s, while achieving the highest accuracy, at 2.2% word error rate (WER). The optimizations behind the WhisperKit system are described in detail in this paper.

Submission Number: 37