0.01 Cent per Second: Developing a Cloud-based Cost-effective Audio Transcription System for an Online Video Learning Platform

Abstract: Using automatic speech recognition (ASR) to transcribe videos on an online video learning platform can benefit learners in multiple ways. However, existing speech-to-text APIs can be costly to use, especially for the long lecture videos commonly found on such platforms. In this work, we developed a cloud-based ASR system that is cost-optimized for the workload of online learning platforms. We characterized this workload and applied a combination of techniques from two areas: system architecture, including (1) serverless computing, (2) preemptible instances, and (3) batching; and audio transcription optimization, including (1) audio segmentation, (2) cost-based segment merging, and (3) a locally hosted transcription model. Together, these techniques provide a low transcription cost per minute of audio. We measured the processing cost, time, and accuracy of our system and showed that it offers accuracy on par with existing speech-to-text services at a significantly lower cost. We have also integrated this system into an online video learning platform.
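To put the per-second figure in the title into the per-minute and per-lecture terms used in the abstract, the following minimal sketch converts a flat per-second rate into a total cost. It assumes, purely for illustration, that the 0.01 cent per second in the title can be treated as a flat rate; the function name `cost_in_cents` and the one-hour lecture length are hypothetical and not taken from the paper.

```python
from decimal import Decimal

# Assumption: the title's figure is treated as a flat rate for illustration only.
TITLE_RATE_CENTS_PER_SECOND = Decimal("0.01")


def cost_in_cents(duration_seconds: int,
                  cents_per_second: Decimal = TITLE_RATE_CENTS_PER_SECOND) -> Decimal:
    """Return the cost, in cents, of transcribing audio at a flat per-second rate."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return Decimal(duration_seconds) * cents_per_second


# One minute of audio and a hypothetical one-hour lecture.
per_minute = cost_in_cents(60)
per_hour_lecture = cost_in_cents(60 * 60)
print(f"per minute: {per_minute} cents, per one-hour lecture: {per_hour_lecture} cents")
```

`Decimal` is used instead of floating point so that small per-second rates accumulate without rounding error over long recordings.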