VoxMM: Rich Transcription of Conversations in the Wild

Published: 2024, Last Modified: 04 Nov 2025 · ICASSP 2024 · CC BY-SA 4.0
Abstract: This paper presents a multi-modal dataset that contains rich transcriptions of spoken conversations. As diverse multi-modal and multi-task models emerge, there is a growing need for multi-modal training and evaluation datasets accompanied by rich metadata. However, no universal dataset addresses these requirements across the diverse tasks, partly due to the cost of annotation. To overcome this limitation, we develop a semi-automatic pipeline that makes the annotation more feasible. The resulting dataset is VoxMM, a multi-modal, multi-domain dataset. VoxMM incorporates video, audio, and text modalities. In terms of labels, it offers a wide array of metadata, such as speaker labels, transcriptions, gender, and more. VoxMM supports both the training and the evaluation of any-to-any modality mapping models. It also offers a more accurate representation of real-world scenarios, bridging the gap between controlled laboratory experiments and the varying performance observed in the real world. We present initial benchmarks on automatic speech recognition and speaker diarisation. The VoxMM dataset can be downloaded from https://mm.kaist.ac.kr/projects/voxmm