Espnet-Summ: Introducing a Novel Large Dataset, Toolkit, and a Cross-Corpora Evaluation of Speech Summarization Systems

Roshan S. Sharma, William Chen, Takatomo Kano, Ruchira Sharma, Siddhant Arora, Shinji Watanabe, Atsunori Ogawa, Marc Delcroix, Rita Singh, Bhiksha Raj

Published: 2023, Last Modified: 12 Dec 2025. Venue: ASRU 2023. License: CC BY-SA 4.0

Abstract: Speech summarization has garnered significant interest and progressed rapidly over the past few years. In particular, end-to-end models have recently emerged as a competitive alternative to cascade systems for abstractive video summarization. This paper aims to establish progress in this rapidly evolving research field by introducing ESPNet-SUMM, a new open-source toolkit that facilitates a comprehensive comparison of end-to-end and cascade speech summarization models on four speech summarization tasks spanning diverse applications. Experiments demonstrate that end-to-end models perform better for larger corpora with shorter inputs. This work also introduces Interview, the largest public open-domain multiparty interview corpus, with 4400 hours of conversations between radio hosts and guests. Finally, this work explores the use of multiple datasets to improve end-to-end summarization, and experiments demonstrate the benefit of multi-style training over fine-tuning.