Abstract: Subtitles play a crucial role in improving the accessibility of the vast amount of audiovisual content available on the Internet, allowing audiences worldwide to comprehend and engage with that content in various languages. Automatic subtitling (AS) systems are essential in alleviating the substantial workload of human transcribers and translators. However, existing AS corpora and the primary metric, SubER, focus on European languages. This paper introduces A-TASC, an Asian TED-based automatic subtitling corpus derived from English TED Talks, comprising nearly 800 hours of audio segments, aligned English transcripts, and subtitles in Chinese and Japanese. We also present SacreSubER, a modification of SubER that enables reliable evaluation of subtitle quality. Experiments on our corpus with an end-to-end AS system and with pipeline approaches based on strong ASR and LLMs confirm its quality and reveal differences in AS performance between European and Asian languages.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, automatic evaluation of datasets
Contribution Types: Data resources
Languages Studied: English, Chinese, Japanese
Submission Number: 8054