BehancePR: A Punctuation Restoration Dataset for Livestreaming Video TranscriptDownload PDF

Anonymous

08 Mar 2022 (modified: 05 May 2023)NAACL 2022 Conference Blind SubmissionReaders: Everyone
Paper Link: https://openreview.net/forum?id=t6XXc19YlQp
Paper Type: Short paper (up to four pages of content + unlimited references and appendices)
Abstract: Given the increasing number of livestreaming videos, automatic speech recognition and post-processing for livestreaming video transcripts are crucial for efficient data management as well as knowledge mining. A key step in this process is punctuation restoration which restores fundamental text structures such as phrase and sentence boundaries from the video transcripts. This work presents a new human-annotated corpus, called BehancePR, for punctuation restoration in livestreaming video transcripts. Our experiments on BehancePR demonstrate the challenges of punctuation restoration for this domain. Furthermore, we show that popular natural language processing toolkits like Stanford Stanza, Spacy, and Trankit underperform on detecting sentence boundary on non-punctuated transcripts of livestreaming videos. The dataset is publicly accessible at http://github.com/nlp-uoregon/behancepr.
Presentation Mode: This paper will be presented in person in Seattle
Copyright Consent Signature (type Name Or NA If Not Transferrable): VIET DAC LAI
Copyright Consent Name And Address: University of Oregon, 1477 E. 13th Avenue, Eugene, OR 97403, USA
0 Replies

Loading