CS2W: A Chinese Spoken-to-Written Style Conversion Dataset with Multiple Conversion Types

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, Venue: EMNLP 2023 Main
Submission Type: Regular Long Paper
Submission Track: Resources and Evaluation
Submission Track 2: Syntax, Parsing and their Applications
Keywords: Spoken-to-Written Style Conversion, Disfluency Detection, Grammatical Error Correction, ASR
TL;DR: A Fine-grained Chinese Spoken-to-Written Style Conversion Dataset with Multiple Conversion Types
Abstract: Spoken texts, whether transcribed manually or by automatic speech recognition (ASR), often contain disfluencies and grammatical errors, which pose tremendous challenges to downstream tasks. Converting spoken language into written language is therefore desirable. Unfortunately, few datasets are available for this task. To address this issue, we present CS2W, a Chinese Spoken-to-Written style conversion dataset comprising 7,237 spoken sentences extracted from transcribed conversational texts. CS2W covers four types of conversion problems: disfluencies, grammatical errors, ASR transcription errors, and colloquial words. Our annotation convention, data, and code are publicly available at https://github.com/guozishan/CS2W.
Submission Number: 2803