Mining Word Boundaries from Speech for Cross-domain Chinese Word Segmentation

ACL ARR 2024 June Submission4633 Authors

16 Jun 2024 (modified: 02 Jul 2024) | ACL ARR 2024 June Submission | Readers: Everyone | License: CC BY 4.0
Abstract: Inspired by early research on naturally annotated data for Chinese Word Segmentation (CWS) and by recent work on integrating speech and text processing, this work proposes, for the first time, to explicitly mine word boundaries from parallel speech-text data. We use the Montreal Forced Aligner (MFA) toolkit to perform character-level alignment on speech-text data and treat the resulting pauses as candidate word boundaries. Based on a detailed analysis of the collected pauses, we propose an effective probability-based strategy for filtering out unreliable word boundaries. To make more effective use of the mined word boundaries as extra training data, we also propose a robust complete-then-train (CTT) strategy. We conduct cross-domain CWS experiments on two target domains, ZX and AISHELL2, and annotate about 900 sentences as evaluation data for AISHELL2. Experiments demonstrate the effectiveness of the proposed approach.
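The pause-mining idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `(character, start, end)` alignment tuples are a hypothetical stand-in for a forced aligner's character-level output, and the pair-frequency filter, the function names, and the thresholds `min_pause` and `min_prob` are assumptions chosen only to show one way a probability-based filter could work.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical alignment format: (character, start_sec, end_sec).
# A real forced aligner's output would need to be converted into this shape.
Alignment = List[Tuple[str, float, float]]


def candidate_boundaries(chars: Alignment, min_pause: float = 0.05) -> List[int]:
    """Return each index i where a silence longer than min_pause
    separates chars[i] from chars[i + 1]."""
    return [
        i
        for i in range(len(chars) - 1)
        if chars[i + 1][1] - chars[i][2] > min_pause
    ]


def boundary_probabilities(
    corpus: List[Alignment], min_pause: float = 0.05
) -> Dict[Tuple[str, str], float]:
    """Estimate, for every adjacent character pair, the fraction of its
    occurrences in which a pause falls between the two characters.
    (Assumed filtering statistic, not taken from the paper.)"""
    seen: Counter = Counter()
    paused: Counter = Counter()
    for chars in corpus:
        pauses = set(candidate_boundaries(chars, min_pause))
        for i in range(len(chars) - 1):
            pair = (chars[i][0], chars[i + 1][0])
            seen[pair] += 1
            if i in pauses:
                paused[pair] += 1
    return {pair: paused[pair] / seen[pair] for pair in seen}


def filter_boundaries(
    chars: Alignment,
    probs: Dict[Tuple[str, str], float],
    min_prob: float = 0.6,
    min_pause: float = 0.05,
) -> List[int]:
    """Keep only candidate boundaries whose character pair is paused
    at least min_prob of the time across the corpus."""
    return [
        i
        for i in candidate_boundaries(chars, min_pause)
        if probs.get((chars[i][0], chars[i + 1][0]), 0.0) >= min_prob
    ]


# Toy data: the same three characters spoken twice with different pausing.
utt1 = [("我", 0.0, 0.2), ("们", 0.2, 0.4), ("去", 0.5, 0.7)]
utt2 = [("我", 0.0, 0.2), ("们", 0.3, 0.5), ("去", 0.6, 0.8)]
probs = boundary_probabilities([utt1, utt2])
kept = filter_boundaries(utt2, probs)
```

In this toy corpus the pause between 我 and 们 occurs in only one of the two utterances, so it falls below the assumed `min_prob` threshold and is dropped, while the pause before 去 occurs in both and is kept as a word boundary.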
Paper Type: Long
Research Area: Phonology, Morphology and Word Segmentation
Research Area Keywords: Chinese segmentation
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: Chinese
Submission Number: 4633