Span-based Multi-grained Word Segmentation with Natural Annotations

ACL ARR 2024 June Submission5016 Authors

16 Jun 2024 (modified: 02 Jul 2024), ACL ARR 2024 June Submission, Readers: Everyone, License: CC BY 4.0

Abstract: Multi-grained word segmentation (MWS) differs from traditional single-grained word segmentation (SWS) in that it divides a sentence into multiple word sequences at different granularities. Because annotated MWS data is scarce, previous studies have relied on automatically generated pseudo MWS data and treated MWS as a tree parsing task. However, this approach is limited by the low quality of the pseudo data. In this work, we directly use multiple single-grained datasets and apply multi-task learning to MWS. To better handle conflicts between words segmented at different granularities, we employ a span-based word segmentation model. Additionally, we incorporate naturally annotated BAIKE data to improve cross-domain performance. Experimental results show that our method improves the F1 score by 0.83 on the NEWS dataset and by 4.8 on the BAIKE dataset. Furthermore, data augmentation yields an additional F1 improvement of 2.23 on the BAIKE dataset.
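As a rough illustration of the span view of MWS described in the abstract, the sketch below converts several single-grained segmentations of one sentence into character spans, merges them into a single multi-grained span set, and lists any crossing spans (the kind of conflict that cannot be placed in one tree). It is not the authors' model: the function names `words_to_spans`, `merge_granularities`, and `crossing_pairs`, and the example sentence, are assumptions made purely for illustration.

```python
from typing import List, Set, Tuple

# A span is a half-open character interval [start, end) over the sentence.
Span = Tuple[int, int]


def words_to_spans(words: List[str]) -> Set[Span]:
    """Convert one single-grained segmentation into a set of character spans."""
    spans: Set[Span] = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def merge_granularities(segmentations: List[List[str]]) -> Set[Span]:
    """Union the spans of several segmentations of the same sentence."""
    sentence = "".join(segmentations[0])
    merged: Set[Span] = set()
    for words in segmentations:
        if "".join(words) != sentence:
            raise ValueError("all segmentations must cover the same sentence")
        merged |= words_to_spans(words)
    return merged


def crossing_pairs(spans: Set[Span]) -> List[Tuple[Span, Span]]:
    """Return span pairs that partially overlap (neither nested nor disjoint)."""
    ordered = sorted(spans)
    return [
        (a, b)
        for i, a in enumerate(ordered)
        for b in ordered[i + 1:]
        if a[0] < b[0] < a[1] < b[1]
    ]


# Hypothetical example: a coarse and a fine segmentation of the same string.
coarse = ["全国各地"]
fine = ["全国", "各地"]
multi_grained = merge_granularities([coarse, fine])
conflicts = crossing_pairs(multi_grained)
```

Here the two granularities are nested, so `conflicts` is empty. A pair such as `(0, 2)` and `(1, 3)` would be reported as crossing; a span-based scorer can keep or reject each span on its own rather than forcing all of them into a single tree.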

Paper Type: Long

Research Area: Phonology, Morphology and Word Segmentation

Research Area Keywords: Chinese segmentation

Contribution Types: NLP engineering experiment

Languages Studied: Chinese

Submission Number: 5016
