Span-based Multi-grained Word Segmentation with Natural Annotations

Abstract: Multi-grained word segmentation (MWS) differs from traditional single-grained word segmentation (SWS) in that it divides a sentence into multiple word sequences at different granularities. Because annotated MWS data is scarce, previous studies have relied on automatically generated pseudo MWS data and treated MWS as a tree parsing task. That approach, however, is limited by the low quality of the pseudo data. In this work, we instead use multiple single-grained datasets directly and apply multi-task learning to MWS. To better handle conflicts between words segmented at different granularities, we employ a span-based word segmentation model. We also incorporate naturally annotated BAIKE data to improve cross-domain performance. Experiments show that our method improves the F1 score by 0.83 on the NEWS dataset and by 4.8 on the BAIKE dataset. Data augmentation yields a further F1 improvement of 2.23 on the BAIKE dataset.
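
To make the span view of MWS concrete, the following is a minimal illustrative sketch and not the model described in this paper. It assumes each single-grained segmentation is given as a list of words over the same sentence, represents every word as a character span (start, end) with an exclusive end, and merges the spans across granularities. The function names (`words_to_spans`, `crossing`, `merge_granularities`) and the example sentence are hypothetical, chosen only for illustration. Spans that partially overlap without nesting are reported as conflicts, which is one plausible reading of the conflicts between granularities mentioned above.

```python
from itertools import combinations


def words_to_spans(words):
    """Convert a word sequence into a set of (start, end) character spans (end exclusive)."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def crossing(a, b):
    """Return True if two spans partially overlap and neither contains the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


def merge_granularities(segmentations):
    """Union the spans of several single-grained segmentations of one sentence.

    Returns the merged span set and the list of crossing (conflicting) span pairs.
    This is an assumed, simplified formulation for illustration only.
    """
    merged = set()
    for words in segmentations:
        merged |= words_to_spans(words)
    conflicts = [
        (a, b) for a, b in combinations(sorted(merged), 2) if crossing(a, b)
    ]
    return merged, conflicts


if __name__ == "__main__":
    # Hypothetical example: a coarse and a fine segmentation of the same string.
    coarse = ["全国各地"]
    fine = ["全国", "各地"]
    spans, conflicts = merge_granularities([coarse, fine])
    print(sorted(spans))
    print(conflicts)
```

In this example the fine-grained words nest inside the coarse-grained word, so the merged spans are compatible with a tree. Two segmentations such as `["ab", "cd"]` and `["a", "bcd"]` would instead produce a crossing pair, which a tree cannot represent but a set of spans can.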
