Word Structure Embedding and In-Context Learning for Chinese Segmentation

Published: 2025 · Last modified: 17 Jan 2026 · IEEE Access 2025 · CC BY-SA 4.0
Abstract: We propose a novel approach to text segmentation, formulating the task as binary classification of sentence boundaries as segmentation markers. Boundaries are determined from sentence similarity and correlation, with a particular focus on relationships between words within sentences. To capture these relationships, we introduce Word Structure Embedding and In-Context Learning (WSE-ICL), which connects sentence nodes through shared word nodes. Our method employs a graph convolutional network (GCN) to embed word and [CLS] representations within each sentence, applies a template to fuse sample and residual data for feature extraction in LLMs, and then integrates embeddings and features via in-context learning. We evaluated our approach on five Chinese and cross-lingual datasets under varying data settings ($4\sim 128$ shots) and LLM scales ($0.5\sim 7$B). The results show an average F1 improvement of 1.00% over the previous best-performing methods with a 1.5B model and 128 shots ($p < 0.05$), with gains of 0.67% on Wiki-zh, 1.22% on Stories-zh, 0.61% on News-zh, 0.05% on News-ja, and 1.47% on News-ko. While our approach achieves state-of-the-art performance, particularly in few-shot scenarios, it still faces challenges from the computational cost of GCN-based embeddings and multi-stage ICL processing. Performance also depends on lexicon quality and segmentation assumptions. These limitations highlight trade-offs between segmentation accuracy and computational efficiency, which future work will aim to optimize. The source code of the proposed method is publicly available at https://github.com/na978292231/WSE-ICL/tree/main/WSE-ICL-main
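The graph construction described above (sentence nodes linked through shared word nodes, with features propagated by a GCN and a binary classifier scoring boundaries between consecutive sentences) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the graph, node features, weights, and the logistic boundary scorer are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: 3 sentence nodes (0-2) + 4 word nodes (3-6). An edge links
# a sentence to each word it contains, so sentences sharing a word are
# two hops apart (e.g. sentences 0 and 1 both contain word 4).
edges = [(0, 3), (0, 4), (1, 4), (1, 5), (2, 5), (2, 6)]
n = 7
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A += np.eye(n)  # add self-loops, as in the standard GCN formulation

# Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
A_hat = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Random stand-ins for [CLS]/word node features and the layer weight.
X = rng.standard_normal((n, 8))
W = rng.standard_normal((8, 8))

# One GCN layer with ReLU: each node mixes in its neighbors' features,
# so a sentence embedding absorbs information from shared words.
H = np.maximum(A_hat @ X @ W, 0.0)

def boundary_prob(h_a, h_b):
    """Toy binary boundary score between two adjacent sentences:
    a logistic over the dot product of their propagated embeddings."""
    return 1.0 / (1.0 + np.exp(-h_a @ h_b))

p01 = boundary_prob(H[0], H[1])
print(f"P(boundary between sentence 0 and 1) = {p01:.3f}")
```

In the actual method, the node features would come from a pretrained encoder and the boundary decision is further combined with LLM-extracted features via in-context learning; this sketch only shows the graph-propagation step.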