Abstract: Pre-trained Language Models are the current state of the art in many natural language processing tasks. These models rely on subword-based tokenization to solve the problem of out-of-vocabulary words. However, commonly used subword segmentation methods have no linguistic foundation. In this paper, we investigate the hypothesis that the study of internal word structure (i.e., morphology) can offer informed priors to these models, such that they perform better on common tasks. We employ an unsupervised morpheme discovery method in a new word segmentation approach, which we call Morphologically Informed Segmentation (MIS), to test our hypothesis. Experiments with MIS on several natural language understanding tasks (text classification, recognizing textual entailment, and question answering), in Portuguese, yielded promising results compared to a WordPiece baseline.
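To illustrate the contrast the abstract draws, the sketch below implements the greedy longest-match (MaxMatch) procedure that WordPiece-style tokenizers use at inference time, applied to two hypothetical vocabularies: a frequency-driven one whose splits ignore morpheme boundaries, and a morpheme-aware one. The vocabularies and the example word are invented for illustration; they are not the paper's MIS method or its actual data.

```python
# Toy illustration of WordPiece-style greedy longest-match segmentation.
# Vocabularies here are hypothetical, not taken from the paper.

def maxmatch(word, vocab):
    """Segment `word` greedily into the longest subwords present in `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # nothing matched: fall back to a single character
            pieces.append(word[start])
            start += 1
        else:
            pieces.append(word[start:end])
            start = end
    return pieces

# A frequency-driven vocabulary may split "gatos" (Portuguese for "cats")
# without regard to its morphology.
freq_vocab = {"ga", "tos", "to"}
# A morpheme-aware vocabulary isolates the root "gat-" and plural suffix "-os".
morph_vocab = {"gat", "os"}

print(maxmatch("gatos", freq_vocab))   # ['ga', 'tos']  (non-morphemic split)
print(maxmatch("gatos", morph_vocab))  # ['gat', 'os']  (morpheme boundaries)
```

The point of the contrast is that both segmentations solve the out-of-vocabulary problem equally well, but only the second aligns subword boundaries with linguistically meaningful units.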