A More Effective Sentence-Wise Text Segmentation Approach Using BERT

Published: 01 Jan 2021, Last Modified: 13 May 2023, ICDAR (4) 2021, Readers: Everyone
Abstract: Text segmentation is a Natural Language Processing task that divides paragraphs and longer bodies of text into topically and semantically coherent blocks. It plays an important role in, for example, creating structured, searchable text representations of digitized paper documents. Traditionally, text segmentation has been approached with sub-optimal feature engineering and heuristic modelling. We propose a novel supervised training procedure that uses a pre-labeled text corpus, together with an improved deep neural model. We evaluate our results with the Pk and WindowDiff metrics and show that they improve on every currently available public text segmentation system. The proposed system uses Bidirectional Encoder Representations from Transformers (BERT) as its encoder, which feeds several downstream layers ending in a classification output layer, and it shows promise for further gains with future iterations of BERT.
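
The abstract names Pk and WindowDiff as the evaluation metrics but does not define them. As a rough illustration, and not the authors' evaluation code, the sketch below computes both from two equal-length 0/1 sequences, assuming that a 1 at position i marks a segment boundary after sentence i. The function names, the input encoding, and the default window rule (half the average reference segment length) are assumptions made for this example. Both metrics are error rates, so lower is better.

```python
from typing import Optional, Sequence


def default_window(reference: Sequence[int]) -> int:
    """Half the average reference segment length, and at least 1."""
    # Assumption: the final sentence carries no boundary mark,
    # so the number of segments is the number of boundaries plus one.
    num_segments = sum(reference) + 1
    return max(1, round(len(reference) / (2 * num_segments)))


def _validate(reference: Sequence[int], hypothesis: Sequence[int], k: int) -> None:
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must have the same length")
    if not 1 <= k <= len(reference):
        raise ValueError("window size k must be between 1 and the sequence length")


def pk(reference: Sequence[int], hypothesis: Sequence[int],
       k: Optional[int] = None) -> float:
    """Fraction of windows where only one segmentation contains a boundary."""
    k = default_window(reference) if k is None else k
    _validate(reference, hypothesis, k)
    windows = len(reference) - k + 1
    errors = 0
    for i in range(windows):
        ref_split = any(reference[i:i + k])
        hyp_split = any(hypothesis[i:i + k])
        errors += ref_split != hyp_split
    return errors / windows


def windowdiff(reference: Sequence[int], hypothesis: Sequence[int],
               k: Optional[int] = None) -> float:
    """Fraction of windows where the two boundary counts differ."""
    k = default_window(reference) if k is None else k
    _validate(reference, hypothesis, k)
    windows = len(reference) - k + 1
    errors = 0
    for i in range(windows):
        errors += sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
    return errors / windows
```

The two metrics differ only in the per-window test: Pk asks whether each segmentation has any boundary in the window, while WindowDiff compares the boundary counts, so WindowDiff also penalizes a spurious extra boundary placed next to a correct one.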