An Efficient Text Cleaning Pipeline for Clinical Text for Transformer Encoder Models

Published: 2024, Last Modified: 08 Dec 2025, IS 2024, CC BY-SA 4.0
Abstract: Choosing the best text preprocessing strategy in natural language processing (NLP) can be challenging due to the variety of techniques available. Given the popularity of transformer models, we asked whether preprocessing is necessary at all and, if so, which methods improve model performance. Accuracy is especially crucial when working with clinical text data. Our goal was to find a preprocessing pipeline for clinical texts that maintains or improves model performance. We evaluated four common preprocessing techniques and their combinations on two datasets, from MIMIC-III and PubMed, using four models: BERT base, BioBERT, BioClinicalBERT, and RoBERTa. The varied accuracy results from existing techniques motivated us to develop a new pipeline. Our pipeline first removes repeated punctuation, then normalizes the text with a CleanText function, and finally filters less important words using TF-IDF scores, keeping clinically relevant terms while moderating noise. Our results showed that the pipeline outperformed the base models: on the MIMIC-III dataset, the BERT base model achieved 90.16% accuracy, and on the PubMed dataset, BioBERT achieved 64.20% accuracy. We also found that removing stop words decreased accuracy, while TF-IDF filtering either maintained accuracy or improved it by up to 3%. Additionally, because it removes less important words from the documents, our pipeline reduced training time by up to 17%.