Preprocessing Pathology Reports for Vision-Language Model Development

Ruben Lucassen; Tijn van de Luijtgaarden; Sander Moonemans; Willeke Blokx; Mitko Veta

Preprocessing Pathology Reports for Vision-Language Model Development

Ruben Lucassen, Tijn van de Luijtgaarden, Sander Moonemans, Willeke Blokx, Mitko Veta

Published: 16 Jul 2024, Last Modified: 22 Aug 2024COMPAYL 2024EveryoneRevisionsBibTeXCC BY 4.0

Keywords: pathology report, language models, translation, subsentence segmentation

TL;DR: Development of language models for translation from Dutch to English and subsentence segmentation for cutaneous melanocytic lesion pathology reports.

Abstract: Pathology reports are increasingly being used for development of vision-language models. Because the reports often include information that cannot directly be derived from paired images, careful selection of information is required to prevent hallucinations in tasks like report generation. In this paper, we present a language model for subsentence segmentation based on the information content, as part of a preprocessing workflow for 27,500 pathology reports of cutaneous melanocytic lesions. After initial clean up, the reports were first translated from Dutch to English and then segmented by separate language models. Both models were developed using an iterative approach, in which the development dataset was expanded with manually corrected model predictions for previously unannotated reports before finetuning the next version of the models. Over the course of eight iterations, the development dataset was in the end scaled up to 1,500 translated and annotated reports. On the independent test set of 3,597 sentences from 150 reports, 219 translation errors (6,1%) of different severities were counted. The subsentence segmentation model achieved a strong predictive performance on the test set with a macro average F1-score of 0.921 (95% CI, 0.890-0.940) and a weighted average F1-score of 0.952 (95% CI, 0.944-0.960) over 13 different classes. The remaining 25,850 unannotated reports were translated and segmented using the final models to complete the dataset preprocessing. Differences in word count and class distribution between section types of the reports were explored in preparation for future vision-language modelling. The presented methodology is generic and can, therefore, easily be extended to multiple or different pathology domains beyond melanocytic skin lesions. Code and trained model parameters are made publicly available.

Submission Number: 9

Loading