Simulating Inter-observer Variability Across Clinical Experience Levels for Brain Tumour Segmentation
Abstract: Human-AI collaboration is essential for the development of trustworthy deep learning (DL) models for medical image analysis. However, datasets annotated by multiple clinical experts can introduce inter-observer variability, which can in turn give rise to annotator biases that may be learned by the DL model. Assessment of these biases is often hindered by the limited availability of multi-observer annotations for the same datasets. To address this limitation, we present a novel simulation framework that generates realistic variations in annotated segmentations to mimic inter-observer differences across simulated human experts with varying experience levels. Using brain tumour segmentation as a representative case study, we simulated three observer labels to train DL models. Our results show that DL models learn observer-specific annotation styles. For example, models trained on data from a simulated senior radiologist with a tendency to under-segment the tumour tissue achieved higher performance than those trained on over-segmented annotations. Inter-observer agreement was not strictly correlated with either experience level or downstream DL model performance, demonstrating the complexity of annotation biases. Additionally, datasets with single ground-truth labels may mask important differences arising from learned annotation bias and may lead to over- or underestimation of model performance. Human-AI collaboration, although necessary for medical imaging tasks, can introduce biases that negatively affect model segmentation performance and may undermine fairness, trust, and transparency. Our study takes an essential step toward understanding these risks and provides insights that support the development of human-AI collaborative systems designed for real-world clinical applicability.
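The abstract does not specify how the simulated observers are generated. A common way to mimic under- and over-segmenting annotators is to apply morphological erosion or dilation to a ground-truth mask and add stochastic boundary noise; the sketch below illustrates that idea only. The function simulate_observer, its parameters, and the observer configurations are hypothetical assumptions, not the paper's implementation.

```python
import numpy as np
from scipy import ndimage


def simulate_observer(gt_mask, bias="under", iterations=2,
                      noise_prob=0.05, rng=None):
    """Perturb a binary ground-truth mask to mimic one observer's style.

    Hypothetical sketch: bias="under" erodes the mask (under-segmentation,
    e.g. a conservative senior reader); bias="over" dilates it
    (over-segmentation). Random flips near the boundary add
    per-observer noise on top of the systematic bias.
    """
    rng = rng or np.random.default_rng()
    mask = gt_mask.astype(bool)

    # Systematic bias: shrink or grow the tumour boundary.
    if bias == "under":
        mask = ndimage.binary_erosion(mask, iterations=iterations)
    elif bias == "over":
        mask = ndimage.binary_dilation(mask, iterations=iterations)

    # Stochastic component: flip a fraction of voxels in a thin
    # band around the (biased) boundary.
    boundary = ndimage.binary_dilation(mask) ^ ndimage.binary_erosion(mask)
    flips = boundary & (rng.random(mask.shape) < noise_prob)
    return mask ^ flips


# Three illustrative observers with different experience levels
# (configurations are assumptions for this sketch):
# senior = simulate_observer(gt, bias="under", iterations=2, noise_prob=0.05)
# mid    = simulate_observer(gt, bias="under", iterations=1, noise_prob=0.10)
# junior = simulate_observer(gt, bias="over",  iterations=2, noise_prob=0.15)
```

Each simulated label set could then be used to train a separate DL model, allowing observer-specific annotation styles to be compared against a common reference, in the spirit of the study's three-observer setup.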