Keywords: Medical Imaging, Noisy Labels, Annotation Error Detection, Multi‑Model Consistency, Explainable AI
TL;DR: When multiple models make the same mistake for the same reason, it’s often the annotation label, and not the model, that’s wrong.
Registration Requirement: Yes
Abstract: Accurate annotation is central to medical imaging AI. However, manual labeling remains error‑prone due to operator variability, low contrast, and subtle or transient anatomical boundaries. These errors are often instance‑dependent, arising precisely on the most clinically challenging frames, where conventional techniques such as confidence thresholding or repeated model training struggle to distinguish hard‑but‑correct samples from genuinely misannotated ones. Our analysis reveals that modern regularized deep networks tolerate random label noise, yet exhibit consistent, convergent errors when the dataset label itself is incompatible with the underlying image content. Motivated by this, we introduce a simple, architecture‑agnostic detector that identifies potential misannotations by jointly requiring unanimous disagreement with the provided label across diverse models and high cross‑model Grad‑CAM agreement. Frames with low spatial consensus are instead attributed to heterogeneous model errors rather than label corruption. Across MRI and X‑ray datasets with 5% synthetic label corruption, this dual‑consistency criterion recovers mislabeled samples with F1‑scores of 93.00% and 96.00%, outperforming the strongest state‑of‑the‑art noisy‑label baselines by 8.14% and 3.23%, respectively. Qualitative examples further show that flagged cases exhibit stable saliency across models. These results suggest that cross‑model semantic alignment against the provided label is a reliable and interpretable indicator of annotation error, enabling efficient, high‑precision data auditing without requiring clean subsets or repeated retraining.
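A minimal sketch of the dual‑consistency criterion described in the abstract, assuming a quantile‑binarized IoU as the cross‑model Grad‑CAM agreement measure; the names (saliency_agreement, flag_misannotation), the 0.8 binarization quantile, and the consensus threshold tau are illustrative assumptions, not the paper's exact formulation:

```python
# Hypothetical sketch: a frame is flagged as a suspected misannotation only when
# (a) every model disagrees with the dataset label (unanimous confusion), and
# (b) the models' Grad-CAM maps agree spatially (high cross-model consensus).
# Thresholds and the IoU-based agreement measure are assumptions for illustration.
import numpy as np

def saliency_agreement(cams, binarize_q=0.8):
    """Mean pairwise IoU of quantile-thresholded Grad-CAM maps across models."""
    masks = [cam >= np.quantile(cam, binarize_q) for cam in cams]
    ious = []
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            inter = np.logical_and(masks[i], masks[j]).sum()
            union = np.logical_or(masks[i], masks[j]).sum()
            ious.append(inter / union if union else 0.0)
    return float(np.mean(ious))

def flag_misannotation(label, model_preds, model_cams, tau=0.5):
    """Flag a frame when all models disagree with the label AND
    their saliency maps spatially agree (consensus above tau)."""
    unanimous_disagreement = all(p != label for p in model_preds)
    if not unanimous_disagreement:
        return False  # low or mixed disagreement: not attributed to the label
    return saliency_agreement(model_cams) >= tau

# Toy usage: three models, one 8x8 Grad-CAM map each.
rng = np.random.default_rng(0)
cams = [rng.random((8, 8)) for _ in range(3)]
print(flag_misannotation(label=1, model_preds=[0, 0, 0], model_cams=cams))
```

Frames passing the first test but failing the saliency consensus would, per the abstract, be attributed to heterogeneous model errors rather than label corruption.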
Visa & Travel: No
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 62