Keywords: polyp size estimation, depth fusion, geometry-awareness, multi-modal learning, endoscopy
TL;DR: We introduce a polyp size estimation framework that fuses RGB texture, segmentation-derived geometry, and monocular depth via a PISE cross-attention block, resolving size–distance ambiguity and reaching a mean absolute error of 0.93 mm on our clinical dataset.
Abstract: Accurately estimating the physical size of colorectal polyps from monocular endoscopy is
difficult due to scale ambiguity, viewpoint distortions, and strong inter-patient variability.
We introduce MPSE, a geometry-aware, depth-guided multimodal framework that jointly
leverages RGB appearance, monocular depth cues, and interpretable geometry descriptors
to produce reliable and clinically calibrated size estimates. Central to MPSE is a
geometry-as-query fusion block that selectively attends to depth and RGB features, and a
Scale Consistency Block (SCB) that models agreement between 2D footprint–derived and
3D depth–derived cues, reducing size bias under severe distribution imbalance. The model
is trained with a primary regression objective supported by an auxiliary threshold-based
classification loss that stabilizes predictions near clinically important cutoffs. On our clinical
dataset, MPSE achieves a mean absolute error of 0.93 mm and a polyp-level F1 score of
0.87 at the clinically critical 5 mm threshold, demonstrating accurate and clinically reliable
size estimation in endoscopy.
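
The training objective and the two reported metrics can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the regression loss, how the auxiliary classifier is built, or how the terms are weighted. The sketch assumes an L1 regression term, a binary cross-entropy term on a logit derived from the predicted size, and hypothetical hyperparameters `aux_weight` and `temperature`. It also assumes that a polyp counts as positive when its size is at least 5 mm.

```python
import math

THRESHOLD_MM = 5.0  # clinically critical cutoff named in the abstract


def combined_loss(pred_mm, true_mm, aux_weight=0.5, temperature=1.0):
    """L1 regression loss plus a weighted auxiliary threshold loss.

    aux_weight and temperature are hypothetical; the abstract gives no values.
    """
    n = len(pred_mm)
    # Primary objective: mean absolute error in millimetres (assumed L1).
    reg = sum(abs(p - t) for p, t in zip(pred_mm, true_mm)) / n
    bce = 0.0
    for p, t in zip(pred_mm, true_mm):
        # Assumed auxiliary head: signed distance to the cutoff used as a logit.
        logit = (p - THRESHOLD_MM) / temperature
        label = 1.0 if t >= THRESHOLD_MM else 0.0
        # Numerically stable binary cross-entropy with logits.
        bce += max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))
    return reg + aux_weight * bce / n


def mae(pred_mm, true_mm):
    """Mean absolute error in millimetres."""
    return sum(abs(p - t) for p, t in zip(pred_mm, true_mm)) / len(pred_mm)


def f1_at_threshold(pred_mm, true_mm, threshold=THRESHOLD_MM):
    """Polyp-level F1 for the binary decision 'size >= threshold'."""
    tp = fp = fn = 0
    for p, t in zip(pred_mm, true_mm):
        pred_pos, true_pos = p >= threshold, t >= threshold
        if pred_pos and true_pos:
            tp += 1
        elif pred_pos:
            fp += 1
        elif true_pos:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy values for illustration only, not data from the paper.
pred = [3.0, 6.0, 4.0, 7.0]
true = [2.0, 7.0, 6.0, 8.0]
print(mae(pred, true), f1_at_threshold(pred, true), combined_loss(pred, true))
```

In this sketch the auxiliary term penalizes predictions that fall on the wrong side of the 5 mm cutoff more heavily than the regression term alone would, which is one plausible reading of "stabilizes predictions near clinically important cutoffs".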
Primary Subject Area: Application: Endoscopy
Secondary Subject Area: Safe and Trustworthy Learning-assisted Solutions for Medical Imaging
Submission Number: 298