AlignedFusion: Handling Missing Information in Reports and Inter-Modality Information Imbalance
Keywords: Missing data, Transformer, Pathology, CT lesion detection, Skin tumor
Abstract: Integrating textual reports and visual information is crucial for multimodal medical AI. However, existing approaches face two major challenges: (1) handling omitted information in reports, as textual encoders struggle to differentiate between unmentioned and truly absent attributes, leading to inconsistencies in feature learning; and (2) inter-modality information imbalance, where direct token-wise attention between text and images causes instability due to the disparity in information richness between the modalities. To address these issues, we propose AlignedFusion, a novel multimodal fusion framework with two key components: (1) Attribute-wise Report Token Generation with Masked Token Reconstruction, which structures medical reports into explicit attribute categories and reconstructs missing attributes to reduce feature variance, and (2) Intermediate Token-Based Fusion, which stabilizes multimodal learning by inserting an intermediate token as a bridge between textual and visual representations, ensuring balanced and effective fusion. We evaluate AlignedFusion on four medical analysis tasks using two public and two private datasets, demonstrating its adaptability and robustness. Experimental results show that our approach improves alignment between textual and visual features, mitigates training instability, and enhances predictive performance, advancing the field of multimodal medical AI. Code will be available upon acceptance.
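The abstract describes two concrete mechanisms: replacing unmentioned attribute tokens with a consistent learned representation, and routing cross-modal attention through intermediate bridge tokens instead of attending token-wise between modalities. Since the code is not yet released, the following is only a minimal PyTorch sketch of how such components could look; all names (IntermediateTokenFusion, fill_missing_attributes, num_bridge_tokens, mask_embedding) are hypothetical, not the authors' implementation.

```python
# Minimal sketch of the two ideas named in the abstract, under assumed
# shapes and names. Not the paper's actual code.
import torch
import torch.nn as nn


def fill_missing_attributes(attr_tokens: torch.Tensor,
                            present_mask: torch.Tensor,
                            mask_embedding: torch.Tensor) -> torch.Tensor:
    """Replace tokens for unmentioned attributes with a shared learnable
    embedding, so "not mentioned" is encoded consistently rather than as
    arbitrary encoder output (sketch of the masked-reconstruction idea).

    attr_tokens: (B, A, D), present_mask: (B, A) bool, mask_embedding: (D,)
    """
    return torch.where(present_mask.unsqueeze(-1), attr_tokens, mask_embedding)


class IntermediateTokenFusion(nn.Module):
    """Fuses text and image tokens via a few learnable bridge tokens, so the
    two modalities never attend to each other token-wise directly."""

    def __init__(self, dim: int, num_heads: int = 8, num_bridge_tokens: int = 4):
        super().__init__()
        # Learnable intermediate tokens acting as the bridge between modalities.
        self.bridge = nn.Parameter(torch.randn(1, num_bridge_tokens, dim) * 0.02)
        # Bridge tokens gather information from the (sparser) report tokens...
        self.text_to_bridge = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # ...and the (richer) image tokens then read from the bridge.
        self.bridge_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor,
                image_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, T_text, D), image_tokens: (B, T_img, D)
        bridge = self.bridge.expand(text_tokens.size(0), -1, -1)
        # Step 1: bridge tokens attend to the report tokens.
        bridge, _ = self.text_to_bridge(bridge, text_tokens, text_tokens)
        # Step 2: image tokens attend only to the few bridge tokens, which
        # balances information flow between the unequal modalities.
        fused, _ = self.bridge_to_image(image_tokens, bridge, bridge)
        return self.norm(image_tokens + fused)


# Usage with toy shapes.
fusion = IntermediateTokenFusion(dim=256)
text = torch.randn(2, 12, 256)    # e.g. attribute-wise report tokens
image = torch.randn(2, 196, 256)  # e.g. ViT patch tokens
out = fusion(text, image)         # (2, 196, 256)
```

The design choice being illustrated: because the image side typically carries far more tokens than the report side, funneling attention through a small, fixed number of bridge tokens caps how much one modality can dominate the attention distribution, which is consistent with the stability claim in the abstract.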
Primary Subject Area: Application: Histopathology
Secondary Subject Area: Learning with Noisy Labels and Limited Data
Registration Requirement: Yes
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 85