SAM-MIL: A Spatial Contextual Aware Multiple Instance Learning Approach for Whole Slide Image Classification
Abstract: Multiple Instance Learning (MIL) is the predominant framework for Whole Slide Image (WSI) classification, covering tasks such as sub-typing and diagnosis. Current MIL models rely predominantly on instance-level features derived from pretrained models such as ResNet: each WSI is divided into independent patches whose features are extracted locally, which discards global spatial context and restricts the model to merely local cues. To address this issue, we propose a novel MIL framework, named SAM-MIL, that emphasizes spatial contextual awareness and explicitly incorporates spatial context by extracting comprehensive, image-level information. The Segment Anything Model (SAM) is a pioneering foundation model for visual segmentation that captures segmentation features without additional fine-tuning, making it well suited to extracting spatial context directly from raw WSIs. Our approach includes group feature extraction based on spatial context and a SAM-Guided Group Masking strategy that mitigates class imbalance: we apply a dynamic mask ratio to each segmentation category and supplement the masked instances with representative group features for each category. Moreover, SAM-MIL splits instances into additional pseudo-bags to augment the training set, and enforces spatial-context consistency across pseudo-bags to further improve performance. Experimental results on the CAMELYON-16 and TCGA lung cancer datasets demonstrate that our proposed SAM-MIL model outperforms existing mainstream methods in WSI classification.
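The sketch below is a minimal, illustrative rendering (not the paper's implementation) of the two mechanisms named in the abstract: SAM-guided group masking with a category-dependent mask ratio, and pseudo-bag splitting. All names (`mask_ratio_for`, `sam_guided_group_masking`, `make_pseudo_bags`), the prevalence-based ratio schedule, and the mean-feature representatives are assumptions made for illustration; it assumes patch features and per-patch SAM segmentation categories are already available.

```python
import numpy as np

# Illustrative sketch only: the paper's exact grouping, mask-ratio schedule,
# and pseudo-bag construction are not specified here; all function names
# below are hypothetical.

rng = np.random.default_rng(0)

def mask_ratio_for(category_count: int, total: int, base: float = 0.5) -> float:
    """Hypothetical dynamic mask ratio: over-represented SAM categories are
    masked more aggressively, one plausible way to counter class imbalance."""
    prevalence = category_count / total
    return min(0.9, base * (1.0 + prevalence))

def sam_guided_group_masking(features: np.ndarray, categories: np.ndarray):
    """Mask a category-dependent fraction of instances and append one
    representative (mean) feature per masked category."""
    kept, representatives = [], []
    for c in np.unique(categories):
        idx = np.flatnonzero(categories == c)
        ratio = mask_ratio_for(len(idx), len(categories))
        n_mask = int(ratio * len(idx))
        masked = rng.choice(idx, size=n_mask, replace=False)
        kept.extend(np.setdiff1d(idx, masked))
        if n_mask > 0:  # summarize the masked group with its mean feature
            representatives.append(features[masked].mean(axis=0))
    out = features[np.array(kept, dtype=int)]
    if representatives:
        out = np.concatenate([out, np.stack(representatives)], axis=0)
    return out

def make_pseudo_bags(features: np.ndarray, n_bags: int = 4):
    """Split one bag's instances into several pseudo-bags to augment training."""
    perm = rng.permutation(len(features))
    return [features[chunk] for chunk in np.array_split(perm, n_bags)]

# Toy usage: 100 patch features (dim 8) with 3 SAM segmentation categories.
feats = rng.normal(size=(100, 8)).astype(np.float32)
cats = rng.integers(0, 3, size=100)
bag = sam_guided_group_masking(feats, cats)
pseudo_bags = make_pseudo_bags(bag)
print(bag.shape, [b.shape for b in pseudo_bags])
```

In this sketch the mask ratio grows with a category's prevalence, so dominant SAM categories contribute fewer raw instances but are still summarized by a representative group feature; the paper's spatial-context consistency constraint across pseudo-bags is not modeled here.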
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Generation] Multimedia Foundation Models, [Content] Media Interpretation
Relevance To Conference: Our work, SAM-MIL, aligns directly with the ACM Multimedia conference's commitment to pioneering multimedia applications and to advancing multimedia foundation models and media interpretation. By introducing a spatial-context-aware Multiple Instance Learning model tailored for Whole Slide Image (WSI) classification, this research leverages multimedia data in a novel way, emphasizing the integration of spatial contextual information. The approach not only advances the state of the art in medical imaging but also contributes to the broader discussion of how multimedia methods can enrich data interpretation and user interaction. The use of the Segment Anything Model (SAM) as a foundation model for feature extraction showcases the potential of innovative multimedia technologies to address complex real-world problems such as cancer sub-type diagnosis and treatment planning. By incorporating richer, image-level contextual cues to improve classification performance, our method exemplifies the transformative impact multimedia technologies can have on the accuracy and utility of medical diagnostics, a prime example of multimedia applications at the intersection of technology and healthcare.
Supplementary Material: zip
Submission Number: 4530