Keywords: Computational Pathology, Whole slide image, Token tuning
Abstract: Whole-slide image (WSI) classification is commonly cast as multiple instance learning (MIL): a slide (bag) is positive if at least one patch (instance) is positive. Attention-based MIL models have become a de facto choice because they produce slide-level predictions and instance-level attention maps. In this paper we show that a simple yet overlooked modification, fine-tuning only the [CLS] token within an attention-based MIL aggregator, consistently and substantially improves slide-level accuracy while reducing trainable parameters and training instability. Concretely, we insert a learnable [CLS] query token that attends to instance embeddings and freeze the rest of the aggregator and the patch encoder; we also introduce a CLS-gate that calibrates attention logits without changing the backbone. Across three public WSI benchmarks and multiple backbones, CLS-tuning yields absolute accuracy gains of +4.02 to +6.34 over strong attention-MIL baselines. We further provide a concise proof that linear combinations of bag features need not be linearly separable, clarifying why learned feature mappings (such as those induced by CLS-tuned attention) can recover linear separability at the bag level. Our approach is drop-in, architecture-agnostic, and training-efficient, making it attractive for large-scale WSI deployment.
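The aggregation step described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: a learnable [CLS] query vector scores each instance embedding, an optional gate (standing in for the paper's CLS-gate, whose exact form is not given here) rescales the attention logits, and a softmax-weighted sum produces the bag embedding. Function and variable names (`cls_attention_pool`, `cls_query`, `gate`) are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cls_attention_pool(instances, cls_query, gate=None):
    """Pool N instance embeddings (N, d) into one bag embedding (d,).

    instances: patch embeddings from a frozen encoder, shape (N, d)
    cls_query: the only trainable vector, shape (d,)
    gate:      optional scalar or (N,) calibration of the logits,
               a stand-in for the paper's CLS-gate (assumption)
    """
    logits = instances @ cls_query      # (N,) attention logits per instance
    if gate is not None:
        logits = logits * gate          # calibrate logits without touching the backbone
    attn = softmax(logits)              # (N,) attention weights, sum to 1
    return attn @ instances             # (d,) bag embedding: convex combination

# Toy usage: 5 instances with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
q = rng.normal(size=8)
bag = cls_attention_pool(X, q)
```

Because only `cls_query` (and the gate, if used) receives gradients, the trainable parameter count is on the order of the embedding dimension, which is the training-efficiency argument made above.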
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 21447