Scalable Whole-Slide Vision-Language Modeling with Learned Token Pruning
Keywords: digital pathology, whole slide images, token pruning, vision-language modeling
TL;DR: Token pruning makes whole-slide pathology modeling over 10× more efficient while preserving accuracy by focusing computation on diagnostically relevant regions.
Abstract: Efficient modeling of whole-slide images (WSIs) is a central challenge in digital pathology. A single slide can expand into tens of thousands of patch tokens, pushing beyond the limits of standard transformer architectures and creating prohibitive computational costs. Existing foundation models employ efficient attention mechanisms, yet massive token counts remain a bottleneck. We propose SLIM (Slide-Level Interpretable Modeling with Token Pruning), a whole-slide vision-language framework that makes efficiency a core design principle by integrating token pruning into the slide representation stage. Starting from pretrained CONCH v1.5 patch embeddings, a LongNet-based encoder models the resulting ultra-long sequences while Cropr modules progressively discard low-utility tokens. Unlike token compression or merging, pruning directly shortens sequences, lowering memory and latency while preserving diagnostically relevant context. The pruning signal also offers interpretability, echoing how pathologists scan slides by ignoring background and focusing on salient tissue. For multimodal alignment, we adopt a CLIP-style contrastive objective with PubMedBERT as the text encoder, producing a compact joint space for retrieval and classification. Experiments on TCGA and EBRAINS show that pruning achieves a favorable efficiency-accuracy trade-off: our model matches or exceeds the performance of scale-heavy baselines such as Prov-GigaPath while operating at an order of magnitude lower cost. Our results establish token pruning as a practical and interpretable strategy for scalable whole-slide modeling.
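To make the two mechanisms in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of utility-based token pruning over a long patch-token sequence followed by a CLIP-style contrastive objective. It assumes generic patch embeddings in place of CONCH v1.5 features, a plain transformer layer in place of the LongNet encoder, a simple linear scoring head in place of a Cropr module, and a random text embedding in place of PubMedBERT output; all class and function names below are hypothetical.

```python
# Sketch of learned token pruning for whole-slide modeling with a CLIP-style loss.
# Stand-ins: TransformerEncoderLayer ~ LongNet block, linear score head ~ Cropr module,
# random patch/text embeddings ~ CONCH v1.5 / PubMedBERT features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PruningBlock(nn.Module):
    """One encoder layer plus a learned utility score that keeps only the top-k tokens."""

    def __init__(self, dim: int, keep_ratio: float):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.score = nn.Linear(dim, 1)          # per-token utility
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        tokens = self.layer(tokens)                            # (B, N, D)
        utility = self.score(tokens).squeeze(-1)               # (B, N)
        k = max(1, int(tokens.shape[1] * self.keep_ratio))
        idx = utility.topk(k, dim=1).indices                   # indices of retained tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        return tokens.gather(1, idx)                           # (B, k, D): shorter sequence


class SlideEncoder(nn.Module):
    """Stack of pruning blocks; the token sequence shrinks after every stage."""

    def __init__(self, dim: int = 768, depth: int = 4, keep_ratio: float = 0.5):
        super().__init__()
        self.blocks = nn.ModuleList([PruningBlock(dim, keep_ratio) for _ in range(depth)])
        self.proj = nn.Linear(dim, 512)                        # joint embedding dimension

    def forward(self, patch_embs: torch.Tensor) -> torch.Tensor:
        x = patch_embs                                         # (B, N_patches, D)
        for blk in self.blocks:
            x = blk(x)                                         # progressively pruned
        return F.normalize(self.proj(x.mean(dim=1)), dim=-1)   # (B, 512) slide embedding


def clip_loss(slide_emb: torch.Tensor, text_emb: torch.Tensor, temp: float = 0.07):
    """Symmetric InfoNCE over matched slide/report pairs within a batch."""
    logits = slide_emb @ text_emb.t() / temp
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    # Toy batch: 4 slides with 1024 patch tokens each, paired with 4 text embeddings.
    slides = torch.randn(4, 1024, 768)
    text_emb = F.normalize(torch.randn(4, 512), dim=-1)
    model = SlideEncoder()
    print(clip_loss(model(slides), text_emb).item())
```

With keep_ratio = 0.5 over four stages, the sequence length drops by roughly 16× before pooling, which is the source of the memory and latency savings the abstract describes; the per-token utility scores are also what can be visualized for interpretability.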
Submission Number: 93