Submission Track: Short papers presenting ongoing research or work submitted to other venues (up to 5 pages, excluding references)
Keywords: digital pathology, parameter efficient finetuning, vision language models, alignment
TL;DR: We benchmark PEFT for adapting vision–language models to pathology: with full data, PEFT-adapted generalist VLMs nearly match pathology foundation models, but a gap persists in few-shot settings. A novel neuropathology dataset reveals the need for better interpretability and multimodal reasoning.
Abstract: Generalist vision–language models (VLMs) struggle on histopathology tasks due to domain gaps and scarce labels, and pathology foundation models (PFMs) also fall short despite costly pretraining. Parameter-efficient fine-tuning (PEFT) offers a scalable, lightweight alternative that can also improve performance. We present the first benchmark and taxonomy of PEFT for pathology VLMs, organizing methods by adaptation modality, strategy, and locus. We curate a novel neuropathology dataset for detecting neurofibrillary tangles (NFTs), capturing annotator variability to evaluate reliability and alignment. Experiments across prostate cancer, colorectal cancer, and neuropathology tasks show that with full data, PEFT-adapted generalist VLMs rival adapted PFMs, whereas in few-shot settings a residual gap persists due to label scarcity, terminology mismatch, and modality-specific biases. Visualization further reveals that models such as CONCH+MMRL focus on NFTs within annotated boxes, improving interpretability in single-NFT cases, though performance diminishes in complex multi-NFT scenarios. Together, our benchmark and dataset highlight PEFT as a scalable adaptation strategy, while also indicating the need for richer interpretability metrics and improved multimodal reasoning to handle complex cases.
Submission Number: 63