Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
Abstract: Diagnosis in histopathology requires a global analysis of whole slide images (WSIs), requiring pathologists to compound evidence from different WSI patches. The gigapixel scale of WSIs poses a challenge for histopathology multi-modal models. Training multi-modal models for histopathology requires instruction tuning datasets, which currently contain information for individual image patches, without a spatial grounding of the concepts within each patch and without a wider view of the WSI. To bridge this gap, we introduce QUILT-INSTRUCT, a large-scale dataset of 107,131 histopathology-specific instruction question/answer pairs, grounded within diagnostically relevant image patches that make up the WSI. Our dataset is collected by leveraging educational histopathology videos from YouTube, which provide spatial localization of narrations by automatically extracting the narrators' cursor positions. QUILT-INSTRUCT supports contextual reasoning by extracting diagnoses and supporting facts from the entire WSI. Using QUILT-INSTRUCT, we train QUILT-LLAVA, which can reason beyond the given single image patch, enabling diagnostic reasoning across patches. To evaluate QUILT-LLAVA, we propose a comprehensive evaluation dataset created from 985 images and 1,283 human-generated question-answer pairs. We also thoroughly evaluate QUILT-LLAVA on public histopathology datasets, where QUILT-LLAVA significantly outperforms SOTA by over 10% on relative GPT-4 score and by 4% and 9% on open- and closed-set VQA.