Text-Informed Image Pruning for Efficient and Interpretable Vision Language Models

Anonymous

25 Jun 2022 · OpenReview Anonymous Preprint Blind Submission
Keywords: efficient vision language models, data pruning, interpretable vision language models, image pruning
TL;DR: We present a Text-informed Image Pruning method that progressively removes text-irrelevant portions of the input image, improving model inference speed, reducing memory footprint and providing interpretability.
Abstract: Large-scale vision language (VL) models use transformers to perform cross-modal interactions between the input text and image. These cross-modal interactions are computationally expensive and memory-intensive due to the quadratic complexity of processing the input image. We present TiP: a Text-informed Image Pruning method that progressively removes text-irrelevant portions of the input image, improving model inference speed and reducing memory footprint. We design several lightweight modules --- token pruners --- and add them to the cross-modal layers of a VL model to predict which image portions are salient. To train TiP, we introduce a text-informed contrastive learning technique that optimizes the representation similarity between the text and the salient, text-relevant image portions predicted by the token pruners. Our neighbor-based continuity regularization loss encourages the pruners to select contiguous segments of the image as relevant. Our evaluation of two vision language models on three downstream VL tasks shows that TiP prunes over 87% of the input image data, increasing inference throughput by over 1.5x and reducing memory footprint by over 36%, while incurring less than a 1% accuracy drop. TiP is also interpretable by construction. Code is available at anonymized_url.
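The abstract does not include code, so the sketch below is only a rough illustration of the ideas it describes: a hypothetical lightweight token pruner that scores image tokens against a pooled text embedding, a neighbor-based continuity regularizer on the predicted keep probabilities, and an InfoNCE-style text-informed contrastive loss. All names, shapes, and hyperparameters here (TokenPruner, keep_ratio, temperature, the top-k selection) are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of text-informed token pruning (not the authors' code).
# Assumed shapes: image tokens (B, N, D), pooled text embedding (B, D).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenPruner(nn.Module):
    """Lightweight module that scores image tokens against the text query."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, 1),
        )

    def forward(self, image_tokens, text_embed):
        B, N, D = image_tokens.shape
        text_exp = text_embed.unsqueeze(1).expand(-1, N, -1)
        logits = self.scorer(torch.cat([image_tokens, text_exp], dim=-1)).squeeze(-1)
        keep_prob = torch.sigmoid(logits)            # per-token saliency, (B, N)
        k = max(1, int(self.keep_ratio * N))
        keep_idx = keep_prob.topk(k, dim=1).indices  # hard top-k selection
        pruned = torch.gather(
            image_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D)
        )
        return pruned, keep_prob


def continuity_loss(keep_prob, grid_h, grid_w):
    """Assumed neighbor-based continuity regularizer: penalize keep probabilities
    that differ from their spatial neighbors on the patch grid."""
    p = keep_prob.view(-1, grid_h, grid_w)
    dh = (p[:, 1:, :] - p[:, :-1, :]).abs().mean()
    dw = (p[:, :, 1:] - p[:, :, :-1]).abs().mean()
    return dh + dw


def text_informed_contrastive_loss(text_embed, salient_embed, temperature=0.07):
    """Assumed InfoNCE-style objective: pull each text embedding toward the pooled
    embedding of its own salient image tokens, push it away from others in the batch."""
    t = F.normalize(text_embed, dim=-1)
    v = F.normalize(salient_embed, dim=-1)
    logits = t @ v.t() / temperature
    targets = torch.arange(t.size(0), device=t.device)
    return F.cross_entropy(logits, targets)
```

In such a sketch the hard top-k selection would be usable at inference, while training would typically substitute a differentiable relaxation so the pruners can be learned end-to-end; the paper's exact mechanism may differ.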