FILIP: Fine-grained Interactive Language-Image Pre-Training

Lewei Yao; Runhui Huang; Lu Hou; Guansong Lu; Minzhe Niu; Hang Xu; Xiaodan Liang; Zhenguo Li; Xin Jiang; Chunjing Xu

FILIP: Fine-grained Interactive Language-Image Pre-Training

Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, Chunjing Xu

Published: 28 Jan 2022, Last Modified: 22 Jun 2025ICLR 2022 PosterReaders: Everyone

Keywords: Visual-language pretraining, Language-Image Pretraining, Multi-modality model

Abstract: Unsupervised large-scale vision-language pre-training has shown promising advances on various downstream tasks. Existing methods often model the cross-modal interaction either via the similarity of the global feature of each modality which misses sufficient information, or finer-grained interactions using cross/self-attention upon visual and textual tokens. However, cross/self-attention suffers from inferior efficiency in both training and inference. In this paper, we introduce a large-scale Fine-grained Interactive Language-Image Pre-training (FILIP) to achieve finer-level alignment through a cross-modal late interaction mechanism, which uses a token-wise maximum similarity between visual and textual tokens to guide the contrastive objective. FILIP successfully leverages the finer-grained expressiveness between image patches and textual words by modifying only contrastive loss, while simultaneously gaining the ability to pre-compute image and text representations offline at inference, keeping both large-scale training and inference efficient. Furthermore, we construct a new large-scale image-text pair dataset called FILIP300M for pre-training. Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks including zero-shot image classification and image-text retrieval. The visualization on word-patch alignment further shows that FILIP can learn meaningful fine-grained features with promising localization ability.

One-sentence Summary: We introduce a large-scale Fine-grained Interacitve Language-Image Pretraining (FILIP) to achieve finer-level alignment through a new cross-modal late interaction mechanism, which can boost the performance on more grounded vision and language tasks.

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/filip-fine-grained-interactive-language-image/code)

21 Replies

Loading