Temporal Modeling With Frozen Vision-Language Foundation Models for Parameter-Efficient Text-Video Retrieval
Abstract: Temporal modeling plays an important role in the effective adaption of the powerful pretrained text–image foundation model into text–video retrieval. However, existing methods often rely on additional heavy trainable modules, such as transformer or BiLSTM, which are inefficient. In contrast, we avoid introducing such heavy components by leveraging frozen foundation models. To this end, we propose temporal modeling with frozen vision–language foundation models (TFVL) to model the temporal dynamics with fixed encoders. Specifically, text encoder temporal modeling (TextTemp) and image encoder temporal modeling (ImageTemp) apply frozen text and image encoders within the video head and video backbone, respectively. TextTemp uses a frozen text encoder to interpret frame representations as “visual words” within a temporal “sentence,” capturing temporal dependencies. On the other hand, ImageTemp uses a frozen image encoder to treat all frame tokens as a unified visual entity, learning spatiotemporal information. The total trainable parameters of our method, comprising a lightweight projection and several prompt tokens, are significantly fewer than those in other existing methods. We evaluate the effectiveness of our method on MSR-VTT, DiDeMo, ActivityNet, and LSMDC. Compared with full fine-tuning on MSR-VTT, our TFVL achieves an average 3.25% gain in R@1 with merely 0.35% of the parameters. Extensive experiments demonstrate that the proposed TFVL outperforms state-of-the-art methods with significantly fewer parameters.
External IDs:dblp:journals/tnn/ShenHHZLZHD25
Loading