SpaceCLIP: A Vision-Language Pretraining Framework With Spatial Reconstruction On Text

Published: 01 Jan 2023, Last Modified: 05 Nov 2023. ACM Multimedia 2023.
Abstract: The tremendous progress of vision-to-language retrieval in recent years has been fueled by contrastive vision-language pretraining (VLP), such as CLIP. However, contrastive methods do not exhibit the same level of performance on other downstream tasks (e.g., video question answering and natural language grounding). One possible reason is that they ignore the misalignment between vision and language, especially the absence of spatial information in language. To mitigate this issue, we start from a new perspective and propose a contrastive VLP framework with spatial reconstruction on text (SpaceCLIP). Specifically, we introduce a unique reconstruction method that assigns text representations into the same spatial structure as images or videos, together with a pretraining objective, SpatialNCE, which reduces the computational overhead and ensures performance on downstream tasks. Empirically, we show SpaceCLIP outperforms other methods, with performance gains ranging from 2.1% to 9.0% on MSRVTT and EgoCLIP multiple-choice question answering, 2.5% to 11.0% on EPIC-KITCHENS-100 and MSRVTT multi-instance retrieval, and 0.31% to 7.2% on the Ego4D natural language query benchmark.
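The abstract names SpatialNCE but does not give its formulation. For orientation, below is a minimal sketch of the standard symmetric InfoNCE objective used in CLIP-style contrastive pretraining, which an objective like SpatialNCE would presumably extend with the reconstructed spatial structure of the text representations; the function name, tensor shapes, and temperature value are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(image_embeds: torch.Tensor,
                  text_embeds: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_embeds, text_embeds: (batch, dim) outputs of the two encoders.
    Matching pairs share the same batch index; all other pairs in the
    batch serve as negatives.
    """
    # Normalize so that dot products become cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```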