Abstract: Text-to-image diffusion models have demonstrated impressive generative capabilities, indicating that they internalize substantive image-text representations. While these models have shown promising results, their potential in downstream discriminative applications remains largely uncharted. In this paper, we investigate the capabilities of these diffusion models and improve the efficiency of using them as zero-shot vision-and-language learners. To this end, we introduce a novel hierarchical sampling strategy that significantly reduces the computational demands of these zero-shot diffusion models, making them faster and more feasible for real-world applications. Our work showcases the potential of text-to-image diffusion models as powerful tools for zero-shot image-text matching and sets the stage for more practical and effective applications of these models in real-world settings.
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Diffusion Models
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings / efficiency
Languages Studied: English
Submission Number: 1337