Abstract: Text-to-image diffusion models have demonstrated impressive generative capabilities, indicating that they internalize substantive image-text representations. While these models have shown promising results, their potential in downstream discriminative applications remains largely uncharted. In this paper, we investigate the capabilities of these diffusion models and improve the efficiency of using them as zero-shot vision-and-language learners. To this end, we introduce a novel hierarchical sampling strategy that significantly reduces the computational demands of these zero-shot diffusion models, making them faster and more feasible for real-world applications. Our work showcases the potential of text-to-image diffusion models as powerful tools for zero-shot image-text matching and sets the stage for more practical and effective applications of these models in real-world settings.
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Diffusion Models
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings / efficiency
Languages Studied: English
Submission Number: 1337