HieRD: Hierarchical Relational Distillation for Vision-Language Embedding Models

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, ICML 2026 regular, CC BY 4.0
TL;DR: We introduce HieRD, a hierarchical distillation framework for Vision–Language Models that aligns clustered visual tokens with multi-granular text to transfer object-level semantics and cross-modal structure for efficient multimodal embedding.
Abstract: Knowledge distillation is crucial for compressing large Vision–Language Models (VLMs) into efficient architectures. While prior VLM research has primarily focused on reasoning tasks such as visual question answering, multimodal embedding learning, a key component for large-scale retrieval, has received comparatively less attention. Existing distillation methods typically align static global representations, overlooking hierarchical feature structure and fine-grained cross-modal interactions. This leads to a structural gap in which student models fail to inherit object-level semantics and spatial relationships from their teachers. To address this limitation, we propose **HieRD**, a Hierarchical Relational Distillation framework that preserves hierarchical structure within and across modalities throughout the distillation process by leveraging clustered visual tokens and multi-granular alignment with phrase-level text. Experimental results on multimodal embedding and downstream tasks show that HieRD consistently outperforms strong baselines, reflecting the effectiveness of its fine-grained semantic and spatial modeling, while enabling compact and efficient embedding models.
Lay Summary: Many modern AI systems can understand both images and text, but the strongest models are often too large and expensive to use in real-world applications. This paper introduces HieRD, a method for training smaller vision-language models by helping them learn from larger ones more effectively. Instead of asking the smaller model to simply copy the final answers of the larger model, HieRD teaches it to preserve meaningful relationships between parts of an image and parts of a sentence. For example, it encourages the model to connect visual regions such as objects with related phrases in the text. This leads to smaller models that are more efficient while still performing well on tasks such as image-text search, image classification, and visual question answering. Our experiments show that HieRD consistently improves compact models compared with existing training methods.
Originally Submitted Supplementary Material: zip
Primary Area: Deep Learning->Other Representation Learning
Keywords: Vision Language Models, Knowledge Distillation, Multimodal Representation Learning
Originally Submitted PDF: pdf
Submission Number: 11657