Dual encoder (DE) models, in which a matching query and document are embedded into similar vector representations, are widely used in information retrieval due to their efficiency and scalability. However, DEs are known to have limited expressive power due to the Euclidean geometry of the embedding space, which may compromise their quality. This paper investigates such limitations in the context of \emph{hierarchical retrieval}, the task in which the document set has a hierarchical structure and the matching documents for a query are all of its ancestor nodes. We first prove the feasibility of representing hierarchical structures within a Euclidean embedding space by providing a constructive algorithm that generates effective embeddings from a given hierarchy. We then study the learning of DEs when the hierarchy is unknown, a practical setting since usually only samples of matching query-document pairs are available during training. Our experiments reveal a "lost in the long distance" phenomenon, where retrieval accuracy degrades for documents farther away in the hierarchy. To address this, we introduce a pretrain-finetune approach that significantly improves long-distance retrieval without sacrificing performance on closer documents. Finally, we validate our findings on a realistic hierarchy from WordNet, demonstrating the effectiveness of our approach in retrieving documents at various levels of abstraction.
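To make the retrieval setup concrete, the sketch below illustrates dual-encoder scoring on a toy hierarchy: documents and queries share one Euclidean embedding space, and a query is matched against documents by inner product. The embeddings here are hand-picked for illustration (not learned, and not the paper's constructive algorithm); the document names and vectors are assumptions made for the example.

```python
import numpy as np

# Toy document set forming an ancestor chain in a hierarchy:
# "animal" (root) -> "mammal" -> "dog" (leaf).
doc_names = ["animal", "mammal", "dog"]
doc_embs = np.array([
    [1.0, 0.0],   # "animal" (root)
    [1.0, 0.5],   # "mammal"
    [1.0, 1.0],   # "dog" (leaf)
])

# In hierarchical retrieval, a query about "dog" should retrieve
# the leaf and all of its ancestors.
query_emb = np.array([1.0, 0.8])

# Dual-encoder scoring: inner product between query and document embeddings.
scores = doc_embs @ query_emb
ranking = [doc_names[i] for i in np.argsort(-scores)]
print(ranking)  # most similar first
```

With these hand-picked vectors the leaf scores highest and the root lowest, hinting at the "lost in the long distance" effect: documents farther up the hierarchy receive weaker similarity scores.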