Dual encoder (DE) models, in which a matching query and document are embedded into similar vector representations, are widely used in information retrieval due to their efficiency and scalability. However, DEs are known to have limited expressive power due to the Euclidean geometry of the embedding space, which may compromise their quality. This paper investigates such limitations in the context of \emph{hierarchical retrieval}, the task in which the document set has a hierarchical structure and the matching documents for a query are all of its ancestor nodes. We first prove the feasibility of representing hierarchical structures within a Euclidean embedding space by providing a constructive algorithm that generates effective embeddings from a given hierarchy. We then study the learning of DEs when the hierarchy is unknown, a practical setting since usually only samples of matching query-document pairs are available during training. Our experiments reveal a "lost in the long distance" phenomenon, where retrieval accuracy degrades for documents farther away in the hierarchy. To address this, we introduce a pretrain-finetune approach that significantly improves long-distance retrieval without sacrificing performance on closer documents. Finally, we validate our findings on a realistic hierarchy from WordNet, demonstrating the effectiveness of our approach in retrieving documents at various levels of abstraction.
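To make the retrieval setup concrete, the sketch below illustrates dual-encoder scoring on a toy hierarchy: documents and queries share one Euclidean embedding space, and a query is matched against documents by inner product. The embeddings here are hand-picked for illustration (not learned, and not the paper's constructive algorithm); the document names and vectors are assumptions made for the example.

```python
import numpy as np

# Toy document set forming an ancestor chain in a hierarchy:
# "animal" (root) -> "mammal" -> "dog" (leaf).
doc_names = ["animal", "mammal", "dog"]
doc_embs = np.array([
    [1.0, 0.0],   # "animal" (root)
    [1.0, 0.5],   # "mammal"
    [1.0, 1.0],   # "dog" (leaf)
])

# In hierarchical retrieval, a query about "dog" should retrieve
# the leaf and all of its ancestors.
query_emb = np.array([1.0, 0.8])

# Dual-encoder scoring: inner product between query and document embeddings.
scores = doc_embs @ query_emb
ranking = [doc_names[i] for i in np.argsort(-scores)]
print(ranking)  # most similar first
```

With these hand-picked vectors the leaf scores highest and the root lowest, hinting at the "lost in the long distance" effect: documents farther up the hierarchy receive weaker similarity scores.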