Towards Understanding Domain Adapted Sentence Embeddings for Document Retrieval

27 Sept 2024 (modified: 05 Feb 2025)Submitted to ICLR 2025EveryoneRevisionsBibTeXCC BY 4.0
Keywords: Sentence Embeddings, Question Answering, Technical Domains, Retrieval Augmented Generation, Embedding Models, Isotropy
TL;DR: The paper presents challenges in selecting embeddings for technical domains, metrics for domain adaptation and the impact of fine-tuning on accuracy and confidence intervals, and lack of correlation of isotropy of embeddings & retriever performance.
Abstract: A plethora of sentence embedding models makes it challenging to choose one, especially for technical domains rich with specialized vocabulary. In this work, we domain adapt embeddings using telecom, health and science datasets for question answering. We evaluate embeddings obtained from publicly available models and their domain-adapted variants, on both point retrieval accuracies, as well as their (95\%) confidence intervals. We establish a systematic method to obtain thresholds for similarity scores for different embeddings. As expected, we observe that fine-tuning improves mean bootstrapped accuracies. We also observe that it results in tighter confidence intervals, which further improve when pre-training is preceded by fine-tuning. We introduce metrics which measure the distributional overlaps of top-$K$, correct and random document similarities with the question. Further, we show that these metrics are correlated with retrieval accuracy and similarity thresholds. Recent literature shows conflicting effects of isotropy on retrieval accuracies. Our experiments establish that the isotropy of embeddings (as measured by two independent state-of-the-art isotropy metric definitions) is poorly correlated with retrieval performance. We show that embeddings for domain-specific sentences have little overlap with those for domain-agnostic ones, and fine-tuning moves them further apart. Based on our results, we provide recommendations for use of our methodology and metrics by researchers and practitioners.
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10600
Loading

OpenReview is a long-term project to advance science through improved peer review with legal nonprofit status. We gratefully acknowledge the support of the OpenReview Sponsors. © 2025 OpenReview