Bias Learning: Quantifying and Mitigating Position Sensitivity in Text Embeddings

Samarth Goel; Reagan Lee; Kannan Ramchandran

28 Sept 2024 (modified: 05 Feb 2025) | Submitted to ICLR 2025 | CC BY 4.0

Keywords: Deep Learning or Neural Networks, Similarity and Distance Learning, (Application) Information Retrieval Regression, (Cognitive/Neuroscience) Language, (Other) Statistics

TL;DR: We demonstrate the existence of positional biases in text embedding models and investigate data augmentation methods to address these effects.

Abstract: Embedding models are crucial for Information Retrieval (IR) and semantic similarity measurement, yet their handling of longer texts and the associated positional biases remains underexplored. In this study, we investigate the impact of content position and input size on text embeddings. Our experiments reveal that embedding models, particularly APE- and RoPE-based models, disproportionately prioritize the initial portion of the input. Ablation studies demonstrate that inserting irrelevant text at the start of a document, or removing text from it, reduces cosine similarity between the altered and original embeddings by up to 12.3% more than the same ablations at the end. Regression analysis further confirms this bias: sentence importance declines as position moves further from the start, even under content-agnostic conditions. We hypothesize that this effect arises from pre-processing strategies and the chosen positional encoding techniques. To address this, we introduce a novel data augmentation scheme called Position-Aware Data Sampling (PADS), which mitigates positional bias and improves embedding robustness across varying input lengths. These findings quantify the sensitivity of retrieval systems and suggest a new lens on long-context embedding models.
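
The ablation measurement described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: `toy_embed` is a hypothetical stand-in embedder that deliberately down-weights later sentences (an assumed `decay` factor), used only so the measurement has something to run on, and `remove_sentences`, `insert_noise`, and `ablation_similarity` are illustrative names. In practice `toy_embed` would be replaced by a call to a real embedding model.

```python
import math


def toy_embed(text, decay=0.5):
    """Hypothetical stand-in embedder (NOT a real model).

    Returns a sparse bag-of-words vector in which every word of the
    i-th sentence contributes decay**i, so earlier sentences dominate.
    The front-weighting is built in on purpose, for illustration only.
    """
    vec = {}
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    for i, sentence in enumerate(sentences):
        for word in sentence.lower().split():
            vec[word] = vec.get(word, 0.0) + decay ** i
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def remove_sentences(sentences, position, k=1):
    """Drop k sentences from the 'start' or the 'end' of the document."""
    if position == "start":
        return sentences[k:]
    if position == "end":
        return sentences[:-k]
    raise ValueError("position must be 'start' or 'end'")


def insert_noise(sentences, position, noise):
    """Insert one irrelevant sentence at the 'start' or the 'end'."""
    if position == "start":
        return [noise] + sentences
    if position == "end":
        return sentences + [noise]
    raise ValueError("position must be 'start' or 'end'")


def ablation_similarity(original, altered, embed):
    """Cosine similarity between embeddings of the original and altered text."""
    join = lambda sents: ". ".join(sents) + "."
    return cosine(embed(join(original)), embed(join(altered)))


# Illustrative document: four sentences with disjoint vocabularies.
doc = [
    "alpha beta gamma",
    "delta epsilon zeta",
    "eta theta iota",
    "kappa lambda mu",
]
noise = "lorem ipsum dolor"

removal_start = ablation_similarity(doc, remove_sentences(doc, "start"), toy_embed)
removal_end = ablation_similarity(doc, remove_sentences(doc, "end"), toy_embed)
insert_start = ablation_similarity(doc, insert_noise(doc, "start", noise), toy_embed)
insert_end = ablation_similarity(doc, insert_noise(doc, "end", noise), toy_embed)

print("removal:   start", removal_start, "end", removal_end)
print("insertion: start", insert_start, "end", insert_end)
```

A lower similarity after an ablation at the start than after the same ablation at the end is the signature of the position sensitivity the abstract reports. With the toy embedder this gap follows directly from the assumed decay, so the numbers it prints say nothing about any real model.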

Supplementary Material: zip

Primary Area: interpretability and explainable AI

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 12736
