What is in a name? Mitigating Name Bias in Text Embedding Similarity via Anonymization

Published: 30 Jun 2025, Last Modified: 03 Apr 2026 | ACL Findings | Everyone | CC BY 4.0
Abstract: Text-embedding models are often used to measure similarity between texts via cosine similarity in a variety of tasks. Most models exhibit biases arising from the data on which they are trained. In this paper, we examine a hitherto unexplored bias in text embeddings in similarity tasks: bias arising from the presence of names, such as those of persons, locations, and organizations, in the text. Our study shows how name bias in text-embedding models can lead to erroneous conclusions in the assessment of thematic similarity. Text embeddings can mistakenly indicate similarity between texts based on the names in the text, even when their actual semantic content is not similar, or indicate dissimilarity simply because of the names in the text, even when the texts match semantically. We first demonstrate the presence of name bias in different text-embedding models and then propose text anonymization during inference, which involves removing references to names while preserving the core theme of the text. The efficacy of the anonymization approach is demonstrated on three downstream NLP tasks involving embedding similarities, achieving significant performance gains. Our simple and training-optimization-free approach offers a practical and easily implementable solution to mitigate name bias. The code of our work can be found at https://github.com/sahilm1992/name_bias.
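The idea of anonymizing text before computing embedding similarity can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, `anonymize` simply deletes names from a hand-supplied list (`NAMES`, also an assumption), and the two example sentences are invented for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def anonymize(text, names):
    """Crude anonymization: delete each listed name, then tidy whitespace.

    The name list is assumed to be given; a real pipeline would have to
    detect names automatically.
    """
    for name in sorted(names, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(name) + r"\b", "", text)
    return re.sub(r"\s+", " ", text).strip()


# Invented example: same names, different themes.
NAMES = ["Alice", "Paris", "Acme"]
text_a = "Alice from Acme in Paris won the chess tournament"
text_b = "Alice from Acme in Paris repaired the broken engine"

sim_before = cosine(toy_embed(text_a), toy_embed(text_b))
sim_after = cosine(
    toy_embed(anonymize(text_a, NAMES)),
    toy_embed(anonymize(text_b, NAMES)),
)

print(f"similarity with names:    {sim_before:.3f}")
print(f"similarity without names: {sim_after:.3f}")
```

In this toy setting the shared names inflate the similarity of two thematically unrelated sentences, and removing them lowers it. Plain deletion leaves ungrammatical text, so it should be read only as an illustration of the effect, not as a way of preserving the core theme of the text.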