Abstract: Text-embedding models are widely used to measure similarity between texts, typically via cosine similarity, in a variety of tasks. Most models exhibit biases arising from the data on which they
are trained. In this paper, we examine a
hitherto unexplored bias in text-embeddings in
similarity tasks: bias arising from the presence
of names of persons, locations, organizations, etc., in the text. Our study shows how
the presence of name-bias in text-embedding
models can potentially lead to erroneous conclusions in the assessment of thematic similarity. Text embeddings can mistakenly indicate similarity between texts that share names
even when their actual semantic content differs,
or indicate dissimilarity between texts
that contain different names even when
the texts match semantically. We first demonstrate the presence of
name bias in different text-embedding models
and then propose text anonymization during inference, which involves removing references
to names while preserving the core theme of
the text. The efficacy of the anonymization approach is demonstrated on three downstream
NLP tasks involving embedding similarities,
achieving significant performance gains. Our
simple and training-optimization-free approach
offers a practical and easily implementable solution to mitigate name bias. The code of our
work can be found at https://github.com/sahilm1992/name_bias.
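
The idea of anonymizing text before computing embedding similarity can be illustrated with a minimal sketch. It is not the implementation from the repository above: the bag-of-words `embed` function is a toy stand-in for a real text-embedding model, and the fixed `NAMES` set is a hypothetical replacement for a proper named-entity recognizer. All function names here are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical name list; a real pipeline would detect names with an NER model.
NAMES = {"alice", "bob", "carol", "paris", "rome"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def anonymize(text, names=NAMES):
    """Remove tokens that appear in the name list, keeping the other words."""
    return " ".join(t for t in tokenize(text) if t not in names)


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a real model)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity(x, y, anonymized=False):
    """Similarity of two texts, optionally anonymizing both at inference time."""
    if anonymized:
        x, y = anonymize(x), anonymize(y)
    return cosine(embed(x), embed(y))


# Same names, different themes: shared names inflate the raw score.
same_names = ("Alice and Bob married in Paris",
              "Alice and Bob robbed a bank in Paris")
# Same theme, different names: differing names deflate the raw score.
same_theme = ("Alice founded a bakery in Paris",
              "Carol founded a bakery in Rome")

for pair in (same_names, same_theme):
    print(similarity(*pair), similarity(*pair, anonymized=True))
```

In this toy setting, anonymization lowers the score for the pair that shares only names and raises it for the pair that shares only a theme, which mirrors the two failure modes described in the abstract.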