False Friends or Cognates? A Cross-lingual Semantic Ambiguity Evaluation for Galician, Portuguese and Spanish
Keywords: false friends, multilingual evaluation, cross-lingual semantics
Abstract: The linguistic proximity between Galician, Portuguese, and Spanish results in a lexical overlap that often conceals semantic interference. This is particularly evident in false friends, which pose a challenge for NLP systems.
In this work, we assess whether state-of-the-art language models can identify and process false friends among these languages. We introduce six cross-lingual datasets (created semi-automatically or manually, and all carefully verified) covering cognates and false friends. We evaluate a broad range of encoder and decoder models of varying sizes in zero-shot and few-shot settings.
Our results show that closed-weight models achieve the highest accuracy and medium-sized models offer a strong balance between accuracy and efficiency, while smaller models struggle with systematic errors and biases.
Notably, we find that linguistic proximity itself introduces errors: models tend to perform worse on closely related language pairs, reflecting the difficulty of semantic discrimination when lexical overlap is high.
Paper Type: Long
Research Area: Semantics: Lexical, Sentence-level Semantics, Textual Inference and Other areas
Research Area Keywords: lexical relationships, semantic textual similarity, natural language inference, polysemy, word embeddings
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: Galician, Portuguese, Spanish
Submission Number: 5921