Abstract: We analyze the ability of LLMs to answer comparison questions (e.g., "Which is longer, the Danube or the Nile?"). Our central observation is that LLMs often make mistakes when answering such questions, even when they have the required knowledge (e.g., the length of the rivers involved). We furthermore find that their predictions are heavily influenced by superficial biases, such as the position of the entities in the question, their relative popularity, and shallow co-occurrence statistics. These findings suggest that simple prompting-based strategies may not leverage the ranking abilities of LLMs to their full potential, and that LLMs continue to struggle with even simple reasoning tasks.
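The position-bias probe implied by the abstract can be sketched as follows: ask the same comparison question with the entity order swapped and check whether the model's answer is consistent. This is only a minimal illustration, not the paper's actual setup; `query_llm` is a hypothetical placeholder for whatever model API is used.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with an actual model or API."""
    raise NotImplementedError


def is_consistent(entity_a: str, entity_b: str, relation: str = "longer") -> bool:
    """Return True if the model names the same winner regardless of entity order."""
    q1 = f"Which is {relation}, {entity_a} or {entity_b}? Answer with one name."
    q2 = f"Which is {relation}, {entity_b} or {entity_a}? Answer with one name."
    a1 = query_llm(q1).strip()
    a2 = query_llm(q2).strip()
    # A position-biased model may flip its answer when the entities are swapped.
    return a1 == a2


# Example usage (with a real query_llm in place):
# is_consistent("the Danube", "the Nile")
```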
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: data shortcuts/artifacts
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5723