ALF: A Fine-Grained French Analogical Dataset for Evaluating Lexical Knowledge of Large Language Models
Abstract: The undeniable revolution brought forth by Large Language Models (LLMs) stems from the remarkable fluency of the texts they generate, mastering language with seemingly human-like finesse. This fluency raises a key scientific question: How much lexical knowledge do LLMs actually capture in order to produce such fluent language? To address this, we present ALF, a freely available analogical dataset endowed with rich lexicographic information grounded in Meaning-Text Theory for the French language. It comprises 2600 fine-grained lexical analogies, with which we evaluate the lexical ability of five off-the-shelf LLMs, namely ChatGPT-4o mini, Llama3.0-8B, Llama3.1-8B, Qwen2.5-14B, and Mistral7B. Their performance ranges from 45% for Mistral7B, through about 55% for the ChatGPT and Llama models, up to nearly 60% for Qwen2.5-14B, qualifying ALF as a challenging dataset. Experimenting with larger models (OpenAI o1, Llama3.0/3.1-70B, and Qwen2.5-32B) yields rather limited returns considering the drastic increase in computational cost. We further identify certain types of analogies and prompting methods that reveal performance disparities.
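The evaluation described above can be illustrated with a minimal sketch of four-term analogy scoring (a : b :: c : ?). Note that the prompt wording, the tuple format, and the `answer_fn` interface here are illustrative assumptions, not ALF's actual schema or the paper's prompting method.

```python
# Hypothetical sketch of accuracy-based analogy evaluation.
# Dataset format and prompt template are assumptions for illustration only.

def format_prompt(a: str, b: str, c: str) -> str:
    """Build a four-term analogy prompt (a : b :: c : ?)."""
    return f'"{a}" is to "{b}" as "{c}" is to:'

def accuracy(items, answer_fn) -> float:
    """Fraction of analogies where the model's answer matches the gold term.

    items: iterable of (a, b, c, gold) tuples.
    answer_fn: callable mapping a prompt string to the model's answer
               (in practice, a call to an LLM).
    """
    correct = 0
    total = 0
    for a, b, c, gold in items:
        pred = answer_fn(format_prompt(a, b, c))
        correct += int(pred.strip().lower() == gold.strip().lower())
        total += 1
    return correct / total if total else 0.0

# Toy run with a mock "model" that always answers "reine" (queen).
items = [("roi", "reine", "acteur", "actrice"),
         ("homme", "femme", "roi", "reine")]
print(accuracy(items, lambda prompt: "reine"))  # → 0.5
```

Real evaluations would additionally have to handle near-synonyms and morphological variants in the model's free-form output, which exact string matching does not capture.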
External IDs: dblp:conf/ecai/PetrovVLLL25