Abstract: Analogical reasoning in language models is a critical yet underexplored capability, particularly as models grow in parameter count and training data. This work investigates the limitations of current models in inferring latent relational structures, focusing on lexical analogies. We introduce LAMBDA, a novel dataset of 3,000 relation-hidden lexical analogies spanning synonyms, antonyms, and derivational transformations, designed for two-shot induction. Our empirical evaluation of eight models, four open-source models ranging from 0.1B to 17B parameters and four commercial models, reveals a wide performance gap, with accuracies ranging from 0.3% to 46.4%, underscoring the difficulty of systematic generalization. By analyzing error patterns such as identity echo and semantic drift, we identify characteristic model weaknesses. These findings suggest that large-scale pre-training alone does not guarantee strong relational reasoning, and they provide a foundation for targeted improvements in model design and training methodology aimed at enhancing analogical abstraction in language models.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~antonio_vergari2
Submission Number: 5326