LAMBDA: Assessing Few-shot Lexical Analogical Reasoning in Language Models

22 Jan 2026 (modified: 10 Apr 2026) · Under review for TMLR · CC BY 4.0
Abstract: Analogical reasoning in language models is a critical yet underexplored aspect of their capability, particularly as models grow in scale and training data. This work investigates the limitations of current models in inferring latent relational structures, focusing on lexical analogies. We introduce LAMBDA, a novel dataset of 3,000 relation-hidden lexical analogies spanning synonyms, antonyms, and derivational transformations, designed for two-shot induction. Our empirical evaluation of nine models, comprising four open-source models ranging from 0.1B to 17B parameters and five commercial models, reveals a wide performance gap, with accuracies ranging from 0.3% to 49.3%, highlighting the challenge of systematic generalization. By analyzing error patterns such as identity echo and semantic drift, we provide insights into model weaknesses. Our findings suggest that large-scale pre-training alone does not guarantee strong relational reasoning abilities. These results identify a clear gap between lexical knowledge and reliable relation induction, and they provide a concrete target for future work on analogical abstraction in language models.
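To make the two-shot, relation-hidden setup and the answer-set scoring rule (discussed in item 3 of the changes list below) concrete, here is a minimal illustrative sketch. The prompt wording, the word pairs, and the `normalize` behavior are hypothetical assumptions for illustration, not the exact LAMBDA template or scoring implementation; the paper's appendix defines the real ones.

```python
# Illustrative sketch only: prompt format and word pairs are hypothetical,
# not the exact LAMBDA template (see the paper's appendix for the real one).

def build_two_shot_prompt(demo_pairs, query_word):
    """Two demonstrations of a hidden relation, then a query to complete."""
    lines = [f"{a} -> {b}" for a, b in demo_pairs]
    lines.append(f"{query_word} ->")
    return "\n".join(lines)

def normalize(answer):
    """Minimal normalization: strip surrounding whitespace/punctuation, lowercase."""
    return answer.strip().strip(".,!?").lower()

def score(prediction, answer_set):
    """Accept any valid member of the answer set, not just the sampled target."""
    return int(normalize(prediction) in {normalize(a) for a in answer_set})

# Hypothetical antonymy item: the relation itself is never named in the prompt.
prompt = build_two_shot_prompt([("hot", "cold"), ("tall", "short")], "early")
print(prompt)                    # hot -> cold / tall -> short / early ->
print(score("Late", {"late"}))   # 1
print(score("early", {"late"}))  # 0 (an identity-echo error)
```

The final case mirrors the identity-echo error pattern mentioned in the abstract, where the model repeats the query word instead of applying the induced relation.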
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=xsRxxm11pS
Changes Since Last Submission:
1. **Abstract and introduction revisions.** The closing claim of the abstract was updated to focus more directly on the gap between lexical knowledge and reliable relation induction. In the introduction, the benchmark framing was tightened, the example analogies were retained, and the early description of the paper's empirical contribution was clarified.
2. **Related Work expansion and repositioning.** The Related Work section was substantially expanded, especially around analogy in cognitive science, lexical analogy benchmarks, and analogy behavior in language models. The revised version adds a broader discussion of analogy transfer, robustness, and few-shot relational induction, and it positions the benchmark more explicitly relative to prior work on analogy datasets and in-context reasoning.
3. **Problem setup and methodology clarification.** The task definition was rewritten to make the relation sets, dataset construction process, and scoring rule more explicit. The revised version now distinguishes more clearly between sampling one valid target during generation and accepting any valid member of the answer set during evaluation (see the illustrative sketch above). The prompt construction, inference protocol, and scoring subsections were also updated to make the normalization and uncertainty computations clearer.
4. **Model-evaluation protocol details were added.** The description of inference was expanded to explain the GPT-5.2 configuration more carefully and to clarify how it fits within the common evaluation setup. The uncertainty discussion now states explicitly that results are summarized with binomial standard errors and the corresponding 95% Wald intervals (the standard form is sketched after this list). We additionally corrected the earlier issue of reporting a single fixed confidence interval.
5. **Human evaluation and comparison text were strengthened.** The discussion of the human subset was expanded to report the subset size, relation-wise scores, and interval half-widths correctly.
6. **Auxiliary subset experiments were added and expanded.** We ran additional experiments with multi-token outputs and varying temperatures, and added an auxiliary evaluation table for model performance on the 300-item human subset, covering single-word and multi-token decoding, and repeated nonzero-temperature runs where applicable.
7. **Clarified answer-format sensitivity.** The revised paper now reports how accuracy changes when multi-token outputs are allowed, rather than only noting that the single-word protocol can exclude some valid answers. This directly addresses the role of output constraints in the reported results.
8. **Discussion of relation-wise behavior and scaling.** The discussion section was revised to better connect the observed performance differences across synonymy, antonymy, and derivation to part-of-speech balance, ambiguity, and morphological variation. The scaling discussion was also sharpened to emphasize that presumed larger model size does not by itself predict stronger lexical-analogy performance.
9. **Limitations and future directions were broadened.** The limitations section was expanded to cover English-only scope, WordNet-based lexical coverage, possible pretraining overlap with lexical resources, and the consequences of strict single-token scoring. The future-directions discussion was also extended to include freer-form evaluation, broader lexical resources, multilingual extensions, and stronger human evaluation.
10. **Supporting material was better connected to the main text.** The main paper now points more directly to the appendix material on token lengths, part-of-speech distributions, candidate-set sizes, prompt templates, and generation scripts, tying the supporting analyses more directly to the paper's central claims. We additionally removed unnecessary figures and diagrams where they did not actively support the paper's content.
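For reference on item 4 above, the binomial standard error and 95% Wald interval have the standard form below, where \(\hat{p}\) is the observed accuracy and \(n\) the number of evaluated items. This is a sketch of the standard formula the revision appears to refer to, not an excerpt from the paper.

```latex
% Standard binomial standard error and 95% Wald interval for an observed
% accuracy \hat{p} over n items (sketch of the quantities named in item 4).
\[
  \mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}},
  \qquad
  \hat{p} \;\pm\; z_{0.975}\,\mathrm{SE}(\hat{p}),
  \qquad z_{0.975} \approx 1.96 .
\]
```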
Assigned Action Editor: ~antonio_vergari2
Submission Number: 7102