Do LLM Semantic Judgments Follow Distributional Geometry? Prompt and Model Effects in Controlled Category Probes

ACL ARR 2026 January Submission10358 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: lexical relationships, word embeddings, semantic textual similarity, natural language inference, textual entailment
Abstract: Embeddings and next-token probabilities are often treated as interchangeable proxies for semantic knowledge in large language models (LLMs), but they need not reflect the same underlying structure. We present a controlled test of when representational geometry aligns with semantic behavior. We construct a counterbalanced matched-alternative dataset in which each probe is evaluated against a related and unrelated candidate under identical formatting, enabling within-item comparisons across relation type (cohyponym vs. superordinate), distractor difficulty (easy vs. hard, defined by semantic confusability), prompting regimes, and model. We evaluate five LLMs using two scoring lenses: logprob preference and static embedding similarity. Across models, logprob discrimination is uniformly high with modest difficulty penalties. By contrast, embedding similarity systematically underestimates superordinate relations, with the gap largest under hard distractors, while cohyponym judgments remain near ceiling. Metaprompt scaffolding produces larger shifts than prompt wording. Overall, behavior–geometry alignment is relation- and difficulty-dependent, cautioning against treating embedding similarity as a model-agnostic measure of semantic knowledge.
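The two scoring lenses named in the abstract can be sketched minimally as follows. This is an illustrative assumption of how such per-item scoring typically works, not the authors' released code: the function names and the toy vectors standing in for model outputs are hypothetical.

```python
import numpy as np

def logprob_preference(lp_related: float, lp_unrelated: float) -> bool:
    """Logprob lens: the model discriminates the item if it assigns a
    higher log-probability to the related candidate than to the
    unrelated one under identical prompt formatting."""
    return lp_related > lp_unrelated

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two static embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def embedding_preference(probe: np.ndarray,
                         related: np.ndarray,
                         unrelated: np.ndarray) -> bool:
    """Embedding lens: the geometry discriminates the item if the probe
    embedding is closer to the related candidate than to the distractor."""
    return cosine(probe, related) > cosine(probe, unrelated)

# Toy vectors in place of actual model embeddings (illustrative only).
probe = np.array([1.0, 0.0, 0.0])
related = np.array([0.9, 0.1, 0.0])
unrelated = np.array([0.0, 1.0, 0.0])
print(embedding_preference(probe, related, unrelated))  # True
print(logprob_preference(-1.2, -3.5))                   # True
```

Because each probe is paired with a related and an unrelated candidate under identical formatting, both lenses reduce to a binary within-item comparison, which is what makes the relation-type and difficulty contrasts reported above directly comparable across the two lenses.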
Paper Type: Long
Research Area: Semantics: Lexical, Sentence-level Semantics, Textual Inference and Other areas
Research Area Keywords: lexical relationships, word embeddings, semantic textual similarity, natural language inference, textual entailment
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data analysis, Theory
Languages Studied: English
Submission Number: 10358