Lexical Sophistication and Zero-Shot Topic Modeling: Examining the Intersection of Word Choice and NLP Performance
Abstract: Lexical choice, the selection of specific words to convey meaning, plays a crucial role in both human communication and natural language processing (NLP). While traditional topic modeling methods such as Latent Dirichlet Allocation (LDA) rely on word frequency and co-occurrence patterns, zero-shot topic modeling leverages pre-trained language models to classify unseen data without task-specific training, making such models inherently sensitive to lexical choices. This study investigates how variations in lexical sophistication affect zero-shot topic modeling, focusing on potential biases in topic classification. Using the AG News dataset, original texts were paired with paraphrased versions generated by the PEGASUS model, and the lexical sophistication of each version was measured quantitatively. Analysis of RoBERTa’s topic predictions revealed moderate sensitivity to lexical changes, with a Lexical Bias Score (LBS) of 0.52. Topic shifts between original and paraphrased texts further highlighted the model’s occasional misinterpretation of context caused by subtle lexical differences. The study advances our understanding of how language models process lexical sophistication, offering insights for computational linguistics and psycholinguistic theory, and the findings underscore the need for continuous evaluation of pre-trained models to mitigate biases and improve fairness in NLP applications. Future research will explore cross-linguistic analyses, model comparisons, and the integration of human judgments to deepen the study of lexical sophistication in zero-shot learning contexts.
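The pipeline described in the abstract (paraphrase with PEGASUS, classify both versions zero-shot with RoBERTa, score disagreement) can be illustrated with a minimal sketch. The abstract does not name the exact checkpoints or define the Lexical Bias Score, so the `tuner007/pegasus_paraphrase` and `roberta-large-mnli` models below, and the reading of LBS as the fraction of original/paraphrase pairs whose predicted topic differs, are all assumptions for illustration only.

```python
# Hypothetical sketch of the evaluation loop; checkpoints and the LBS
# definition are assumptions, not the paper's stated configuration.
from transformers import pipeline

# PEGASUS checkpoint fine-tuned for paraphrasing (assumed).
paraphraser = pipeline("text2text-generation", model="tuner007/pegasus_paraphrase")

# RoBERTa fine-tuned on MNLI, used for zero-shot topic classification (assumed).
classifier = pipeline("zero-shot-classification", model="roberta-large-mnli")

# The four AG News topic labels.
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]

def predicted_topic(text: str) -> str:
    """Return the highest-scoring candidate label for a text."""
    result = classifier(text, candidate_labels=AG_NEWS_LABELS)
    return result["labels"][0]  # labels are sorted by score, descending

def lexical_bias_score(texts: list[str]) -> float:
    """One plausible LBS: share of texts whose predicted topic shifts
    after paraphrasing (the paper's exact formula is not given)."""
    shifts = 0
    for text in texts:
        paraphrase = paraphraser(text, max_length=60)[0]["generated_text"]
        if predicted_topic(text) != predicted_topic(paraphrase):
            shifts += 1
    return shifts / len(texts)
```

Under this reading, an LBS of 0.52 would mean roughly half of the paraphrased texts received a different topic label than their originals, which is consistent with the "moderate sensitivity" the abstract reports.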
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: computational psycholinguistics
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 4033