Lexical Sophistication and Zero-Shot Topic Modeling: Examining the Intersection of Word Choice and NLP Performance
Abstract: Lexical choice, the selection of specific words to convey meaning, plays a crucial role in both human communication and natural language processing (NLP). While traditional topic modeling methods such as Latent Dirichlet Allocation (LDA) rely on word frequency and co-occurrence patterns, zero-shot topic modeling leverages pre-trained language models to classify unseen data without task-specific training, making such models inherently sensitive to lexical choices. This study investigates how variations in lexical sophistication affect zero-shot topic modeling, focusing on potential biases in topic classification. Using the AG News dataset, original texts were paired with paraphrased versions generated by the PEGASUS model, and the lexical sophistication of each version was measured quantitatively. Analysis of RoBERTa’s topic predictions revealed moderate sensitivity to lexical changes, with a Lexical Bias Score (LBS) of 0.52. Topic shifts between original and paraphrased texts further highlighted the model’s occasional misinterpretation of context caused by subtle lexical differences. The study advances our understanding of how language models process lexical sophistication, offering insights for computational linguistics and psycholinguistic theory, and the findings underscore the need for continuous evaluation of pre-trained models to mitigate biases and improve fairness in NLP applications. Future research will explore cross-linguistic analyses, model comparisons, and the integration of human judgments to deepen the study of lexical sophistication in zero-shot learning contexts.
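The pipeline described in the abstract (paraphrase with PEGASUS, classify both versions zero-shot with RoBERTa, score disagreement) can be illustrated with a minimal sketch. The abstract does not name the exact checkpoints or define the Lexical Bias Score, so the `tuner007/pegasus_paraphrase` and `roberta-large-mnli` models below, and the reading of LBS as the fraction of original/paraphrase pairs whose predicted topic differs, are all assumptions for illustration only.

```python
# Hypothetical sketch of the evaluation loop; checkpoints and the LBS
# definition are assumptions, not the paper's stated configuration.
from transformers import pipeline

# PEGASUS checkpoint fine-tuned for paraphrasing (assumed).
paraphraser = pipeline("text2text-generation", model="tuner007/pegasus_paraphrase")

# RoBERTa fine-tuned on MNLI, used for zero-shot topic classification (assumed).
classifier = pipeline("zero-shot-classification", model="roberta-large-mnli")

# The four AG News topic labels.
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]

def predicted_topic(text: str) -> str:
    """Return the highest-scoring candidate label for a text."""
    result = classifier(text, candidate_labels=AG_NEWS_LABELS)
    return result["labels"][0]  # labels are sorted by score, descending

def lexical_bias_score(texts: list[str]) -> float:
    """One plausible LBS: share of texts whose predicted topic shifts
    after paraphrasing (the paper's exact formula is not given)."""
    shifts = 0
    for text in texts:
        paraphrase = paraphraser(text, max_length=60)[0]["generated_text"]
        if predicted_topic(text) != predicted_topic(paraphrase):
            shifts += 1
    return shifts / len(texts)
```

Under this reading, an LBS of 0.52 would mean roughly half of the paraphrased texts received a different topic label than their originals, which is consistent with the "moderate sensitivity" the abstract reports.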
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: computational psycholinguistics
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 4033