Abstract: The rapid proliferation of Large Language Models (LLMs) has significantly contributed to the development of equitable AI systems capable of factual question-answering (QA). However, no known study tests LLMs' robustness when presented with obfuscated versions of questions. To systematically evaluate these limitations, we propose a novel technique, **ObfusQAte**, and, leveraging it, introduce **ObfusQA**, a comprehensive, first-of-its-kind framework designed to examine LLM capabilities across three distinct dimensions: *(i) Named-Entity Indirection*, *(ii) Distractor Indirection*, and *(iii) Contextual Overload*. By capturing these fine-grained distinctions in language, ObfusQA provides a comprehensive benchmark for evaluating LLM robustness and adaptability. Our study observes that LLMs exhibit a tendency to fail or generate hallucinated responses when confronted with these increasingly nuanced variations. To foster research in this direction, we make ObfusQAte publicly available.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: Question Answering, Interpretability and Analysis of Models for NLP, Machine Learning for NLP, NLP Applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Previous URL: https://openreview.net/forum?id=vvSz7zzfJ5
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: The previous reviewers and meta-reviewer did not engage meaningfully with our documented revisions and provided vague feedback without addressing specific changes. We believe a fresh set of reviewers with relevant expertise would ensure a more constructive and fair evaluation.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: N/A
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: 2
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 3, Appendix
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 3, Appendix
C3 Descriptive Statistics: Yes
C3 Elaboration: 3, Appendix
C4 Parameters For Packages: Yes
C4 Elaboration: 3, Appendix
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: 2, Appendix
D2 Recruitment And Payment: Yes
D2 Elaboration: 2, Appendix
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: Yes
D5 Elaboration: 2, Appendix
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 563