Abstract: The rapid proliferation of Large Language Models (LLMs) has significantly contributed to the development of equitable AI systems capable of factual question-answering (QA). However, no known study tests LLMs' robustness when presented with obfuscated versions of questions. To systematically evaluate these limitations, we propose a novel technique, **ObfusQAte**, and, leveraging it, introduce **ObfusQA**, a comprehensive, first-of-its-kind framework designed to examine LLM capabilities across three distinct dimensions: *(i) Named-Entity Indirection*, *(ii) Distractor Indirection*, and *(iii) Contextual Overload*. By capturing these fine-grained distinctions in language, ObfusQA provides a comprehensive benchmark for evaluating LLM robustness and adaptability. Our study observes that LLMs exhibit a tendency to fail or generate hallucinated responses when confronted with these increasingly nuanced variations. To foster research in this direction, we make ObfusQAte publicly available.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: Question Answering, Interpretability and Analysis of Models for NLP, Machine Learning for NLP, NLP Applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Previous URL: https://openreview.net/forum?id=vvSz7zzfJ5
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: The previous reviewers and meta-reviewer did not engage meaningfully with our documented revisions and provided vague feedback without addressing specific changes. We believe a fresh set of reviewers with relevant expertise would ensure a more constructive and fair evaluation.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: N/A
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: 2
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 3, Appendix
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 3, Appendix
C3 Descriptive Statistics: Yes
C3 Elaboration: 3, Appendix
C4 Parameters For Packages: Yes
C4 Elaboration: 3, Appendix
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: 2, Appendix
D2 Recruitment And Payment: Yes
D2 Elaboration: 2, Appendix
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: Yes
D5 Elaboration: 2, Appendix
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 563