Abstract: Large Language Models (LLMs) are increasingly deployed in high-risk domains. However, state-of-the-art LLMs often produce hallucinations, raising serious concerns about their reliability. Prior work has explored adversarial prompting for hallucination elicitation, but existing methods often produce unrealistic prompts---either by inserting gibberish tokens or by altering the original meaning---thus offering limited insight into how hallucinations may occur in practice. In sharp contrast, adversarial attacks on computer vision systems often involve realistic modifications to the image that fool the classifier. To address this gap, we propose Semantically Equivalent and Coherent Attacks (SECA), which realistically elicit hallucinations via modifications to the prompt that preserve its meaning while maintaining coherence. Our contributions are threefold: (1) we formulate hallucination elicitation as a constrained optimization problem over the input prompt space under semantic equivalence and coherence constraints; (2) we introduce a constraint-preserving evolutionary algorithm to efficiently search for adversarial yet feasible prompts; and (3) we demonstrate through experiments on open-ended multiple-choice question answering tasks that, compared to existing methods, SECA achieves higher attack success rates while incurring almost no constraint violations. SECA highlights the sensitivity of both open-source and commercial gradient-inaccessible LLMs to realistic and plausible prompt variations.
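As an illustrative reading of contribution (1), the abstract's constrained optimization over prompts can be sketched as follows; the notation ($x_0$, $x$, $h$, $\mathrm{sem}$, $\mathrm{coh}$, and the thresholds) is assumed for illustration and is not taken from the paper itself:

$$
\max_{x \in \mathcal{X}} \; h(x)
\quad \text{subject to} \quad
\mathrm{sem}(x, x_0) \ge \tau_{\mathrm{sem}},
\qquad
\mathrm{coh}(x) \ge \tau_{\mathrm{coh}},
$$

where $x_0$ is the original prompt, $\mathcal{X}$ is the space of candidate rewrites, $h(x)$ scores how strongly the target LLM hallucinates when given $x$, and the two constraints encode semantic equivalence to $x_0$ and coherence (fluency) of the rewritten prompt, i.e., the feasibility conditions that the constraint-preserving evolutionary search of contribution (2) maintains.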