When All Options Are Wrong: Evaluating Large Language Model Robustness with Incorrect Multiple-Choice Options
Abstract: This paper examines the zero-shot ability of Large Language Models (LLMs)
to detect multiple-choice questions with no correct answer, a crucial aspect of
educational assessment quality. We explore this ability not only as a measure of
subject matter knowledge but also as an indicator of critical thinking within LLMs.
Our experiments, utilizing a range of LLMs on diverse questions, highlight the
significant performance gap between questions with a single correct answer and
those without. Llama-3.1-405B stands out by successfully identifying the lack
of a valid answer in many instances. These findings suggest that LLMs should
prioritize critical thinking over blind instruction following, and they caution
against deploying LLMs in educational settings where questions lacking a correct
answer might lead to inaccurate evaluations. This research establishes a benchmark
for assessing critical thinking in LLMs and underscores the need for continued model
alignment so that models genuinely comprehend and assist their users.
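The abstract does not specify the exact prompting protocol; the snippet below is a minimal sketch, under assumptions, of how a zero-shot probe for "no correct answer" questions could be framed. The `query_model` callable and the answer-parsing heuristic are hypothetical and not taken from the paper.

```python
# Minimal sketch (not from the paper): a zero-shot probe for questions whose
# options are all incorrect. `query_model` is a hypothetical callable that
# sends a prompt to an LLM and returns its raw text response.
from typing import Callable, List


def build_prompt(question: str, options: List[str]) -> str:
    """Format a multiple-choice question without asserting that a valid answer exists."""
    letters = "ABCDE"
    lines = [f"Question: {question}", "Options:"]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option, "
                 "or state that none of the options is correct.")
    return "\n".join(lines)


def flags_no_valid_answer(response: str) -> bool:
    """Heuristically detect whether the model rejected all offered options."""
    response = response.lower()
    return any(phrase in response for phrase in
               ("none of the options", "no correct answer", "none of the above"))


def evaluate(question: str, options: List[str],
             query_model: Callable[[str], str]) -> bool:
    """Return True if the model identifies that no option is correct."""
    prompt = build_prompt(question, options)
    return flags_no_valid_answer(query_model(prompt))


# Usage example with a question whose options are all wrong (requires a real
# model-calling function supplied by the user):
# evaluate("What is 2 + 2?", ["3", "5", "22"], query_model=my_llm_call)
```

A phrase-matching parser like this is only illustrative; a real evaluation would need a more robust way to decide whether the model selected an option or rejected them all.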