Rethinking MCQ Benchmarks: Mandatory Reasoning Evaluation Reveals Significant Performance Drops in Large Language Models
Keywords: LLM Evaluation, Multiple Choice Questions (MCQ) Benchmarks, Reasoning Evaluation
TL;DR: When reasoning evaluation is mandated for MCQ benchmarks beyond just the final answer, we observe significant performance drops (as large as 45 absolute percentage points) across 21 LLMs from three model families on two MCQ datasets.
Abstract: Rigorous evaluation of Large Language Models (LLMs) is critical for their adoption in high-stakes applications, particularly in highly technical domains that require deep expertise and specialized training. The proliferation of LLMs from various providers further underscores the need for comprehensive model performance benchmarking. Like many standardized tests and certification exams, several prominent LLM benchmark datasets, such as GPQA and MMLU, employ the Multiple Choice Questions (MCQ) format. Despite its convenience, prior research has demonstrated that the MCQ format oversimplifies assessment and may not accurately reflect true model capabilities. Traditional MCQ evaluation only assesses whether the final answer is correct, potentially allowing models to succeed through lucky guessing or flawed reasoning. In this work, we designed and applied a two-stage evaluation framework that mandates LLMs not only select the correct answer but also provide the reasoning or derivation process and a justification for each option. We evaluated 21 LLMs across three model families on two datasets: GPQA and a custom dataset of 500 AWS-related questions curated from public online materials. Our results show that LLMs consistently struggle when both answer accuracy and reasoning quality are required, with performance scores dropping significantly across all models, by 10 to 45 absolute percentage points, compared to conventional final-answer-only evaluation. Based on these findings, we recommend that all MCQ benchmarks incorporate explanation evaluation as a standard component.
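To make the two-stage protocol concrete, below is a minimal Python sketch under stated assumptions: the `query_llm` helper and the `grade_reasoning` judge are hypothetical placeholders, not the paper's actual implementation, and a real reasoning judge would score the quality of each per-option justification rather than merely check that every option is addressed.

```python
# Minimal sketch of a two-stage MCQ evaluation.
# Assumptions (hypothetical, not the paper's implementation):
#   - query_llm(prompt) -> str calls the model under evaluation.
#   - grade_reasoning() is a stand-in judge for reasoning quality.

from dataclasses import dataclass


@dataclass
class MCQItem:
    question: str
    options: dict[str, str]   # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer_key: str           # label of the correct option


def query_llm(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    raise NotImplementedError


def stage1_final_answer(item: MCQItem) -> str:
    """Stage 1: conventional evaluation -- ask only for the final answer letter."""
    options_text = "\n".join(f"{k}. {v}" for k, v in item.options.items())
    prompt = (
        f"{item.question}\n{options_text}\n"
        "Respond with only the letter of the correct option."
    )
    return query_llm(prompt).strip()[:1].upper()


def stage2_reasoning(item: MCQItem) -> str:
    """Stage 2: mandate a justification for every option before the final answer."""
    options_text = "\n".join(f"{k}. {v}" for k, v in item.options.items())
    prompt = (
        f"{item.question}\n{options_text}\n"
        "For each option, explain why it is correct or incorrect, then state "
        "the final answer on the last line as 'Answer: <letter>'."
    )
    return query_llm(prompt)


def grade_reasoning(response: str, item: MCQItem) -> bool:
    """Stand-in judge: every option must be addressed and the final answer
    must match the key. A real judge would also assess justification quality."""
    addressed = all(label in response for label in item.options)
    final_ok = response.rstrip().endswith(f"Answer: {item.answer_key}")
    return addressed and final_ok


def evaluate(items: list[MCQItem]) -> dict[str, float]:
    """Report answer-only accuracy versus answer-plus-reasoning accuracy."""
    answer_only = sum(stage1_final_answer(q) == q.answer_key for q in items)
    with_reasoning = sum(grade_reasoning(stage2_reasoning(q), q) for q in items)
    n = len(items)
    return {"answer_only": answer_only / n, "with_reasoning": with_reasoning / n}
```

The gap between the two accuracies reported by `evaluate` corresponds to the performance drop the abstract describes when reasoning is required in addition to the final answer.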
Submission Number: 117