From Generation to Selection: Findings of Converting Analogical Problem-Solving into Multiple-Choice Questions
Abstract: As the reasoning abilities of artificial intelligence attract growing attention, constructing reliable benchmarks to evaluate those abilities is becoming increasingly important. The Abstraction and Reasoning Corpus (ARC) is one such benchmark, providing challenging problems that artificial intelligence has yet to solve. While ARC is recognized as a test of reasoning ability, it has a limitation: its generation-based evaluation method overlooks other aspects of assessment. Bloom's taxonomy, widely used in education, holds that good assessment should cover six levels, Remember, Understand, Apply, Analyze, Evaluate, and Create, in a stepwise manner. Since the current ARC evaluates only the Create level, we converted the benchmark into a language-based multiple-choice format, termed MC-LARC, so that it can also assess levels such as Understand and Apply and is better suited to evaluating large language models (LLMs). We evaluated the analogical reasoning abilities of ChatGPT4V with MC-LARC, confirming that a multiple-choice format 1) can support the language model's reasoning process and 2) facilitates analysis of the model's evidence. However, we observed that LLMs rely on shortcuts when tackling MC-LARC. By analyzing this behavior, we identified issues to consider when synthesizing multiple-choice options and, based on these findings, specified criteria for what constitutes good answer choices.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, benchmarking, language resources, evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 2928