CodeInsightBench: A Benchmark for Advanced Code Understanding and Comparison in Large Language Models

03 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: code comprehension; code comparison; benchmark
Abstract: While Large Language Models (LLMs) have demonstrated significant progress in coding tasks, their capabilities in nuanced code analysis and deep comprehension remain insufficiently explored. To address this gap, we introduce CodeInsightBench, a comprehensive multilingual benchmark designed to systematically evaluate advanced code reasoning. Built on real-world Codeforces data, it employs both multiple-choice and open-ended questions to assess three core tasks: Semantic Code Judgment, Debugging Path Tracking, and Code Efficiency Comparison. We conduct extensive evaluations of 22 state-of-the-art LLMs (11 closed-source, 11 open-source), yielding critical insights into their strengths and limitations. Our results show substantial performance gaps across tasks, with closed-source models generally outperforming open-source counterparts. Moreover, models fail primarily on large-scale code transformations, indicating fundamental limitations in understanding code evolution logic. The results also indicate distinct programming-language preferences in code efficiency comparison and show that multiple sampling substantially improves semantic code judgment, with Pass@3 reaching 92.71% accuracy compared to 60.57% at Pass@1. By providing a comprehensive and systematic evaluation methodology, CodeInsightBench enables a deeper understanding of LLM capabilities in sophisticated code comprehension tasks.
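The abstract's Pass@1 vs. Pass@3 comparison suggests a standard multiple-sampling evaluation. Below is a minimal Python sketch, assuming Pass@k is scored as "at least one of k sampled answers is correct per question"; the function name pass_at_k and the toy correctness data are illustrative assumptions, not taken from the paper.

# Hypothetical sketch of a Pass@k-style accuracy computation.
# Assumes each question has a list of per-sample correctness flags;
# the exact estimator used by CodeInsightBench may differ.

def pass_at_k(per_question_samples: list[list[bool]], k: int) -> float:
    """Fraction of questions answered correctly by at least one of the first k samples."""
    solved = 0
    for samples in per_question_samples:
        if any(samples[:k]):  # correct if any of the first k samples is right
            solved += 1
    return solved / len(per_question_samples)

# Toy usage: 3 samples per question for 4 questions (correctness flags are made up).
results = [
    [False, True, False],
    [True, True, True],
    [False, False, False],
    [False, False, True],
]
print(f"Pass@1 = {pass_at_k(results, 1):.2%}")  # 25.00%
print(f"Pass@3 = {pass_at_k(results, 3):.2%}")  # 75.00%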
Primary Area: datasets and benchmarks
Submission Number: 1190