PrismBench: Dynamic and Flexible Benchmarking of LLMs' Code Generation with Monte Carlo Tree Search

TMLR Paper 6314 Authors

26 Oct 2025 (modified: 03 Nov 2025) · Under review for TMLR · CC BY 4.0
Abstract: The rapid advancement of LLMs' code generation capabilities is outpacing traditional evaluation methods. Static benchmarks fail to capture the depth and breadth of LLM capabilities and eventually become obsolete, while most dynamic approaches either rely too heavily on LLM-based evaluation or remain constrained by predefined test sets. To address these issues, we introduce PrismBench, a multi-agent, dynamic benchmarking framework designed to systematically expose and analyze LLM failure modes in code generation tasks. We formulate evaluation as a Markov Decision Process over a structured tree of coding challenges and use a customized Monte Carlo Tree Search algorithm to traverse this tree and discover high-failure scenarios. Our multi-agent setup orchestrates task generation, model response, and analysis, enabling scalable assessment across diverse coding challenges. In addition, we propose metrics that combine structural traversal patterns with performance across tasks and difficulty levels, enabling diagnostic and systematic comparison of LLMs. We conduct extensive experiments on eight state-of-the-art LLMs and analyze how model architecture and scale influence code generation performance across a range of coding tasks. All code, evaluation trees, and a public leaderboard are available at https://prismbench.github.io/Demo/.
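As a rough illustration of the search described in the abstract, the sketch below shows a generic UCT-style Monte Carlo Tree Search over a tree of (concept, difficulty) challenge nodes, where the backpropagated reward is the evaluated model's failure rate, so the search drifts toward high-failure scenarios. The `ChallengeNode` structure, the `evaluate` placeholder, the expansion rule, and the exploration constant are illustrative assumptions, not the paper's actual implementation.

```python
import math
import random
from dataclasses import dataclass, field

# Hypothetical node in a tree of coding challenges: each node pairs a
# concept (e.g., "recursion") with a difficulty level. The failure-rate
# reward and all names below are assumptions for illustration only.

@dataclass
class ChallengeNode:
    concept: str
    difficulty: int
    visits: int = 0
    total_failure: float = 0.0          # accumulated failure-rate reward
    children: list = field(default_factory=list)

    def uct(self, parent_visits: int, c: float = 1.4) -> float:
        """Standard UCT score; unvisited nodes are explored first."""
        if self.visits == 0:
            return float("inf")
        exploit = self.total_failure / self.visits
        explore = c * math.sqrt(math.log(parent_visits) / self.visits)
        return exploit + explore


def evaluate(node: ChallengeNode) -> float:
    """Placeholder rollout: a real system would generate a task for
    (concept, difficulty), query the model under test, run its tests,
    and return the observed failure rate."""
    return random.random()


def mcts_step(root: ChallengeNode) -> None:
    """One select -> expand -> evaluate -> backpropagate iteration that
    steers the search toward high-failure regions of the challenge tree."""
    path, node = [root], root
    # Selection: descend by UCT until a leaf is reached.
    while node.children:
        node = max(node.children, key=lambda ch: ch.uct(node.visits))
        path.append(node)
    # Expansion: add a harder variant of the same concept (one of many
    # possible expansion rules; chosen here purely for illustration).
    child = ChallengeNode(node.concept, node.difficulty + 1)
    node.children.append(child)
    path.append(child)
    # Evaluation and backpropagation of the failure-rate reward.
    reward = evaluate(child)
    for n in path:
        n.visits += 1
        n.total_failure += reward


root = ChallengeNode("recursion", difficulty=1)
for _ in range(100):
    mcts_step(root)
```

In this sketch, nodes with higher average failure rates accumulate larger exploitation terms, so subsequent iterations revisit and deepen those branches; the exploration term keeps less-visited concepts from being ignored entirely.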
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Yossi_Adi1
Submission Number: 6314