Abstract: The rapid advancement of LLMs' code generation capabilities is outpacing traditional evaluation methods. Static benchmarks fail to capture the depth and breadth of LLM capabilities and eventually become obsolete, while most dynamic approaches either rely too heavily on LLM-based evaluation or remain constrained by predefined test sets. To address these issues, we introduce PrismBench, a multi-agent, dynamic benchmarking framework designed to systematically expose and analyze LLM failure modes in code generation tasks. We formulate evaluation as a Markov Decision Process over a structured tree of coding challenges, leveraging a customized Monte Carlo Tree Search algorithm to traverse this tree and discover high-failure scenarios. Our multi-agent setup orchestrates task generation, model response, and analysis, enabling scalable assessment across diverse coding challenges. Additionally, we propose metrics that combine structural traversal patterns with performance across different tasks and difficulty levels to enable diagnostic and systematic comparison of LLMs' performance. We conduct extensive experiments on eight state-of-the-art LLMs and analyze how model architecture and scale influence code generation performance across varying coding tasks. All code, evaluation trees, and a public leaderboard are available at https://prismbench.github.io/Demo/
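As a rough illustration of the search described in the abstract, the sketch below runs a generic UCT-style tree search over coding-challenge nodes and treats a model's failure rate as the reward, so high-failure branches are revisited and deepened more often. This is a minimal sketch under stated assumptions: the `ChallengeNode` structure, the `concept`/`difficulty` fields, the exploration constant, and the `evaluate_challenge` stub (which returns a random score in place of PrismBench's multi-agent generation, solution, and analysis loop) are hypothetical stand-ins, not the framework's actual implementation.

```python
# Illustrative sketch only: generic UCT-style search over a tree of coding
# challenges, rewarding branches where the evaluated model fails often.
import math
import random


class ChallengeNode:
    def __init__(self, concept, difficulty, parent=None):
        self.concept = concept          # e.g. "recursion", "graphs"
        self.difficulty = difficulty    # e.g. 1 (easy) .. 5 (hard)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

    def uct_score(self, c=1.4):
        # Standard UCT: exploit the observed failure rate, explore rarely-visited nodes.
        if self.visits == 0:
            return float("inf")
        exploit = self.total_reward / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore


def evaluate_challenge(node):
    # Placeholder for the multi-agent loop (challenge generation, model solution,
    # test execution, analysis). Returns a failure rate in [0, 1]; here it is random.
    return random.random()


def mcts_step(root, concepts, max_difficulty=5):
    # Selection: descend by UCT until reaching a leaf.
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.uct_score())
    # Expansion: add harder challenge variants for each concept as children.
    if node.difficulty < max_difficulty:
        for concept in concepts:
            node.children.append(ChallengeNode(concept, node.difficulty + 1, parent=node))
        node = random.choice(node.children)
    # Simulation: run the (stub) evaluation and use the failure rate as reward.
    reward = evaluate_challenge(node)
    # Backpropagation: update statistics along the path back to the root.
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent


if __name__ == "__main__":
    root = ChallengeNode("arrays", difficulty=1)
    for _ in range(50):
        mcts_step(root, concepts=["recursion", "graphs", "dynamic programming"])
```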
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: 1. **Section 3.2.1** (**MDP Formalization**) has been revised to state PrismBench's MDP formulation accurately.
2. **Section 3.5** has been renamed from *Ensuring Evaluation Validity* to *Ensuring Evaluation Validity and Comparability* to better reflect the revised and newly added content:
- In the section opening, we added:
- Discussion of challenges in dynamic benchmarking approaches (high bias/variance).
- Details on false positives/negatives arising from executing generated solutions against generated tests.
- We added **Subsection 3.5.1** to discuss how our proposed approach addresses high bias.
- We added **Subsection 3.5.2** to discuss how our proposed approach addresses high variance.
- We added **Subsection 3.5.3** to discuss how we mitigate risks from using an LLM to dynamically generate challenges at test time.
- We added **Subsection 3.5.5** to clarify the role of the analyzer agents (judge LLMs) in our evaluation pipeline.
3. We added **Subsection 3.6.5** to compare our proposed metrics against standard coding metrics reported by common coding benchmarks.
4. We added **Subsection 4.1** to clarify that LeetCode-style challenges are an instantiation of end-to-end programming challenges used in our experiments.
5. We revised **Section 4.4** to detail the tunable parameters of our benchmarking framework and to explain the role and effect of each parameter throughout the benchmarking process.
6. We added to **Section 4.5** details on the hardware used for the experiments, along with the wall-clock time required to run them for each model under study.
7. In **Section 7**, we further clarified the use of LeetCode-style programming challenges and noted, as a threat to external validity, that the results may not be indicative of real-world coding capability.
Assigned Action Editor: ~Yossi_Adi1
Submission Number: 6314