Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks
Track: Tiny Paper Track (between 2 and 4 pages)
Keywords: Large Language Models (LLMs), Competitive Programming Difficulty, GPT-4o, Interpretable Machine Learning, Synthetic Problem Generation, Numeric Constraints, Automated Assessment
TL;DR: Evaluating the reliability of GPT-4o in programming difficulty classification and synthetic problem generation, revealing its biases and underperformance compared to interpretable models like LightGBM.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing but face challenges in structured tasks such as predicting the difficulty of competitive programming problems. We compare GPT-4o against an interpretable LightGBM ensemble on a dataset of 1,825 LeetCode problems labeled Easy, Medium, or Hard. Our experiments reveal that GPT-4o achieves only 37.75% accuracy, far below the 86% achieved by LightGBM. Detailed analyses, including confusion matrices and SHAP-based interpretability, show that numeric constraints play a crucial role in classifying harder problems; GPT-4o, by contrast, often overlooks such details and exhibits a bias toward simpler categories. We also investigate GPT-4o's performance in generating and classifying synthetic Hard problems. Surprisingly, GPT-4o labels almost all synthetic Hard problems as Medium, contradicting its behavior on real Hard problems. These findings have implications for automated difficulty assessment, educational platforms, and reinforcement learning pipelines that rely on LLM-based evaluations (see the illustrative sketch below).
Submission Number: 7
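Baseline pipeline sketch: a minimal Python example of the kind of pipeline the abstract describes, assuming problems have already been featurized into a numeric matrix. The feature matrix, labels, and feature semantics below are synthetic placeholders, not the paper's actual 1,825-problem dataset or engineered features.

import numpy as np
import lightgbm as lgb
import shap
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)

# Placeholder features standing in for engineered problem features
# (e.g. magnitude of numeric constraints, statement length).
n_problems = 1825
X = rng.normal(size=(n_problems, 3))
y = rng.integers(0, 3, size=n_problems)  # 0 = Easy, 1 = Medium, 2 = Hard

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Gradient-boosted tree baseline; the multiclass objective is inferred from y.
model = lgb.LGBMClassifier(random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("confusion matrix:\n", confusion_matrix(y_test, pred))

# Tree SHAP attributes each prediction to individual features, which is how
# the role of numeric constraints in Hard-problem classification can be probed.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
print("SHAP values computed for", len(X_test), "test problems")

The computed SHAP values can then be aggregated, for example as the mean absolute value per feature, to check whether constraint-related features dominate Hard-class predictions, mirroring the interpretability analysis described in the abstract.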