How to Get Your LLM to Generate Challenging Problems for Evaluation

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop (Poster)
License: CC BY 4.0
Keywords: Evaluation, Synthetic data, Benchmarking, Question Answering, Code Generation, Math Reasoning
TL;DR: We propose a framework for synthetically generating challenging problems to evaluate LLMs.
Abstract: The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impractical given the complexity and cost of creating high-quality, challenging problems. In this work, we introduce **CHASE**, a framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a difficult problem in a bottom-up manner from simpler components in a verifiable way. We implement CHASE to create evaluation benchmarks across three diverse domains on which state-of-the-art LLMs demonstrate severe vulnerabilities.
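To make the "bottom-up, verifiable" construction concrete, here is a minimal illustrative sketch of that idea. It is not the paper's released implementation: the names (`Component`, `build_hard_problem`, the `generate_component`, `compose`, and `solver` callables) are hypothetical stand-ins for whatever LLM-backed generators and checkers one plugs in, and the composition/verification logic is assumed for illustration only.

```python
# Sketch: bottom-up, verification-gated problem construction.
# All function names and the verification scheme are illustrative assumptions,
# not the CHASE paper's actual code.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Component:
    """A simple sub-problem together with its known, checkable answer."""
    statement: str
    answer: str


def verify(component: Component, solver: Callable[[str], str]) -> bool:
    """Keep a component only if an independent solver reproduces its answer."""
    return solver(component.statement).strip() == component.answer.strip()


def build_hard_problem(
    generate_component: Callable[[], Component],   # e.g., an LLM prompted for a simple sub-problem
    compose: Callable[[List[Component]], Component],  # merges verified parts into one harder problem
    solver: Callable[[str], str],                  # independent checker (LLM, program, or oracle)
    n_components: int = 4,
    max_tries: int = 20,
) -> Optional[Component]:
    """Accumulate verified simple components, then compose them into a
    harder problem and verify the composition as well."""
    verified: List[Component] = []
    tries = 0
    while len(verified) < n_components and tries < max_tries:
        tries += 1
        candidate = generate_component()
        if verify(candidate, solver):   # gate each building block individually
            verified.append(candidate)
    if len(verified) < n_components:
        return None                     # could not assemble enough verified parts
    hard = compose(verified)
    return hard if verify(hard, solver) else None
```

The key design point this sketch tries to capture is that difficulty is introduced compositionally while correctness is checked at every step, so the final problem's reference answer remains trustworthy even though no human wrote it.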
Submission Number: 129