Abstract: The rapid advancement of large language models (LLMs) demands robust, unbiased, and scalable evaluation methods. However, human annotations are costly to scale, model-based evaluations are susceptible to stylistic biases, and target-answer-based benchmarks are vulnerable to data contamination and cheating. We propose StructTest, a novel benchmark that evaluates LLMs on their ability to follow compositional instructions and generate structured outputs, providing an unbiased, cost-effective, and difficult-to-cheat evaluation framework. The tasks in StructTest require significant reasoning skills, and assessments are conducted deterministically using rule-based evaluators that can be easily extended to new tasks and datasets. By testing structured outputs across diverse domains (including Summarization, Code, HTML, and Math) and evaluating 17 popular LLMs, we demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o, establishing it as a robust proxy for measuring reasoning capabilities. We believe StructTest offers a critical and complementary approach to achieving objective and comprehensive model evaluation. Our code and data are available at https://anonymous.4open.science/r/StructTest-EF37/README.md.
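The abstract does not spell out what a rule-based evaluator looks like; below is a minimal, purely illustrative sketch of a deterministic check for a compositional formatting instruction (e.g., "summarize in exactly N bullet points"). The function name `evaluate_bullet_summary` and the specific instruction are hypothetical assumptions for illustration, not the actual StructTest implementation.

```python
def evaluate_bullet_summary(output: str, num_bullets: int, prefix: str = "- ") -> bool:
    """Deterministic, rule-based check of a structured-output instruction:
    the model output must contain exactly `num_bullets` non-empty lines,
    each starting with `prefix`.

    Illustrative sketch only; not taken from the StructTest codebase.
    """
    lines = [line for line in output.strip().splitlines() if line.strip()]
    if len(lines) != num_bullets:
        return False
    return all(line.startswith(prefix) for line in lines)


if __name__ == "__main__":
    model_output = "- Point one.\n- Point two.\n- Point three."
    print(evaluate_bullet_summary(model_output, num_bullets=3))  # True
```

Because such checks are pure string rules, they score outputs deterministically (no judge model, no reference answer), which is what makes this style of evaluation cheap to run and hard to game.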
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Sarath_Chandar1
Submission Number: 5489