Are LLMs Ready for English Standardized Tests? A Benchmarking and Elicitation Perspective

ICLR 2026 Conference Submission15031 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Intelligent Tutoring Systems, ESTBOOK Benchmark, Cross-modal Reasoning
TL;DR: Benchmarking LLMs on English Standardized Tests
Abstract: Large language models (LLMs) are transforming education by enabling powerful tools that enhance learning experiences, particularly in the context of English Standardized Tests (ESTs), which generate significant commercial value in the education industry. However, the fundamental problem-solving capabilities of LLMs on these tests remain largely underexplored. In this work, we evaluate the performance of LLMs on ESTs across a diverse range of question types. We introduce EstBOOK, a comprehensive benchmark designed to evaluate the capabilities of LLMs in solving EST questions. EstBOOK aggregates five widely recognized tests, encompassing 29 question types and 10,576 questions spanning multiple modalities, including text, images, audio, tables, and mathematical symbols. Using EstBOOK, we systematically evaluate both the accuracy and inference efficiency of LLMs. Additionally, we propose a breakdown analysis framework that decomposes complex EST questions into task-specific solution steps, allowing us to isolate and assess LLM performance at each stage of the reasoning process. Our findings offer insights into the capabilities of LLMs in educational contexts and point toward targeted strategies for improving their reliability as intelligent tutoring systems.
Primary Area: datasets and benchmarks
Submission Number: 15031