Keywords: Historical reasoning, Temporal understanding, Multimodal learning, Language models, Historical event comprehension, Long-context modeling, Causal inference, Narrative generation, Knowledge grounding, Cognitive benchmarking, Historical knowledge
Abstract: We present HistoBench, a benchmark and dataset designed to evaluate and improve large language models' (LLMs) ability to reason about complex, temporally grounded historical narratives. While LLMs perform well on general language tasks, their historical understanding remains limited. HistoBench provides a richly annotated collection of global events, timelines, and causal chains, alongside an interactive timeline and global map that make the resource accessible for research and education. To assess reasoning at multiple depths, we introduce a set of 1,007 historical questions structured around Bloom's Taxonomy, spanning levels from factual recall (Remember) to higher-order reasoning (Evaluate and Create). Our results show that models perform well on spatial and entity recognition but struggle with temporal reasoning. Among the evaluated systems, DeepSeek-V3 consistently outperforms GPT-4o-mini and Gemma-3 across nearly all levels, achieving over 90% accuracy at the Evaluate and Create levels and demonstrating a stronger capacity for complex historical reasoning.
Primary Area: datasets and benchmarks
Submission Number: 24700