MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering

ICLR 2026 Conference Submission 22538 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Multimodal Time Series, Time Series Question Answering

Abstract: Understanding the relationship between textual data and time-series evolution is a critical yet under-explored challenge in applied data science. While multimodal learning has gained traction, existing time-series benchmarks provide limited support for evaluating cross-modal reasoning and complex question answering, both of which are essential for capturing interactions between narrative information and temporal patterns. To bridge this gap, we introduce the Multimodal Time Series Benchmark (MTBench), a large-scale benchmark designed to evaluate large language models (LLMs) on joint reasoning over time series and text, exemplified through the financial and weather domains. MTBench consists of paired time-series and textual data, including financial analysis aligned with stock price movements and weather reports matched to historical temperature records. Unlike existing benchmarks focused on isolated modalities, MTBench offers a comprehensive testbed for language models to jointly reason over structured numerical trends and unstructured textual narratives. MTBench supports diverse tasks that require a deep understanding of both text and time-series data, including forecasting, semantic and technical trend analysis, and news-driven question answering (QA). These tasks assess a model's ability to capture temporal dependencies, extract key insights from text, and integrate cross-modal information. We benchmark state-of-the-art LLMs on MTBench, providing a systematic analysis of their effectiveness in capturing the causal relationships between textual narratives and temporal patterns. Our findings reveal significant challenges for current models, including difficulty with long-term dependencies, limited causal interpretation of financial and weather dynamics, and insufficient multimodal fusion. MTBench establishes a foundation for advancing multimodal time-series research and for developing the next generation of multimodal models capable of reasoning across narrative and time-series data.

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 22538
