Beyond Flat Taxonomies: Hierarchical Capability Profiling for Time-Series Understanding and Reasoning in Large Models

Published: 01 Mar 2026, Last Modified: 13 Mar 2026 · ICLR 2026 TSALM Workshop Poster · CC BY 4.0
Presentation Attendance: No, we cannot present in-person
Keywords: time-series benchmark, multimodal time series understanding, temporal reasoning, large language model
Abstract: Time series analysis is increasingly shifting toward large foundation models capable of multimodal perception and complex temporal reasoning. However, existing benchmarks largely rely on flat task taxonomies, making it difficult to systematically evaluate compositional capabilities and diagnose failure modes in temporal understanding. In this work, we propose a hierarchical capability taxonomy that decomposes time series analysis into interdependent dimensions spanning structural perception, feature extraction, temporal reasoning, sequence matching, and cross-modal understanding. Guided by this taxonomy, we construct a real-world multimodal time-series question answering (TSQA) benchmark comprising 1,724 QA pairs across three complementary subsets—InWild, Match, and Align. The dataset is generated through a multi-stage, consistency-verified pipeline that integrates numerical signals, visual representations, domain context, and expert validation. We evaluate closed-source large language models (LLMs), open-source LLMs, and time-series-adapted foundation models (TS-LLMs). Our results reveal that the performance of current TS-LLMs is largely determined by backbone model capacity, with specialized time-series encoders providing only marginal gains under existing alignment paradigms, whereas multimodal inputs and explicit reasoning strategies substantially improve performance. These findings highlight both the limitations of current alignment approaches and the importance of capability-oriented evaluation for advancing robust temporal intelligence in large models.
Track: Research Track (max 4 pages)
Submission Number: 12