Keywords: large language models, time series classification, probing methods, representation analysis, prompt-based evaluation, multimodal models, evaluation methodology
Abstract: Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time-series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the models' representational capacity, by directly comparing prompt outputs with linear probes trained over the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15–0.26 to 0.61–0.67, often matching or exceeding specialized time-series models. Layer-wise analyses further show that class-discriminative time-series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time-series understanding.
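The probing setup described in the abstract can be illustrated with a minimal sketch: a linear classifier trained on frozen per-example representations, evaluated with macro F1. The data here is synthetic (the array shapes, class count, and noise level are illustrative assumptions, standing in for hidden states extracted from an LLM layer), not the paper's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for frozen LLM hidden states at one layer:
# n time series, each mapped to a d-dimensional representation.
rng = np.random.default_rng(0)
n, d, n_classes = 300, 64, 3
labels = rng.integers(0, n_classes, size=n)
# Synthetic class-dependent structure so the probe has signal to find.
centers = rng.normal(size=(n_classes, d))
reps = centers[labels] + 0.5 * rng.normal(size=(n, d))

X_train, X_test, y_train, y_test = train_test_split(
    reps, labels, test_size=0.3, random_state=0, stratify=labels
)

# The probe itself: a linear classifier over frozen representations.
# The underlying model is never fine-tuned; only this layer is trained.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
macro_f1 = f1_score(y_test, probe.predict(X_test), average="macro")
print(f"probe macro F1: {macro_f1:.2f}")
```

Repeating this fit over representations taken from each transformer layer yields the kind of layer-wise analysis the abstract refers to.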
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing, representation learning, evaluation methodologies, robustness, data shortcuts/artifacts, calibration/uncertainty
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 8077