Keywords: large language models, time series classification, probing methods, representation analysis, prompt-based evaluation, multimodal models, evaluation methodology
Abstract: Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time-series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the models' representational capacity, by directly comparing prompt outputs with linear probes trained over the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15–0.26 to 0.61–0.67, often matching or exceeding specialized time-series models. Layer-wise analyses further show that class-discriminative time-series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time-series understanding.
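The probing setup described in the abstract can be illustrated with a minimal sketch: a linear classifier trained on frozen per-example representations, evaluated with macro F1. The data here is synthetic (the array shapes, class count, and noise level are illustrative assumptions, standing in for hidden states extracted from an LLM layer), not the paper's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for frozen LLM hidden states at one layer:
# n time series, each mapped to a d-dimensional representation.
rng = np.random.default_rng(0)
n, d, n_classes = 300, 64, 3
labels = rng.integers(0, n_classes, size=n)
# Synthetic class-dependent structure so the probe has signal to find.
centers = rng.normal(size=(n_classes, d))
reps = centers[labels] + 0.5 * rng.normal(size=(n, d))

X_train, X_test, y_train, y_test = train_test_split(
    reps, labels, test_size=0.3, random_state=0, stratify=labels
)

# The probe itself: a linear classifier over frozen representations.
# The underlying model is never fine-tuned; only this layer is trained.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
macro_f1 = f1_score(y_test, probe.predict(X_test), average="macro")
print(f"probe macro F1: {macro_f1:.2f}")
```

Repeating this fit over representations taken from each transformer layer yields the kind of layer-wise analysis the abstract refers to.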
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing, representation learning, evaluation methodologies, robustness, data shortcuts/artifacts, calibration/uncertainty
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 8077