Giving Sensors a Voice: Multimodal JEPA for Semantic Time-Series Embeddings

Published: 01 Mar 2026, Last Modified: 11 Apr 2026
Venue: ICLR 2026 TSALM Workshop Poster
License: CC BY 4.0
Presentation Attendance: No, we cannot present in-person
Keywords: Time Series, Representation Learning, Self-Supervised Learning, JEPA, Multimodal, Channel-Aware
Abstract: We introduce CHARM (Channel-Aware Representation Model), a multimodal architecture for self-supervised time series representation learning that incorporates channel-level textual descriptions into both temporal convolutional and attention layers. This enables the model to reason about sensor identity and inter-channel relationships while remaining invariant to channel ordering. Trained with a Joint Embedding Predictive Architecture (JEPA), CHARM learns temporally stable, noise-robust embeddings by predicting in latent space rather than reconstructing raw signals. Across classification, forecasting, and anomaly detection benchmarks, CHARM's frozen embeddings with a lightweight linear probe match or outperform significantly larger task-specific foundation models.
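The JEPA objective described in the abstract, predicting masked content in latent space with a context encoder, an EMA target encoder, and a small predictor, rather than reconstructing the raw signal, can be illustrated with a toy NumPy sketch. This is not CHARM's actual architecture; the linear encoders, dimensions, and mask span below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: C channels, T timesteps, D latent dims.
C, T, D = 3, 32, 8

def encode(x, W):
    # Toy "encoder": a linear map over the time axis plus a tanh.
    return np.tanh(x @ W)

W_ctx = rng.normal(scale=0.1, size=(T, D))  # context encoder weights
W_tgt = W_ctx.copy()                        # target encoder (EMA copy)
W_pred = np.eye(D)                          # toy latent-space predictor

x = rng.normal(size=(C, T))   # one multichannel time series
mask = np.zeros(T, dtype=bool)
mask[16:24] = True            # span the context encoder does not see

x_ctx = x.copy()
x_ctx[:, mask] = 0.0          # masked view fed to the context encoder

z_ctx = encode(x_ctx, W_ctx)  # (C, D) context embeddings
z_tgt = encode(x, W_tgt)      # (C, D) targets from the full signal
z_hat = z_ctx @ W_pred        # prediction happens in latent space

# JEPA-style loss: compare embeddings, never raw samples.
loss = np.mean((z_hat - z_tgt) ** 2)

# EMA update of the target encoder (with stop-gradient in practice).
tau = 0.99
W_tgt = tau * W_tgt + (1 - tau) * W_ctx
```

Because the loss is taken between embeddings, the model is free to discard unpredictable noise in the masked span, which is the source of the noise-robustness the abstract claims.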
Track: Research Track (max 4 pages)
Submission Number: 20