Can Vision Models Process Physiological Signals? Exploring Visual Tokenization as a Representation Interface

Published: 02 Mar 2026, Last Modified: 13 Mar 2026
ICLR 2026 Workshop MM Intelligence Poster
License: CC BY 4.0
Track: tiny paper (up to 4 pages)
Keywords: visual tokenization, multimodal alignment, emotion recognition, physiological signals, frozen vision transformers, contrastive learning
TL;DR: We explore visual tokenization as a unified representation interface for multimodal emotion recognition, examining when frozen pretrained features succeed and when they require domain adaptation.
Abstract: Multimodal foundation models increasingly use unified representation interfaces, but the tradeoffs between architectural unification and domain-specific tokenization remain unclear. We explore visual tokenization as a representation interface for emotion recognition by converting physiological signals and text into images processed by frozen Vision Transformers. This parameter-efficient approach, which trains only 0.85% of the model's weights, enables multimodal learning without modality-specific encoders. We identify when frozen pretrained features succeed and when they require domain adaptation, and we discuss design considerations for balancing unified processing with domain-appropriate inductive biases.
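The abstract does not specify how 1-D physiological signals are rendered as images before being fed to the frozen Vision Transformer. As a hedged illustration only, the sketch below uses a Gramian Angular Summation Field, one common choice for this kind of signal-to-image conversion; the function name, resolution, and method are assumptions, not the paper's actual pipeline.

```python
import numpy as np

def signal_to_gaf(signal, size=64):
    """Hypothetical sketch: render a 1-D signal as a Gramian Angular
    Summation Field image suitable for a ViT patch tokenizer."""
    # Resample the signal to the target image resolution.
    x = np.interp(np.linspace(0, len(signal) - 1, size),
                  np.arange(len(signal)), np.asarray(signal, dtype=float))
    # Rescale to [-1, 1] so the angular encoding (arccos) is defined.
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-8) - 1
    phi = np.arccos(np.clip(x, -1.0, 1.0))
    # GASF entry (i, j) = cos(phi_i + phi_j); values stay in [-1, 1].
    return np.cos(phi[:, None] + phi[None, :])
```

The resulting `(size, size)` array can be tiled to three channels and normalized like any natural image, after which the frozen ViT consumes it with no modality-specific encoder.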
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 39