Can Vision Models Process Physiological Signals? Exploring Visual Tokenization as a Representation Interface

Published: 02 Mar 2026, Last Modified: 13 Mar 2026
ICLR 2026 Workshop MM Intelligence Poster
License: CC BY 4.0
Track: tiny paper (up to 4 pages)
Keywords: visual tokenization, multimodal alignment, emotion recognition, physiological signals, frozen vision transformers, contrastive learning
TL;DR: We explore visual tokenization as a unified representation interface for multimodal emotion recognition, examining when frozen pretrained features succeed and when they require domain adaptation.
Abstract: Multimodal foundation models increasingly use unified representation interfaces, but the tradeoffs between architectural unification and domain-specific tokenization remain unclear. We explore visual tokenization as a representation interface for emotion recognition by converting physiological signals and text into images processed by frozen Vision Transformers. This parameter-efficient approach, which trains only 0.85% of the model's weights, enables multimodal learning without modality-specific encoders. We identify when frozen pretrained features succeed and when they require domain adaptation, and we discuss design considerations for balancing unified processing with domain-appropriate inductive biases.
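The abstract does not specify how 1-D physiological signals are rendered as images before being fed to the frozen Vision Transformer. As a hedged illustration only, the sketch below uses a Gramian Angular Summation Field, one common choice for this kind of signal-to-image conversion; the function name, resolution, and method are assumptions, not the paper's actual pipeline.

```python
import numpy as np

def signal_to_gaf(signal, size=64):
    """Hypothetical sketch: render a 1-D signal as a Gramian Angular
    Summation Field image suitable for a ViT patch tokenizer."""
    # Resample the signal to the target image resolution.
    x = np.interp(np.linspace(0, len(signal) - 1, size),
                  np.arange(len(signal)), np.asarray(signal, dtype=float))
    # Rescale to [-1, 1] so the angular encoding (arccos) is defined.
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-8) - 1
    phi = np.arccos(np.clip(x, -1.0, 1.0))
    # GASF entry (i, j) = cos(phi_i + phi_j); values stay in [-1, 1].
    return np.cos(phi[:, None] + phi[None, :])
```

The resulting `(size, size)` array can be tiled to three channels and normalized like any natural image, after which the frozen ViT consumes it with no modality-specific encoder.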
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 39