Keywords: video affective reasoning; multimodal learning
Abstract: Oddly Satisfying Videos (OSVs) elicit psychological comfort through precise audio-visual stimuli. However, existing multimodal large language models (MLLMs) predominantly focus on high-level semantic recognition, overlooking the fine-grained sensory dynamics and underlying affective mechanisms at play. To bridge this gap, we present \textbf{OSVAR}, a psychophysics-driven multimodal framework for Oddly Satisfying Video Affective Reasoning. OSVAR injects domain-specific sensory priors into multimodal models through three mechanisms: (1) \textbf{Visual Haptics}, which models motion predictability via optical flow intensity to capture the ``visual order'' inherent in satisfying content; (2) \textbf{Acoustic Purity}, which aligns features with ASMR triggers via constraints on dynamic range, non-speech probability, and timbre consistency; and (3) \textbf{Synesthesia}, which enforces cross-modal congruence via a fine-grained synchronization loss. Extensive experiments on our constructed dataset demonstrate that OSVAR significantly outperforms state-of-the-art baselines across multiple affective reasoning tasks, offering a novel direction for sensory-aware multimodal understanding.
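Note (illustrative, not from the paper): to make the visual-haptics prior concrete, the sketch below shows one plausible way to score motion predictability from optical flow intensity. The function name `visual_order_score`, the Farneback parameters, and the variance-based scoring are assumptions for illustration, not OSVAR's actual implementation.

```python
# Illustrative sketch (assumed, not the authors' code): estimate the
# "visual order" of a clip as the temporal stability of dense optical flow.
import cv2
import numpy as np

def visual_order_score(frames: list[np.ndarray]) -> float:
    """Higher score = more predictable motion (a proxy for 'visual order').

    frames: grayscale uint8 frames of identical shape, in temporal order.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    magnitudes = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        # Dense Farneback optical flow between consecutive frames.
        # Positional args: pyr_scale, levels, winsize, iterations,
        # poly_n, poly_sigma, flags (standard OpenCV defaults-ish values).
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0
        )
        # Mean per-pixel flow magnitude for this frame pair.
        magnitudes.append(np.linalg.norm(flow, axis=-1).mean())
    magnitudes = np.asarray(magnitudes)
    # Low frame-to-frame variance in flow intensity = smooth, repetitive
    # motion; invert so that higher means more "satisfying" regularity.
    return float(1.0 / (1.0 + magnitudes.std()))
```

The design intuition, under these assumptions, is that smooth repetitive motions typical of OSVs (slicing, pressing, pouring) yield a nearly constant flow magnitude over time, so low temporal variance serves as a cheap proxy for the predictability the visual-haptics mechanism is meant to capture.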
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal information extraction
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 9970