AURA: Visually Interpretable Affective Understanding via Robust Archetypes

ICLR 2026 Conference Submission 20735 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: valence-arousal, facial expressions, action units
Abstract: Text-driven vision--language methods (e.g., CLIP variants) face three persistent hurdles in affective computing: (i) limited support for continuous regression (e.g., Valence--Arousal), (ii) brittle reliance on language prompts, and (iii) the absence of a unified paradigm across expression classification, action-unit detection, and affective regression. We introduce AURA, a prompt-free framework that operates directly in a frozen CLIP visual space via visual archetypes. AURA comprises two components: (1) self-organized archetype discovery, which adaptively allocates the number of archetypes per affective state, assigning denser archetype sets to complex or ambiguous states for fine-grained interpretability, and (2) archetype contextualization, which models interactions among the most relevant archetypes and semantic tokens to enhance structural consistency while suppressing redundancy. Inference reduces to cosine matching between projected features and fixed archetypes. Across six datasets, AURA consistently surpasses prior state-of-the-art while remaining highly efficient. Overall, AURA unifies classification, detection, and regression under a single visual-archetype paradigm, delivering strong accuracy, cognitively aligned interpretability, and excellent training/inference efficiency.
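The abstract describes inference as prompt-free cosine matching between projected visual features and fixed archetypes. Below is a minimal, illustrative sketch of that matching step; the projection head, archetype count, embedding dimension, and the archetype-to-state mapping are all assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: the abstract does not specify dimensions, so these
# values are illustrative assumptions only.
num_archetypes, dim = 32, 512
# Fixed archetypes living in the (frozen) CLIP visual space.
archetypes = F.normalize(torch.randn(num_archetypes, dim), dim=-1)
# Assumed mapping from each archetype to one of several affective states.
archetype_to_state = torch.randint(0, 8, (num_archetypes,))

def infer_state(clip_visual_feature: torch.Tensor,
                projection: torch.nn.Module) -> int:
    """Prompt-free inference: project a frozen CLIP visual feature and
    match it to the fixed archetypes by cosine similarity."""
    z = F.normalize(projection(clip_visual_feature), dim=-1)  # projected feature
    sims = z @ archetypes.T                                    # cosine similarities
    best = sims.argmax().item()                                # most similar archetype
    return int(archetype_to_state[best])                       # its affective state

# Usage (illustrative): a random vector stands in for a CLIP image embedding.
proj = torch.nn.Linear(512, dim)
feature = torch.randn(512)
print(infer_state(feature, proj))
```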
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20735