AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

ACL ARR 2026 January Submission2301 Authors

02 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Affective Image Content Analysis, Vision Language Model
Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA)—which integrates perception, reasoning, and generation into a unified framework—remains underexplored. To address this, we introduce AICA-Bench, a comprehensive benchmark comprising three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Generation (EGCG). We evaluate 23 VLMs, revealing critical gaps: models struggle with intensity calibration and suffer from descriptive shallowness in open-ended tasks. To bridge these gaps, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that integrates visual scaffolding with hierarchical reasoning. Experiments show that GAT effectively corrects intensity errors and significantly enhances descriptive depth, establishing a robust baseline for future affective multimodal research.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 2301