CP-VLM: Causal Prompting for Human Intention Inference with Vision–Language Models

Published: 25 Mar 2026, Last Modified: 28 May 2026
Venue: CVPR 2026 Workshop CogVL Poster
Readers: Everyone
License: CC BY 4.0
Track: Track 2: Papers without Workshop Proceedings
Keywords: Human Intention Inference, Causal Reasoning, Vision–Language Models
Abstract: Understanding human intentions from visual observations is crucial for proactive collaboration in human–robot interaction. Although recent vision–language models excel at scene description, they struggle to move beyond surface correlations and capture the goal-directed reasoning underlying human activity. To address this limitation, we propose CP-VLM, a framework that enhances intention inference through causally inspired prompting. Our approach uses structured prompts that reflect human action dynamics to guide the model's reasoning about latent structures at a conceptual level. We apply low-rank adaptation to fine-tune the language decoder efficiently. Experiments on the JRDB-Social dataset show that CP-VLM outperforms baselines by 27.8% F1 while maintaining comparable inference time. These results confirm that incorporating causal structure enables deeper, more structured intention inference with minimal computational overhead.
Submission Number: 3