CP-VLM: Causal Prompting for Human Intention Inference with Vision–Language Models

KAZUKI OSAMURA; Hidetsugu Uchida; Narishige Abe

Published: 25 Mar 2026, Last Modified: 28 May 2026

Venue: CVPR 2026 Workshop CogVL (Poster)

License: CC BY 4.0

Track: Track 2: Papers without Workshop Proceedings

Keywords: Human Intention Inference, Causal Reasoning, Vision–Language Models

Abstract: Understanding human intentions from visual observations is crucial for proactive collaboration in human–robot interaction. Although recent vision–language models excel at scene description, they struggle to move beyond surface correlations and capture the goal-directed reasoning underlying human activity. To address this limitation, we propose CP-VLM, a framework that enhances intention inference through causally inspired prompting. Our approach uses structured prompts that reflect human action dynamics to guide the model to reason conceptually about latent structures. We apply low-rank adaptation to fine-tune the language decoder efficiently. Experiments on the JRDB-Social dataset show that CP-VLM outperforms baselines by 27.8% in F1 while maintaining comparable inference time. These results confirm that incorporating causal structure enables deeper, more structured intention inference with minimal computational overhead.

Submission Number: 3
