Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have attracted considerable attention due to their impressive performance across various downstream tasks. However, these tasks predominantly emphasize third-person perspectives, and LVLMs demonstrate inadequate capability when reasoning from a first-person perspective. To mitigate these limitations, we introduce Causal Intervention with Active Learning (CIAL), an approach designed to augment the first-person reasoning capabilities of LVLMs. Specifically, our method first incorporates an Active Learning-driven Knowledge Extraction (ALKE) scheme, which utilizes LVLMs themselves to automatically acquire knowledge related to egocentric perspectives. Then, to ground the model's responses in the scene, a Knowledge-guided Causal Intervention (KCI) module is employed, so that LVLMs can integrate both knowledge and certainty scores during inference. Comprehensive experiments on the EgoThink benchmark demonstrate that our CIAL method significantly improves the models' ability to understand and reason in egocentric contexts. Our code is available at https://github.com/running-alpaca/CIAL/.
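As a rough illustration only (the abstract does not specify the actual fusion mechanism), the idea of combining knowledge-derived certainty scores with the model's own answer scores at inference time might be sketched as follows; the function names, the linear interpolation scheme, and the example data here are all hypothetical:

```python
# Hypothetical sketch: blend a model's raw answer scores with
# knowledge-based certainty scores, in the spirit of the KCI module
# described in the abstract. The weighting scheme is an assumption.

def fuse_scores(answer_scores, certainty, alpha=0.5):
    """Linearly interpolate model scores and knowledge certainty.

    answer_scores: dict mapping candidate answer -> model score in [0, 1]
    certainty:     dict mapping candidate answer -> certainty in [0, 1]
    alpha:         interpolation weight (illustrative default)
    """
    return {
        ans: alpha * answer_scores[ans] + (1 - alpha) * certainty.get(ans, 0.0)
        for ans in answer_scores
    }

def pick_answer(answer_scores, certainty, alpha=0.5):
    """Return the candidate with the highest fused score."""
    fused = fuse_scores(answer_scores, certainty, alpha)
    return max(fused, key=fused.get)

# Toy egocentric example: the raw model slightly prefers "a cup",
# but knowledge about the scene assigns higher certainty to "a bowl".
scores = {"a cup": 0.6, "a bowl": 0.4}
cert = {"a cup": 0.2, "a bowl": 0.9}
print(pick_answer(scores, cert))  # fused: cup 0.40, bowl 0.65 -> "a bowl"
```

This is only a toy linear fusion; the actual KCI module performs a causal intervention, whose formulation is given in the paper itself.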
External IDs: dblp:conf/icmcs/MengLWYPYX25