Abstract: Recent advances in deep learning for robotic vision have yielded remarkable performance in robot-object interaction, including scene understanding and visual affordance learning. Nevertheless, the intrinsically opaque nature of deep neural networks and the consequent lack of explainability pose significant challenges. Understanding how these intelligent systems perceive and justify their decisions remains elusive to human comprehension. Although research efforts have focused extensively on enhancing the explainability of object recognition, achieving explainability in visual affordance learning for intelligent systems remains an ongoing challenge. To address this issue, we propose a novel post-hoc multimodal explainability framework that capitalizes on the emerging synergy between vision and language models. Our framework first generates a Class Activation Map (CAM) heatmap for the given affordances to provide visual explainability. It then systematically extracts textual explanations from a state-of-the-art Large Language Model (LLM), namely GPT-4, using the CAM to enrich the explainability of visual affordance learning. In addition, by harnessing the zero-shot learning capabilities of LLMs, we show that they can intuitively articulate the behaviour of intelligent systems in affordance learning tasks. We evaluate the efficacy of our approach on a comprehensive benchmark dataset for large-scale multi-view RGBD visual affordance learning, comprising 47,210 RGBD images spanning 37 object categories annotated with 15 visual affordance categories. Our experimental findings underscore the promising performance of the proposed framework. The code is available at: https://github.com/ai-voyage/affordance-xai.git.
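To make the two-stage pipeline described above concrete, the following is a minimal, hypothetical sketch of a post-hoc CAM-plus-LLM explanation loop. The backbone model, image path, activation threshold, and prompt wording are illustrative assumptions and do not reproduce the authors' released framework (see the linked repository for the actual implementation).

```python
# Hypothetical sketch: (1) compute a Class Activation Map for a predicted
# affordance, (2) summarise the most activated region as text, and
# (3) ask GPT-4 to verbalise an explanation of the prediction.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image
from openai import OpenAI

# --- 1. CAM for the predicted class -----------------------------------------
model = models.resnet18(weights="IMAGENET1K_V1").eval()  # stand-in backbone, not the paper's model
features = {}
model.layer4.register_forward_hook(lambda m, i, o: features.update(feat=o))

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
img = preprocess(Image.open("object.png").convert("RGB")).unsqueeze(0)  # example input path

with torch.no_grad():
    logits = model(img)
cls = logits.argmax(dim=1).item()                 # predicted class id (affordance id in the framework)
fc_w = model.fc.weight[cls]                       # classifier weights for that class
cam = torch.einsum("c,chw->hw", fc_w, features["feat"][0])  # weighted sum of last conv feature maps
cam = F.relu(cam)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # normalise heatmap to [0, 1]

# --- 2. Summarise the heatmap as text for the LLM ----------------------------
ys, xs = torch.nonzero(cam > 0.7, as_tuple=True)  # strongly activated cells (threshold is an assumption)
summary = (f"Predicted class id: {cls}. High-activation region spans rows "
           f"{ys.min().item()}-{ys.max().item()} and columns "
           f"{xs.min().item()}-{xs.max().item()} of a {cam.shape[0]}x{cam.shape[1]} CAM grid.")

# --- 3. Ask GPT-4 for a textual explanation ----------------------------------
client = OpenAI()  # expects OPENAI_API_KEY in the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Explain, in plain language, why the highlighted region "
                          "could support this affordance prediction.\n" + summary}],
)
print(response.choices[0].message.content)
```

In this sketch the heatmap is converted into a brief textual summary before prompting the LLM; a multimodal variant could instead pass the heatmap overlay directly to a vision-capable model, which is closer in spirit to the multimodal framework described in the abstract.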