Abstract: Highlights
• We introduce the explainable interaction visual question-answering task, which requires answering questions involving reasoning about human activity beyond the given image and explaining how the answers are obtained, making the answers more trustworthy.
• We build a VQER task for evaluating human action-related questions that go beyond the image.
• We propose an explaining-and-answering model that fuses heterogeneous information.