Multimodal Error Correction with Natural Language and Pointing Gestures

Published: 01 Jan 2023 · Last Modified: 05 Mar 2025 · ICCV (Workshops) 2023 · CC BY-SA 4.0
Abstract: Error correction is crucial in human-computer interaction, as it can provide supervision for incrementally learning artificial intelligence. If a system assigns entities of unknown class, such as objects or persons, to inappropriate existing classes, or misrecognizes entities of known classes due to a large train-test discrepancy, error correction is a natural way for a user to improve the system. For an agent with visual perception, if such an entity is in the system's view, pointing gestures can dramatically simplify the error correction. We therefore propose a modularized system for multimodal error correction using natural language and pointing gestures. First, pointing line generation and region proposal detect whether there is a pointing gesture and, if so, which candidate objects (i.e., RoIs) lie on the pointing line. Second, these RoIs (if any) and the user's utterances are fed into a VL-T5 network, which either extracts and links the class name and the corresponding RoI of the referred entity, or outputs that there is no error correction. In the latter case, the utterances can be passed to a downstream component for Natural Language Understanding. We evaluate our proposed system on additional, challenging annotations for an existing real-world pointing gesture dataset. Furthermore, we demonstrate our approach by integrating it on a real-world robot with a steerable laser pointer, enabling interactive multimodal error correction and thus incremental learning of new objects.
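To make the two-stage pipeline concrete, the following is a minimal Python sketch of the data flow. It is not the authors' implementation: the 2D line geometry, the pixel threshold `max_dist`, and the `vl_t5_link` callable are illustrative assumptions standing in for the paper's pointing-line, region-proposal, and VL-T5 components.

```python
import numpy as np

# Minimal sketch of the two-stage pipeline described in the abstract.
# The geometry, threshold, and `vl_t5_link` are assumed placeholders,
# not the authors' actual interfaces.

def rois_on_pointing_line(rois, origin, direction, max_dist=30.0):
    """Stage 1: keep region proposals whose centre lies near the pointing line.

    rois:      (N, 4) array of boxes (x1, y1, x2, y2) in image pixels
    origin:    (2,) start point of the projected pointing line
    direction: (2,) unit vector of the projected pointing direction
    max_dist:  perpendicular-distance threshold in pixels (assumed value)
    """
    centres = (rois[:, :2] + rois[:, 2:]) / 2.0    # box centres
    rel = centres - origin                         # vectors from line origin
    along = rel @ direction                        # signed projection onto line
    perp = np.abs(rel[:, 0] * direction[1]         # perpendicular distance
                  - rel[:, 1] * direction[0])
    keep = (along > 0) & (perp < max_dist)         # ahead of the hand, near line
    return rois[keep]

def correct_or_passthrough(utterance, candidate_rois, vl_t5_link):
    """Stage 2: a VL-T5-style model links the utterance to one candidate RoI
    and a class name, or signals that no error correction is intended."""
    result = vl_t5_link(utterance, candidate_rois)  # placeholder model call
    if result is None:                              # no correction: hand the
        return ("nlu_passthrough", utterance)       # utterance to downstream NLU
    class_name, roi = result                        # corrected class + region
    return ("correction", class_name, roi)
```

Filtering by the signed projection (`along > 0`) discards proposals behind the pointing hand, so only objects in the indicated direction reach the language model in stage 2.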