['3c3,4', '< Abstract: Figure 1: Compared to the QA tasks in existing VQA (Antol et al., 2015) and EQA (Das et al., 2018a) datasets, the models to answer EQA tasks in our EQA-MX dataset require the reasoning of questions with multimodal expressions (verbal and nonverbal gestures).', '---', '> Abstract:', '> Understanding human instructions, which often involve multimodal expressions (verbal and nonverbal gestures), is crucial for autonomous agents. While Visual Question Answering (VQA) and Embodied Question Answering (EQA) have advanced instruction comprehension, existing datasets primarily focus on verbal questions and lack the diversity of real-world multimodal interactions and perspectives. To address these limitations, we introduce EQA-MX, a novel large-scale dataset designed for Embodied Question Answering tasks that require reasoning over multimodal expressions. EQA-MX features 8 new EQA tasks, incorporating verbal utterances and nonverbal gestures from multiple visual and verbal perspectives, thereby reducing perspective bias and enhancing model generalizability. Furthermore, we propose VQ-Fusion, a vector quantization-based multimodal learning model that effectively aligns continuous visual and discrete verbal representations through shared codebooks, learning unified concepts across multiple views. Our extensive experimental analyses demonstrate that VQ-Fusion significantly improves the performance of state-of-the-art visual-language models on EQA tasks, achieving up to a 13% increase in accuracy. EQA-MX and VQ-Fusion provide a robust benchmark and methodology for developing more capable models for embodied human-AI interaction.', '7c8', '< Numerous synthetic and real-world datasets exist for VQA, yet their sole focus on verbal questions is a crucial limitation, contrasting with natural multimodal expressions (verbal utterances and nonverbal gestures) used in inquiries. 
Studies affirm that nonverbal gestures often provide complementary information for understanding verbal questions (McNeill, 2012;Corkum & Moore, 1998;Butterworth et al., 2002;Scaife & Bruner, 1975;Colonnesi et al., 2010;Iverson & Goldin-Meadow, 2005;Kita, 2003;Liszkowski et al., 2004;Chen et al., 2021). For example, in a scene with two differently Table 1: Comparison of the QA datasets. Existing VQA and EQA datasets do not contain nonverbal gestures (NV), multiple verbal (V) perspectives (MVP), contrastive (C) and ambiguous (A) data samples. ‡ Embodied (E) interactions refer to humans interacting using multimodal expressions. † Embodied interactions refer to an agent navigating in an environment. Please check the supplementary for a detailed comparison with other related datasets.', '---', '> While numerous VQA datasets exist, their exclusive focus on verbal questions overlooks the natural multimodal expressions (verbal utterances and nonverbal gestures) prevalent in human communication. This constitutes a crucial limitation for developing truly collaborative autonomous agents. Studies affirm that nonverbal gestures often provide complementary information for understanding verbal questions (McNeill, 2012;Corkum & Moore, 1998;Butterworth et al., 2002;Scaife & Bruner, 1975;Colonnesi et al., 2010;Iverson & Goldin-Meadow, 2005;Kita, 2003;Liszkowski et al., 2004;Chen et al., 2021). For example, in a scene with two differently Table 1: Comparison of the QA datasets. Existing VQA and EQA datasets do not contain nonverbal gestures (NV), multiple verbal (V) perspectives (MVP), contrastive (C) and ambiguous (A) data samples. ‡ Embodied (E) interactions refer to humans interacting using multimodal expressions. † Embodied interactions refer to an agent navigating in an environment. 
Please check the supplementary for a detailed comparison with other related datasets.', '16,19c17,20', '< Following VQA, embodied question-answering (EQA) tasks have recently been studied in the literature (Yu et al., 2019;Luo et al., 2019;Gordon et al., 2018;Tan et al., 2020). EQA can be bifurcated based on embodied interactions: the first centers on an agent, like a virtual robot, navigating to answer verbal questions (Das et al., 2018a), solely incorporating verbal queries. The second encompasses multimodal expressions, where humans interact with the environment using verbal utterances and gestures (Chen et al., 2021;Islam et al., 2022a). Adopting the latter definition, we designed EQA tasks to comprehend questions using multimodal expressions (verbal uttrances and nonverbal gestures) in embodied settings. For instance, an EQA task may involve pointing to an object and asking "what is that object?" requiring reasoning over multimodal expressions to answer the question.', '< A notable limitation in many existing VQA and EQA datasets is the singular perspective (either speaker or observer) of verbal utterances, unlike real-world interactions where where people use both perspective interchangeably. For instance, a speaker\'s question, "What is the object to the right of the red mug?" could be interpreted as left of the red mug from an observer\'s perspective. This lack of multiple perspectives in existing datasets hinders the development of robust QA models.', '< Similarly, existing VQA and EQA models answer questions from a single visual perspective (Li et al., 2019;Kim et al., 2021;Lu et al., 2019). Multiple views provide complementary information, and varying camera angles capture interactions differently. Aligning visual representations before merging with verbal ones can aid in developing generalized representations and robust comprehension across perspectives. 
Moreover, the inconsistency of embedding structures, particularly continuous visual and discrete verbal representations, can lead to sub-optimal representations.', '< To address the shortcomings of existing VQA and EQA datasets, we have extended an embodied simulator to develop a large-scale novel dataset, EQA-MX, for comprehending EQA tasks (Table 9). We have addressed the limitations of existing multimodal fusion approaches and developed a multimodal learning model for EQA tasks, VQ-Fusion, using vector quantization (VQ). The VQ-based bottleneck plays a key role in disentangling the continuous visual representations into discrete embeddings and enables salient fusion with discrete verbal representations. We use a shared codebook in VQ to align multiview representations and learn the unified concept shared among multiple views. We highlight our key contributions below:', '---', '> Following VQA, embodied question-answering (EQA) tasks have recently been studied in the literature (Yu et al., 2019;Luo et al., 2019;Gordon et al., 2018;Tan et al., 2020). EQA tasks typically fall into two categories: those where an agent navigates to answer verbal questions (Das et al., 2018a), and those involving multimodal human-environment interactions through verbal utterances and gestures (Chen et al., 2021;Islam et al., 2022a). We adopt the latter definition, designing EQA tasks that require comprehending questions posed with multimodal expressions in embodied settings. For instance, an EQA task might involve pointing to an object and asking "what is that object?", necessitating reasoning over both verbal and nonverbal cues.', '> A notable limitation in many existing VQA and EQA datasets is the singular perspective (either speaker or observer) of verbal utterances, unlike real-world interactions where people use both perspectives interchangeably. For instance, a speaker\'s question, "What is the object to the right of the red mug?" 
could be interpreted as left of the red mug from an observer\'s perspective. This lack of multiple perspectives in existing datasets hinders the development of robust QA models.', '> Similarly, existing VQA and EQA models answer questions from a single visual perspective (Li et al., 2019;Kim et al., 2021;Lu et al., 2019). Multiple views provide complementary information, and varying camera angles capture interactions differently. Therefore, aligning these multiview visual representations before merging with verbal ones is crucial for developing generalized representations and robust comprehension across diverse perspectives. Moreover, the inconsistency of embedding structures, particularly continuous visual and discrete verbal representations, can lead to sub-optimal multimodal representations.', '> To address the shortcomings of existing VQA and EQA datasets, we have extended an embodied simulator to develop a large-scale novel dataset, EQA-MX, for comprehending EQA tasks (Table 9). Furthermore, we propose VQ-Fusion, a novel vector quantization (VQ)-based multimodal learning model designed to overcome limitations of existing fusion approaches. Its VQ-based bottleneck effectively disentangles continuous visual representations into discrete embeddings, enabling salient fusion with discrete verbal representations. By employing a shared codebook, VQ-Fusion aligns multiview representations and learns unified concepts across multiple visual perspectives. Our key contributions are:', '23c24', '< Visual Question Answering: Many datasets have been developed to study visual questionanswering tasks (Gao et al., 2015;Zhu et al., 2016;Liu et al., 2019;Krishna et al., 2017;Wang et al., 2015;2017;Kembhavi et al., 2016;Kahou et al., 2017;Kafle et al., 2018;Gurari et al., 2018;Hasan et al., 2018;Huang et al., 2018;Andreas et al., 2016;Chou et al., 2020;Hudson & Manning, 2019;Mishra et al., 2019). 
These datasets primarily involve answering verbal questions using the visual scene as context. For example, Antol et al. (2015) developed a VQA dataset and introduced QA tasks involving an image and verbal questions about the image. This dataset contains both real-world images from the MS-COCO dataset (Lin et al., 2014) as well as synthetic virtual scenes containing clipart. Ren et al. (2015) generated synthetic QA pairs using an algorithm that converts image descriptions into QA form. Recently, a few datasets have been developed containing multimodal expressions (Schauerte & Fink, 2010;Islam et al., 2022b). For example, Chen et al. ( 2021) developed a dataset for referring expression comprehension tasks in embodied settings, where a human uses multimodal expressions to refer to an object.', '---', '> Visual Question Answering: Numerous datasets have been developed to study visual question answering tasks (Gao et al., 2015;Zhu et al., 2016;Liu et al., 2019;Krishna et al., 2017;Wang et al., 2015;2017;Kembhavi et al., 2016;Kahou et al., 2017;Kafle et al., 2018;Gurari et al., 2018;Hasan et al., 2018;Huang et al., 2018;Andreas et al., 2016;Chou et al., 2020;Hudson & Manning, 2019;Mishra et al., 2019). These datasets primarily involve answering verbal questions using the visual scene as context. For example, Antol et al. (2015) developed a VQA dataset and introduced QA tasks involving an image and verbal questions about the image. This dataset contains both real-world images from the MS-COCO dataset (Lin et al., 2014) as well as synthetic virtual scenes containing clipart. Ren et al. (2015) generated synthetic QA pairs using an algorithm that converts image descriptions into QA form. While some recent datasets incorporate multimodal expressions (Schauerte & Fink, 2010;Islam et al., 2022b), such as Chen et al. 
(2021), who developed a dataset for referring expression comprehension tasks in embodied settings where a human uses multimodal expressions to refer to an object, they typically do not cover the breadth of EQA tasks or diverse perspectives addressed by EQA-MX.', '25c26', '< Embodied Question Answering (EQA): EQA tasks are often designed as agents (e.g., virtual robots) navigating in an environment to answer a verbal question. For example, Das et al. (2018a) developed a synthetic dataset, where a virtual robot navigates the environment and gathers visual information from an egocentric view to answer a verbal question. Yu et al. (2019) extend this dataset and include questions with multiple targets, such as finding multiple objects through navigation. However, some works have used embodied interactions to refer to comprehending referring expressions (Islam et al., 2022b;Chen et al., 2021). We follow this definition of embodied interaction.', '---', '> Embodied Question Answering (EQA): EQA tasks are often designed as agents (e.g., virtual robots) navigating in an environment to answer a verbal question. For example, Das et al. (2018a) developed a synthetic dataset, where a virtual robot navigates the environment and gathers visual information from an egocentric view to answer a verbal question. Yu et al. (2019) extend this dataset and include questions with multiple targets, such as finding multiple objects through navigation. While some works have used embodied interactions to refer to comprehending referring expressions (Islam et al., 2022b;Chen et al., 2021), our work distinctly focuses on embodied interactions that encompass multimodal expressions, aligning with a more comprehensive understanding of human-AI communication.', '395d395', '< ']
