Title: EQA-MX: EMBODIED QUESTION ANSWERING USING MULTIMODAL EXPRESSION

Abstract:
Understanding human instructions, which often involve multimodal expressions (verbal and nonverbal gestures), is crucial for autonomous agents. While Visual Question Answering (VQA) and Embodied Question Answering (EQA) have advanced instruction comprehension, existing datasets primarily focus on verbal questions and lack the diversity of real-world multimodal interactions and perspectives. To address these limitations, we introduce EQA-MX, a novel large-scale dataset designed for Embodied Question Answering tasks that require reasoning over multimodal expressions. EQA-MX features 8 new EQA tasks, incorporating verbal utterances and nonverbal gestures from multiple visual and verbal perspectives, thereby reducing perspective bias and enhancing model generalizability. Furthermore, we propose VQ-Fusion, a vector quantization-based multimodal learning model that effectively aligns continuous visual and discrete verbal representations through shared codebooks, learning unified concepts across multiple views. Our extensive experimental analyses demonstrate that VQ-Fusion significantly improves the performance of state-of-the-art visual-language models on EQA tasks, achieving up to a 13% increase in accuracy. EQA-MX and VQ-Fusion provide a robust benchmark and methodology for developing more capable models for embodied human-AI interaction.

Section: INTRODUCTION
Understanding human instructions is crucial for autonomous agents to effectively collaborate with humans (Chen et al., 2021;Kratzer et al., 2020;Islam et al., 2022b;a). To develop models for instruction comprehension, several tasks have been designed, such as referring expression comprehension (Yang et al., 2019b;Yu et al., 2016;Kamath et al., 2021;Akula et al., 2021;Chen et al., 2020a), spatial relations grounding (Yang et al., 2019a;Viethen & Dale, 2008;Achlioptas et al., 2020;Liu et al., 2022), and visual question answering (Antol et al., 2015;Gao et al., 2015;Yu et al., 2015;Zhu et al., 2016;Krishna et al., 2017;Kafle et al., 2018;Gurari et al., 2018). Notably, VQA has gained significant attention due to its complex reasoning demands, such as answering questions about object presence and category using visual cues (Antol et al., 2015;Goyal et al., 2017;Lee et al., 2022).
While numerous VQA datasets exist, their exclusive focus on verbal questions overlooks the natural multimodal expressions (verbal utterances and nonverbal gestures) prevalent in human communication. This constitutes a crucial limitation for developing truly collaborative autonomous agents. Studies affirm that nonverbal gestures often provide complementary information for understanding verbal questions (McNeill, 2012;Corkum & Moore, 1998;Butterworth et al., 2002;Scaife & Bruner, 1975;Colonnesi et al., 2010;Iverson & Goldin-Meadow, 2005;Kita, 2003;Liszkowski et al., 2004;Chen et al., 2021). For example, in a scene with two differently colored balls, a pointing gesture can clarify questions like "what is the color of that ball?" The absence of nonverbal interactions in prior VQA datasets makes them less suitable for developing models to comprehend question-answering (QA) tasks in embodied settings.

Table 1: Comparison of the QA datasets. Existing VQA and EQA datasets do not contain nonverbal gestures (NV), multiple verbal (V) perspectives (MVP), or contrastive (C) and ambiguous (A) data samples. Exo, Ego, and Top are the visual views. ‡ Embodied (E) interactions refer to humans interacting using multimodal expressions. † Embodied interactions refer to an agent navigating in an environment. Please check the supplementary for a detailed comparison with other related datasets.

Dataset | V | NV | E | EQA | MVP | Exo | Ego | Top | C | A | No. of Images | No. of Samples
VQA (Antol et al., 2015) | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | 204k | 614k
KB-VQA (Wang et al., 2015) | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | 0.7k | 5k
FBQA (Wang et al., 2017) | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | 2k | 5k
VQA-MED (Hasan et al., 2018) | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | 2k | 6k
DocVQA (Mathew et al., 2021) | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | 12k | 50k
GRiD-3D (Lee et al., 2022) | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | 8k | 445k
VIMA (Jiang et al., 2022) | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | 650k | 650k
EQA † (Das et al., 2018a) | ✓ | ✗ | ✓ † | ✓ † | ✗ | ✗ | ✓ † | ✗ | ✗ | ✗ | 5k | 5k
MT-EQA † (Das et al., 2018a) | ✓ | ✗ | ✓ † | ✓ † | ✗ | ✗ | ✓ † | ✗ | ✗ | ✗ | 19k | 19k
EQA-MX ‡ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 750k | 8,243k
Following VQA, embodied question-answering (EQA) tasks have recently been studied in the literature (Yu et al., 2019;Luo et al., 2019;Gordon et al., 2018;Tan et al., 2020). EQA tasks typically fall into two categories: those where an agent navigates to answer verbal questions (Das et al., 2018a), and those involving multimodal human-environment interactions through verbal utterances and gestures (Chen et al., 2021;Islam et al., 2022a). We adopt the latter definition, designing EQA tasks that require comprehending questions posed with multimodal expressions in embodied settings. For instance, an EQA task might involve pointing to an object and asking "what is that object?", necessitating reasoning over both verbal and nonverbal cues.
A notable limitation of many existing VQA and EQA datasets is that verbal utterances are given from a single perspective (either speaker or observer), unlike real-world interactions where people use both perspectives interchangeably. For instance, a speaker's question, "What is the object to the right of the red mug?" could be interpreted as referring to the left of the red mug from an observer's perspective. This lack of multiple perspectives in existing datasets hinders the development of robust QA models.
Similarly, existing VQA and EQA models answer questions from a single visual perspective (Li et al., 2019;Kim et al., 2021;Lu et al., 2019). Multiple views provide complementary information, and varying camera angles capture interactions differently. Therefore, aligning these multiview visual representations before merging with verbal ones is crucial for developing generalized representations and robust comprehension across diverse perspectives. Moreover, the inconsistency of embedding structures, particularly continuous visual and discrete verbal representations, can lead to sub-optimal multimodal representations.
To address the shortcomings of existing VQA and EQA datasets, we have extended an embodied simulator to develop a large-scale novel dataset, EQA-MX, for comprehending EQA tasks (Table 9). Furthermore, we propose VQ-Fusion, a novel vector quantization (VQ)-based multimodal learning model designed to overcome limitations of existing fusion approaches. Its VQ-based bottleneck effectively disentangles continuous visual representations into discrete embeddings, enabling salient fusion with discrete verbal representations. By employing a shared codebook, VQ-Fusion aligns multiview representations and learns unified concepts across multiple visual perspectives. Our key contributions are:
• We developed a large-scale dataset (EQA-MX) with multimodal expressions from various verbal and visual perspectives to reduce perspective bias and enhance model generalizability.
• We designed 8 new EQA tasks blending multimodal questions (verbal and gestural) to be addressed using visual context in an embodied setting.
• We designed a VQ-based multimodal fusion method to align continuous visual and discrete verbal representations and extract salient representations across multiple visual and verbal perspectives.
• Our extensive experimental analyses indicate that our proposed model, VQ-Fusion, can improve performance on EQA tasks by up to 13%.

Section: RELATED WORK
Visual Question Answering: Numerous datasets have been developed to study visual question answering tasks (Gao et al., 2015;Zhu et al., 2016;Liu et al., 2019;Krishna et al., 2017;Wang et al., 2015;2017;Kembhavi et al., 2016;Kahou et al., 2017;Kafle et al., 2018;Gurari et al., 2018;Hasan et al., 2018;Huang et al., 2018;Andreas et al., 2016;Chou et al., 2020;Hudson & Manning, 2019;Mishra et al., 2019). These datasets primarily involve answering verbal questions using the visual scene as context. For example, Antol et al. (2015) developed a VQA dataset and introduced QA tasks involving an image and verbal questions about the image. This dataset contains both real-world images from the MS-COCO dataset (Lin et al., 2014) and synthetic virtual scenes containing clipart. Ren et al. (2015) generated synthetic QA pairs using an algorithm that converts image descriptions into QA form. Some recent datasets incorporate multimodal expressions (Schauerte & Fink, 2010;Islam et al., 2022b); for example, Chen et al. (2021) developed a dataset for referring expression comprehension tasks in embodied settings where a human uses multimodal expressions to refer to an object. However, these datasets typically do not cover the breadth of EQA tasks or the diverse perspectives addressed by EQA-MX.
Several visual-language (VL) models have been developed for VQA tasks and subsequently evaluated on these datasets (Radford et al., 2021;Lu et al., 2019;Tan & Bansal, 2019;Chen et al., 2020a). For example, Li et al. (2019) developed VisualBERT to answer a question using the visual context by learning multimodal representations from visual and verbal embeddings. Kim et al. (2021) designed a VL Transformer model (ViLT) with monolithic processing of visual inputs to learn VL representations without regional supervision of object detection.
Embodied Question Answering (EQA): EQA tasks are often designed as agents (e.g., virtual robots) navigating in an environment to answer a verbal question. For example, Das et al. (2018a) developed a synthetic dataset, where a virtual robot navigates the environment and gathers visual information from an egocentric view to answer a verbal question. Yu et al. (2019) extend this dataset and include questions with multiple targets, such as finding multiple objects through navigation. While some works have used embodied interactions to refer to comprehending referring expressions (Islam et al., 2022b;Chen et al., 2021), our work distinctly focuses on embodied interactions that encompass multimodal expressions, aligning with a more comprehensive understanding of human-AI communication.
Several models have been developed for existing EQA tasks. For instance, Das et al. (2018b) introduced a modular model for learning a policy to navigate and answer verbal questions, while Gao et al. (2021) utilized a transformer-based model to generate scene memory tokens as exploration clues. These models aim to develop a navigation policy for answering verbal questions.
Most current VQA and EQA studies focus on understanding solely verbal questions, in contrast to our goal of comprehensively understanding multimodal expressions (verbal utterances and gestures) in embodied settings. Moreover, existing models fuse disparate embedding structures (continuous visual and discrete verbal representations), potentially leading to sub-optimal VL representations.
We have created 8 novel EQA tasks: Existence Prediction (EP), Object Grounding (OG), Perspective-Aware Object Grounding (POG), Object Counting (OC), Object Attribute Query (OAQ), Object Attribute Compare (OAC), Perspective Grounding (PG), and Relation Grounding (RG). Similar tasks have been developed in prior works (Antol et al., 2015;Lee et al., 2022;Wang et al., 2017;Yu et al., 2016;Ren et al., 2015;Zhu et al., 2016); however, those tasks involve only verbal questions.
We are the first to design QA tasks in embodied settings where a human avatar asks questions using verbal utterances and nonverbal gestures in a virtual environment.
Each task has multiple sub-templates for variation (described further in the supplementary materials). In Fig. 2, we provide samples of these EQA tasks.
Existence Prediction (EP): The EP task involves determining whether the scene contains a particular object with specific attributes (e.g., color, location). Completing this task requires knowledge of object appearances as well as a holistic understanding of the scene.
Object Grounding (OG): In the OG task, the object category is determined based on the question by utilizing multimodal expressions. This task also involves understanding which perspective the question is asked from (i.e., speaker, observer, neutral).
Perspective-Aware Object Grounding (POG): Similar to the OG task, the POG task involves determining which object is being referred to. However, this task also includes the verbal perspective in the question (speaker, observer, neutral). This is done intentionally to determine how much of an impact verbal perspective has on performance.
Object Counting (OC): The OC task asks for the number of objects in a scene that satisfy a given spatial relation. Answering requires attending to the different objects in the visual scene and using the spatial relation given in the verbal question to determine which objects satisfy it before counting.
Object Attribute Query (OAQ): The OAQ task involves determining the color of a given object that is queried for, which can be helpful in scenarios where humans are interested in learning particular characteristics of an object. The spatial location and the color of the object must be determined using the given verbal and nonverbal expressions.
Object Attribute Compare (OAC): The OAC task entails comparing two objects' attributes, involving pointing to an object and querying their similarity in attributes.
Perspective Grounding (PG): Understanding human verbal perspective is crucial for effective human-AI communication, as humans describe objects from varying perspectives. We simulate this by employing three perspectives (neutral, speaker, and observer) and tasking the model with identifying the perspective of a given question.

Relation Grounding (RG): The RG task involves determining whether a verbal utterance and nonverbal gestures refer to the same object. As a referring expression can be interpreted differently from different visual and verbal perspectives, understanding the RG task requires complex reasoning about perspective and spatial relations.

Section: DATASET GENERATION WITH EQA SIMULATOR
In this work, we have extended the CAESAR simulator (Islam et al., 2022b) to generate data for different EQA tasks. CAESAR is used to randomly generate environments where an actor simulates nonverbal expressions through a pointing gesture and gaze in a scene (Fig. 3). Verbal expressions are created based on the visual scene. To increase the dataset's generalizability, we have used multiple environments. These environments differ in terms of camera views, object locations, and nonverbal/verbal expressions. In each visual scene, we generated four different situations: 1) a situation with no human and therefore no nonverbal expressions, 2) a situation with a human head gaze, 3) a situation with a human pointing gesture, and 4) a situation with a human using both a head gaze and a pointing gesture.
Generated nonverbal expressions consist of a pointing gesture and gaze. Pointing gestures are procedurally generated using inverse kinematics through the Unity engine. We create these pointing gestures by adding random noise to real-world data of human pointing gestures captured using an Optitrack motion capture system (opt). Similarly, we have simulated human head gazes using inverse kinematics with an object location within the scene as the target. Verbal questions are generated based on different templates for each EQA task. The nonverbal and verbal expressions may describe the same object, or be contrastive, meaning the nonverbal and verbal expressions describe different objects. We use these contrastive instructions for the Relation Grounding task. Additionally, the absence of nonverbal gestures in situations with no humans generates ambiguous data samples. Please check the supplementary document for additional details on the data generation process. We have generated a novel large-scale dataset, EQA-MX, containing 8,243,893 samples across the 8 tasks described in Sect. 3. The training, validation, and test set splits for each of these tasks are shown in Table 2. We removed some data samples to generate balanced dataset splits for the OAC, PG, and RG tasks.
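The template-based generation of aligned and contrastive samples described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the CAESAR implementation: the templates, the `make_sample` function, and the scene dictionary layout are all hypothetical, and the actual templates are described in the supplementary materials.

```python
import random

# Hypothetical templates for illustration only; the actual EQA-MX templates
# are described in the paper's supplementary materials.
TEMPLATES = {
    "EP": "Is there a {color} {category} in the scene?",
    "OAQ": "What is the color of that {category}?",
    "RG": "Am I pointing at the {color} {category}?",
}


def make_sample(task, objects, rng, contrastive=False):
    """Create one sample: a templated question plus the gesture target.

    In a contrastive sample, the gesture targets a different object than the
    one described verbally; otherwise both refer to the same object.
    """
    if contrastive and len(objects) < 2:
        raise ValueError("contrastive samples need at least two objects")
    verbal_target = rng.choice(objects)
    if contrastive:
        candidates = [o for o in objects if o["id"] != verbal_target["id"]]
        gesture_target = rng.choice(candidates)
    else:
        gesture_target = verbal_target
    return {
        "task": task,
        # str.format ignores unused keys such as "id"
        "question": TEMPLATES[task].format(**verbal_target),
        "verbal_target": verbal_target["id"],
        "gesture_target": gesture_target["id"],
        "contrastive": contrastive,
    }


# Hypothetical scene with three objects
scene = [
    {"id": 0, "color": "red", "category": "mug"},
    {"id": 1, "color": "blue", "category": "ball"},
    {"id": 2, "color": "green", "category": "lamp"},
]
rng = random.Random(0)
aligned = make_sample("RG", scene, rng)
contrastive = make_sample("RG", scene, rng, contrastive=True)
print(aligned["question"], aligned["verbal_target"] == aligned["gesture_target"])
print(contrastive["question"], contrastive["verbal_target"] == contrastive["gesture_target"])
```

In this sketch, the label for a Relation Grounding sample would simply be whether `verbal_target` equals `gesture_target`.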

Section: DATASET ANALYSIS
Our designed EQA tasks vary in terms of the goals (Fig. 2) and visual-verbal contextual information in the questions. This is made apparent by the variance in question lengths in words (Fig. 3(a)). Questions are as short as 6 words for the EP task and as long as 34 words for the OG task. Additionally, one of the main focuses of the EQA-MX dataset is to introduce data that varies in verbal and visual perspectives. Fig. 3(b) demonstrates the PG task's outcome for different verbal perspectives. Similarly, Fig. 3(c) shows the location of objects based on spatial relations in questions from verbal perspectives. Fig. 3(c) also demonstrates that objects referred to as on the left (blue) and on the right (red) are not linearly separable through the use of spatial relations, as different verbal perspectives use different relations to describe an object. For example, consider a speaker describing the red table lamp in Fig. 2. The speaker could state "the red lamp on the left". However, from the observer's perspective (exo view) the table lamp is on the right. Thus, given the verbal perspectives, spatial relations are non-separable in EQA-MX (Fig. 3(b)). These reduced verbal and visual perspective biases in the EQA-MX dataset can help train robust models for comprehending EQA tasks. Please check the supplementary materials for a more detailed data analysis.
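The table lamp example can be made concrete with a small sketch. It assumes the speaker and observer face each other, so that lateral relations mirror between the two perspectives; the `convert_relation` helper and its vocabulary are hypothetical and are not part of EQA-MX.

```python
# Hypothetical helper, not part of EQA-MX: mirrors lateral relations between
# a speaker and an observer who are ASSUMED to face each other.
MIRROR = {"left": "right", "right": "left"}
PERSPECTIVES = {"speaker", "observer"}


def convert_relation(relation, source, target):
    """Re-express a lateral relation stated from `source` in the `target` perspective."""
    if source not in PERSPECTIVES or target not in PERSPECTIVES:
        raise ValueError("perspective must be 'speaker' or 'observer'")
    if relation not in MIRROR:
        raise ValueError("only 'left' and 'right' are handled in this sketch")
    return relation if source == target else MIRROR[relation]


template = "the red lamp on the {}"
print(template.format("left"))                                            # speaker's wording
print(template.format(convert_relation("left", "speaker", "observer")))  # observer's wording
```

Because the same object is "left" in one perspective and "right" in the other, a model cannot map the word alone to a screen location without also inferring the verbal perspective.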

Section: VQ-FUSION: VQ-BASED MULTIMODAL FUSION
We develop a vector quantization-based multimodal fusion approach, VQ-Fusion, to learn visual-language representations. As EQA tasks in EQA-MX involve multiple visual views, VQ-Fusion extracts visual representations from multiple visual views ($X_{ego}$, $X_{exo}$, and $X_{top}$) and verbal questions ($X_q$) for different EQA tasks (Fig. 4 and Sect. 3). Following the existing adapter-based learning models (Beck et al., 2022;Ansell et al., 2021;Rücklé et al., 2021;Pfeiffer et al., 2020a;b;2021), we design VQ-Fusion as an adapter model that can be used in existing models without significantly changing the existing model architecture.
Visual and Language Representation Learning: First, VQ-Fusion extracts visual and language representations using a state-of-the-art visual encoder (e.g., ResNet (He et al., 2016) or ViT (Dosovitskiy et al., 2020)) and a language model (e.g., BERT (Devlin et al., 2018)). VQ-Fusion uses shared models to extract the visual representations from multiple views independently:
$$E_m = F_m(X_m), \quad m \in (\text{ego}, \text{exo}, \text{top}, \text{verbal}).$$
Here, $F_m$ is the visual or verbal encoder, $E_m \in \mathbb{R}^{D_m}$, and $D_m$ is the representation dimension of modality $m$.

Discretization and Multimodal Fusion: Language models create discretized representations, whereas visual encoders produce continuous representations of visual scenes. Fusing these representations with different embedding structures can lead to sub-optimal multimodal representations (Liang et al., 2022). For this reason, we discretize the visual representations before fusion.
In VQ-Fusion, we adopted the vector quantization (VQ) method from VQ-VAE (Van Den Oord et al., 2017) and Discrete-Value Neural Communication (Liu et al., 2021) to discretize the multiview visual representations $E_m \in (E_{ego}, E_{exo}, E_{top})$.
Previous works use VQ to discretize a representation using codebooks, whereas we use shared codebooks to discretize and align multiview representations to learn unified concepts across visual views for extracting salient multimodal representations.
First, VQ-Fusion divides each $E_m$ into $G$ continuous segments $(s_{(m,1)}, s_{(m,2)}, \dots, s_{(m,G)})$, where $E_m = \mathrm{CONCAT}(s_{(m,1)}, s_{(m,2)}, \dots, s_{(m,G)})$ and $s_{(m,i)} \in \mathbb{R}^{D_m/G}$. Second, VQ-Fusion independently maps each continuous segment $s_{(m,i)}$ to a discrete latent code $c_j \in \mathbb{R}^{D_m/G}$ from the shared codebooks $C \in \mathbb{R}^{L \times (D_m/G)}$, where $L$ is the codebook size (i.e., the number of categorical codes in each codebook). The optimal code for each continuous segment $s_{(m,i)}$ is the nearest code in the codebooks:
$$e_{(m,o_i)} = F_D(s_{(m,i)}), \quad o_i = \arg\min_{j \in \{1, \dots, L\}} \lVert s_{(m,i)} - c_j \rVert.$$
Here, $F_D$ is the discretization (D) method. Finally, we concatenate the discretized codes to produce the discretized visual representation $E^D_m$:
$$E^D_m = \mathrm{CONCAT}(F_D(s_{(m,1)}), \dots, F_D(s_{(m,G)})).$$
Following the training procedure in Liu et al. (2021) and Van Den Oord et al. (2017), we calculate the VQ loss for learning the codebooks:
$$\mathcal{L}_{VQ\_align} = \frac{\beta}{G} \sum_{i=1}^{G} \lVert s_i - \mathrm{sg}(c_{o_i}) \rVert_2^2.$$
Here, sg is the stop-gradient operator blocking gradients to $c_{o_i}$, and $\beta$ is a hyperparameter controlling reluctance to change the code. We train the discretization module to learn codebooks using gradient descent with the other parts of VQ-Fusion. As VQ-Fusion employs shared codebooks to discretize visual representations for multiple views, the $\mathcal{L}_{VQ\_align}$ loss aids in aligning multiview representations and learning unified concepts across views. This shared-codebook approach aligns multiview representations so that questions with multimodal expressions can be answered effectively.
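The segment-wise nearest-code lookup and the VQ loss can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: it uses a single codebook shared by all segments and all views, the function names are hypothetical, `beta=0.25` is an assumed value, and only the forward value of the loss is computed (in an autodiff framework the selected code would be wrapped in a stop-gradient).

```python
def split_segments(embedding, num_segments):
    """Split an embedding (list of floats) into G equal, contiguous segments."""
    if len(embedding) % num_segments != 0:
        raise ValueError("embedding dimension must be divisible by num_segments")
    size = len(embedding) // num_segments
    return [embedding[i * size:(i + 1) * size] for i in range(num_segments)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embedding, codebook, num_segments):
    """Replace each segment by its nearest code (arg min over the codebook).

    Returns the discretized embedding, the selected code indices, and the
    summed squared distance between segments and their selected codes.
    """
    indices, discretized, distance_sum = [], [], 0.0
    for segment in split_segments(embedding, num_segments):
        nearest = min(range(len(codebook)),
                      key=lambda j: squared_distance(segment, codebook[j]))
        indices.append(nearest)
        discretized.extend(codebook[nearest])
        distance_sum += squared_distance(segment, codebook[nearest])
    return discretized, indices, distance_sum


def vq_align_loss(embedding, codebook, num_segments, beta=0.25):
    """Forward value of (beta / G) * sum_i ||s_i - sg(c_{o_i})||^2 (beta is assumed)."""
    _, _, distance_sum = quantize(embedding, codebook, num_segments)
    return beta / num_segments * distance_sum


# Toy shared codebook with L = 2 codes of dimension D/G = 2
codebook = [[0.0, 0.0], [1.0, 1.0]]
# Hypothetical embeddings of the same scene from two views (D = 4, G = 2)
views = {"ego": [0.1, -0.1, 0.9, 1.2], "exo": [0.2, 0.0, 1.1, 0.8]}
for name, embedding in views.items():
    discretized, indices, _ = quantize(embedding, codebook, num_segments=2)
    print(name, indices, discretized, vq_align_loss(embedding, codebook, 2))
```

In this toy example both views select the same code indices, which is the sense in which a shared codebook can map different views of one scene onto common discrete concepts.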
Finally, VQ-Fusion fuses these discretized visual and verbal representations using a self-attention approach to produce the task representation:
$$E_{fused} = \sum_{m \in M} \alpha_m E_m, \quad \alpha_m = \frac{\exp(\gamma_m)}{\sum_{m' \in M} \exp(\gamma_{m'})}, \quad \gamma_m = W^T E_m, \quad m \in M.$$
Here, $M$ is the modality list (ego, exo, top, verbal), $W$ is a learnable parameter, and $\alpha_m$ is the attention score, which is calculated using a 1D-CNN with a filter size of 1.
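The attention-weighted fusion can be sketched as below. The sketch assumes that all modality embeddings share one dimension and that the learnable parameter is a single weight vector `w`; the `fuse` function and the toy values are hypothetical and stand in for the learned 1D-CNN scoring described above.

```python
import math


def fuse(embeddings, w):
    """Softmax-attention fusion over modalities.

    embeddings: dict mapping modality name -> list of floats (same length as w).
    w: weight vector (assumed stand-in for the learnable parameter W).
    Returns the fused embedding and the attention scores.
    """
    # gamma_m = W^T E_m
    gammas = {m: sum(wi * ei for wi, ei in zip(w, e)) for m, e in embeddings.items()}
    # Numerically stable softmax over modalities
    g_max = max(gammas.values())
    exps = {m: math.exp(g - g_max) for m, g in gammas.items()}
    normalizer = sum(exps.values())
    alphas = {m: v / normalizer for m, v in exps.items()}
    # E_fused = sum_m alpha_m * E_m
    fused = [sum(alphas[m] * embeddings[m][d] for m in embeddings)
             for d in range(len(w))]
    return fused, alphas


# Hypothetical 2-dimensional embeddings for the four modalities
embeddings = {
    "ego": [1.0, 0.0],
    "exo": [0.0, 1.0],
    "top": [1.0, 1.0],
    "verbal": [2.0, 2.0],
}
fused, alphas = fuse(embeddings, w=[1.0, 0.0])
print(alphas)
print(fused)
```

With an all-zero `w`, every modality receives the same score and the fusion reduces to a plain average; a trained `w` lets the model weight views differently per sample.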
Task Learning: We use the fused representation, $E_{fused}$, to learn different EQA tasks $T_k$:
$$y_{T_k} = F_{T_k}(E_{fused}).$$
Here, $F_{T_k}$ is the task learning module, which can be designed based on the EQA task properties. For example, we use a multi-layer perceptron for the object existence task (Sect. 3; please check the supplementary for further details). Moreover, the cross-entropy loss $\mathcal{L}_{task,T_k}$ is used to train the model for task $T_k$:
$$\mathcal{L}_{task,T_k}(y_{T_k}, \hat{y}_{T_k}) = -\frac{1}{B} \sum_{i=1}^{B} y_{(T_k,i)} \log \hat{y}_{(T_k,i)},$$
where $B$ is the batch size. Finally, we combine the task learning loss ($\mathcal{L}_{task,T_k}$) with the VQ loss ($\mathcal{L}_{VQ\_align}$) using task learning weights ($W_{VQ}$ and $W_{task}$) to train the VQ-Fusion model:
$$\mathcal{L} = W_{VQ} \mathcal{L}_{VQ\_align} + W_{task} \mathcal{L}_{task,T_k}.$$
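As a worked illustration of the combined objective, the following sketch computes a batch cross-entropy in its standard negative log-likelihood form and adds a weighted VQ loss. The function names, the weight values, and the toy batch are assumptions for illustration; the paper does not report them here.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over a batch.

    y_true: list of one-hot label rows; y_pred: list of probability rows.
    eps guards against log(0) and is an implementation detail of this sketch.
    """
    batch_size = len(y_true)
    total = 0.0
    for label_row, prob_row in zip(y_true, y_pred):
        total += sum(t * math.log(p + eps) for t, p in zip(label_row, prob_row))
    return -total / batch_size


def total_loss(vq_loss, task_loss, w_vq=1.0, w_task=1.0):
    """L = W_VQ * L_VQ_align + W_task * L_task (weights here are assumed values)."""
    return w_vq * vq_loss + w_task * task_loss


# Toy binary task (e.g., yes/no existence) with a batch of two samples
y_true = [[1, 0], [0, 1]]
y_pred = [[0.5, 0.5], [0.5, 0.5]]
task = cross_entropy(y_true, y_pred)  # equals log(2) for uniform predictions
print(task)
print(total_loss(vq_loss=0.00875, task_loss=task, w_vq=0.5, w_task=1.0))
```

The uniform-prediction case is a useful sanity check: an untrained binary classifier should start near log(2), and the VQ term only shifts the total by its weighted value.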
Variations of VQ-Fusion: VQ-Fusion also allows existing VL models (e.g., VisualBERT (Li et al., 2019) and ViLT (Kim et al., 2021)) to be used to extract these representations. As the architecture of these VL transformer models is limited to processing a single visual and verbal input, we pair the verbal question with each visual view and pass each pair through these models to extract multiview visual and verbal representations. VQ-Fusion then discretizes and fuses these representations to produce multimodal representations. Please check the supplementary materials for further details.

Section: EXPERIMENTAL ANALYSIS
In this section, we present experimental analyses on our EQA-MX dataset to evaluate the impact of VQ-Fusion in VL models for EQA tasks. We have included additional ablation studies and experimental analyses for another task in the supplementary to evaluate the significance of VQ-Fusion for multimodal representation learning.
Baseline Models: Existing visual-language (VL) models for QA tasks are designed to answer a question using a single visual context. Since our proposed EQA tasks involve three visual views, we extend four VL models to learn multiview representations: Dual-Encoder (ViT+BERT) (Dosovitskiy et al., 2020;Devlin et al., 2018), CLIP (Radford et al., 2021), VisualBERT (Li et al., 2019), and ViLT (Kim et al., 2021). For the Dual-Encoder (ViT+BERT) model, we independently extract visual representations for each view using a shared ViT model and verbal representations using a BERT model. We fuse these visual and verbal representations to produce task representations. For the CLIP models, we pair each visual view to a verbal question and pass this through the model to extract multiple visual and verbal representations and fuse them to produce task representations. For VisualBERT and ViLT, we use ResNet-101 (He et al., 2016) to extract visual representations that are passed through the model with verbal embeddings to produce task representations. Please check the supplementary materials for further details.

Section: COMPARISON OF MULTIMODAL LEARNING MODELS
We evaluated state-of-the-art visual-language (VL) models with and without our VQ-Fusion to learn VL representations for 8 EQA tasks. We varied the number of codebooks over {2, 4, 8, 16} in VQ for each task and reported the best performance. We trained and evaluated these models independently for each task as a single-task model on our EQA-MX dataset. We used data samples with varying nonverbal gestures: gaze and pointing gestures, only gaze, and only pointing gestures. All the visual views (ego, exo, and top) and verbal perspectives (speaker, observer, and neutral) are used to train models and evaluate whether the models can learn generalized representations from diverse data. We report macro-accuracy for all tasks to gauge whether models can effectively understand EQA tasks and are not biased toward a particular class (Table 3).
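To illustrate why macro-accuracy exposes class bias, the sketch below computes it as the unweighted mean of per-class accuracies. This definition is an assumption, since the metric is not defined formally here, and the function name and toy labels are hypothetical.

```python
from collections import defaultdict


def macro_accuracy(y_true, y_pred):
    """Unweighted mean of per-class accuracies (assumed definition of macro-accuracy)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for label, prediction in zip(y_true, y_pred):
        total[label] += 1
        correct[label] += int(label == prediction)
    return sum(correct[c] / total[c] for c in total) / len(total)


# A model that always answers "yes" on an imbalanced toy set
y_true = ["yes"] * 8 + ["no"] * 2
y_pred = ["yes"] * 10
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain)                           # plain accuracy rewards the majority class
print(macro_accuracy(y_true, y_pred))  # macro-accuracy penalizes the ignored class
```

Under this definition, a model that ignores a minority class cannot score well, regardless of how frequent the majority class is.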

Results: The results in Table 3 suggest that incorporating VQ-Fusion in VL models helps to successfully fuse extracted salient multiview representations with verbal representations, and thus improves model performance on EQA tasks. For example, the CLIP model without VQ-Fusion achieves 54.06% accuracy in the object grounding task (OG), whereas incorporating VQ-Fusion in the CLIP model increases the OG task's performance to 65.49%. Similarly, VQ-Fusion improved the CLIP model's performance on the object attribute query task (OAQ) by 12%, the VisualBERT model's performance on the perspective grounding task by 12.74%, the ViLT model's performance on the object attribute comparison (OAC) task by 3.5%, and the DualEncoder model's performance on the relation grounding task (RG) by 13.58%. These performance improvements validate the significance of VQ-Fusion in extracting salient multimodal representations from multiple visual and verbal perspectives for effectively learning EQA tasks.
Discussion: The primary reasoning behind the performance improvement by incorporating VQ-Fusion in VL models lies in its discretization of multiview representations before fusion with discrete verbal representations. VQ-Fusion uses codebooks to discretize and align the visual representations with the discrete structure of verbal representations. Conversely, existing VL models extract continuous monolithic visual representations and fuse them with discrete verbal representations. This structural mismatch leads to sub-optimal multimodal fusion, adversely affecting the extraction of salient task representations and subsequently degrading task performance.
Moreover, as VQ-Fusion uses shared codebooks in the VQ information bottleneck to learn multimodal representations, this codebook sharing enables models to align the multiview representations and learn unified concepts. Learning unified concepts from multiple views is crucial, as multiple views capture the same interaction. Existing models are designed to learn visual and language representations from a single visual perspective. Thus, these models do not have any mechanisms to extract unified concepts from multiple visual views. VQ-Fusion enables these models to learn this unified concept using shared codebooks-based VQ.
Our experimental results also indicate that incorporating additional perspective-related information can help models to successfully ground objects. This is made apparent by the model performance on the perspective-aware object grounding (POG) task being consistently higher than the model performance on the object grounding (OG) task. This is particularly notable as the only difference between these tasks is the presence of the question's verbal perspective (Fig. 2). Thus, these results suggest models need to understand verbal perspective for successfully grounding objects in situations with multiple verbal perspectives.
Although the VL models presented can achieve considerable performance for most of the EQA tasks, these models perform only slightly better than random guessing on the object counting (OC) task. As these models do not use object location-specific information, they struggle to locate and count objects given a spatial relation. One possible extension to improve performance on the OC task is to add learning mechanisms that push VL models to learn object locations. The EQA-MX dataset contains rich annotations of object locations, which can easily be incorporated in developing models more capable of understanding spatial locations.

Section: IMPACT OF NONVERBAL GESTURES (ABLATION STUDY)
We evaluated the impact of nonverbal gestures on learning EQA tasks. We evaluated VQ-Fusion with CLIP models and 8 codebooks on the different splits of the EQA-MX dataset: data samples with gaze and gestures, only gaze, only gestures, and without gaze and gestures (this data split contains visual scenes without a human).

Results and Discussion: The results in Table 4 suggest that the model performs worse on EQA tasks if we train it using data without nonverbal gestures. For example, the model trained using data without nonverbal gestures achieved only 26.65% accuracy for the object grounding (OG) task, whereas the model trained using data with gaze and pointing gestures achieved 68.61% accuracy for the OG task. The same trend holds for all other tasks: performance improved when gaze and/or pointing gestures were incorporated compared to when the model relied only on the verbal message. The performance degradation indicates that the models must learn nonverbal gestures to answer questions with multimodal expressions for EQA tasks.

Section: IMPACT OF VQ CODEBOOKS (ABLATION STUDY)
We evaluated VQ-Fusion with the CLIP model on the 8 EQA tasks while varying the number of codebooks in VQ: {2, 4, 8, 16}. We evaluated these models on EQA-MX with varied nonverbal gestures (gaze and pointing gestures, only gaze, and only pointing gestures) and trained them with multiple visual and verbal perspectives.
Results and Discussion: The results in Table 5 suggest that different numbers of codebooks yield the highest performance for different tasks. For example, VQ-Fusion with 8 codebooks achieves the highest performance in the existence prediction (EP), object grounding (OG), and object attribute compare (OAC) tasks, whereas VQ-Fusion with 2 codebooks achieves the highest performance for the perspective-aware object grounding (POG) and object counting (OC) tasks. The suitable number of codebooks depends on task complexity, i.e., on how many concepts need to be learned. As the OG task requires learning verbal perspective, the model requires more codebooks to learn perspective-related concepts. On the other hand, as perspective is already given in the POG task, VQ-Fusion requires fewer codebooks. Our results are consistent with this: VQ-Fusion achieves 82.70% accuracy for the POG task with only 2 codebooks, whereas it achieves 65.49% accuracy for the OG task with 8 codebooks.
However, increasing the number of codebooks beyond the optimum decreases task performance. For example, the OAC task accuracy degrades if we increase the number of codebooks to more than 4. As the OAC task asks whether two objects have the same attribute, the model can learn these simple concepts using fewer codebooks. Increasing the number may lead to sparsity in codebooks, i.e., many codes are left unutilized, hindering models from extracting salient representations. On the other hand, using too few codebooks for complex tasks, such as OG and OAQ, leads to tight bottlenecks, which deter models from learning salient concepts and result in lower task performance. These results indicate that each task has a different optimal number of codebooks.
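The trade-off between too many codebooks (unused codes) and too few (a tight bottleneck) can be made concrete with a small sketch. This is an illustration under stated assumptions, not the VQ-Fusion implementation: we assume the embedding is split into equal segments, one per codebook, and that each segment is replaced by its nearest code under squared Euclidean distance. The names `nearest_code`, `quantize`, `code_utilization`, and the toy codebooks are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Codebook = List[Vector]  # one codebook = a list of code vectors


def nearest_code(segment: Vector, codebook: Codebook) -> int:
    """Return the index of the code closest to `segment` (squared L2 distance)."""
    def sq_dist(code: Vector) -> float:
        return sum((s - c) ** 2 for s, c in zip(segment, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def quantize(embedding: Vector, codebooks: List[Codebook]) -> Tuple[List[int], List[float]]:
    """Split `embedding` into len(codebooks) equal segments and snap each
    segment to its nearest code. Returns (code indices, quantized vector)."""
    n = len(codebooks)
    if len(embedding) % n != 0:
        raise ValueError("embedding size must be divisible by the number of codebooks")
    seg_len = len(embedding) // n
    indices: List[int] = []
    quantized: List[float] = []
    for k, codebook in enumerate(codebooks):
        segment = embedding[k * seg_len:(k + 1) * seg_len]
        idx = nearest_code(segment, codebook)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized


def code_utilization(all_indices: List[List[int]], codebooks: List[Codebook]) -> List[float]:
    """Fraction of codes in each codebook that were selected at least once."""
    used = [set() for _ in codebooks]
    for indices in all_indices:
        for k, idx in enumerate(indices):
            used[k].add(idx)
    return [len(u) / len(cb) for u, cb in zip(used, codebooks)]


# Toy example (assumed values): 4-dimensional embeddings, 2 codebooks, 3 codes each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]],
    [[0.0, 1.0], [1.0, 0.0], [9.0, 9.0]],
]
batch = [[0.1, 0.2, 0.1, 0.9], [0.9, 1.1, 0.8, 0.1], [0.0, 0.1, 0.9, 0.2]]
assignments = [quantize(e, codebooks)[0] for e in batch]
utilization = code_utilization(assignments, codebooks)
```

In this toy batch the third code of each codebook is never selected, so utilization stays below 1. A low utilization of this kind is one simple way to observe the codebook sparsity discussed above.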

Section: CONCLUSION
To develop models for comprehending embodied interactions, we designed 8 novel EQA tasks requiring comprehension of questions with multimodal expressions (verbal and nonverbal gestures). To train and diagnose models for these EQA tasks, we developed a novel large-scale dataset, EQA-MX, which contains questions with multimodal expressions from multiple verbal and visual perspectives. Moreover, we developed a vector quantization-based multimodal representation learning model, VQ-Fusion, to learn salient multimodal representations from multiple visual and verbal perspectives. Our extensive experimental analyses suggest that VQ-Fusion can effectively fuse continuous multiview visual and discrete verbal representations, which improves the performance of visual-language models on all EQA tasks by up to 13%.

Section: TECHNICAL APPENDIX A RESOURCES
The EQA-MX dataset, source code for the CAESAR simulator with our modifications, benchmark learning models, trained model checkpoints, and docker for computing environment can be accessed through the following links. We will publicly release these resources with the camera-ready version of our paper. For double-blind reviewing purposes, we are sharing these resources anonymously with the reviewers:
• EQA-MX dataset (162 GB): https://bit.ly/eqa-mx-dataset
We built a Docker image to make it easy to reproduce our experimental settings and training environment. We cannot currently share the Docker Hub link without compromising anonymity, and we plan to share it upon publication of the paper. In the meantime, we are sharing the Singularity container built from the same Docker image we used for our experiments: https://bit.ly/multimodal-docker

Section: B BROADER IMPACT
Our dataset contains rich annotations of visual scenes, such as object locations, spatial relations, and multiple visual and verbal perspectives. These can be used to design new tasks to robustly comprehend embodied interactions. Moreover, our EQA-MX dataset can be used for diverse tasks in embodied settings, such as scene segmentation and conversational human-AI interactions with multimodal expressions. Additionally, our dataset can be used to develop and evaluate models that can be transferred to robots for comprehending embodied human instructions in real-world settings. Lastly, our experimental analysis provides valuable insights that can be used in designing robust VL models, such as using similar embedding structures for fusing continuous and discrete representations leading to performance improvements.

Section: C ADDITIONAL EXPERIMENTAL ANALYSES C.1 IMPACT OF MULTIPLE VISUAL PERSPECTIVES AND MODALITIES
In real-world settings, robots are typically equipped with multiple camera views. Several studies have emphasized the significance of multiview data in accurately comprehending human actions and instructions (Kong et al., 2019;Islam & Iqbal, 2022). To further validate the importance of multimodal data (nonverbal gestures captured through visual views and verbal utterances) in understanding embodied question answering (EQA) tasks, we conducted extensive ablation studies with varying visual views (ego, exo and top) and verbal utterances (verbal utterance templates described in Table 11).
In the first setting, we used only verbal utterances for all eight EQA tasks (Table 6: Top). We used BERT (Devlin et al., 2018) for learning the EQA tasks. The results suggest that models using only the verbal modality cannot effectively learn these EQA tasks. Conversely, when we utilized both verbal and nonverbal data, the performance on these EQA tasks improved (Table 6). The degraded performance with only verbal data emphasizes the importance of utilizing both verbal and nonverbal data modalities for appropriately learning EQA tasks. It also indicates that our proposed EQA-MX dataset is less biased towards verbal data for comprehending EQA tasks.
In the second setting, we used verbal utterances and nonverbal gestures to learn EQA tasks. We varied the visual perspectives during training and testing by using different camera views (ego, exo, and top) to capture the nonverbal interactions. We used the CLIP model to learn EQA tasks involving verbal utterances and visual views. The results suggest that models trained using multiple visual perspectives perform better than models trained using a single visual perspective (Table 6: Bottom). The reason for this improvement is that models using multiple visual views can learn generalized multiview representations, which improve performance at inference time when visual views are varied.

Section: C.2 COMPARISON OF SINGLE AND MULTITASK MODELS
We evaluated the impact of learning multiple tasks in a visual-language model. We conducted this experimental analysis in two settings. In both settings, we used verbal utterances and multiple visual modalities to learn EQA tasks. In the first setting, we trained CLIP models for each EQA task separately. In the second setting, we trained CLIP models jointly on subsets of EQA tasks. Finally, the extracted representation is fed to each EQA task head, where each head is an MLP.
The results in Table 7 suggest that the performance of models learning multiple tasks degrades compared to models learning these tasks separately. As these tasks have different characteristics, learning them together makes them compete in the representation learning space, which degrades their performance. For example, training the CLIP model for the Existence Prediction (EP) and Object Grounding (OG) tasks together degrades the OG task accuracy to 40.76%, compared to 65.49% for a CLIP model trained separately for the OG task. Previous studies have observed similar performance degradation when learning multiple competing tasks. The primary reason is that competing tasks have conflicting gradients, which introduce negative knowledge transfer and thus degrade performance. Thus, an exciting future research direction would be to design novel multitask model architectures and training approaches where training on multiple tasks using multiple modalities improves the performance of every task in a shared model.
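The notion of conflicting gradients can be illustrated with a short sketch. A common diagnostic, assumed here rather than taken from our experiments, is the cosine similarity between the gradients of two task losses with respect to the shared parameters: a negative value means that a step that helps one task hurts the other. The gradient vectors and the function names below are hypothetical toy values for illustration only.

```python
import math
from typing import Sequence


def cosine_similarity(g1: Sequence[float], g2: Sequence[float]) -> float:
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # a zero gradient neither helps nor conflicts
    return dot / (norm1 * norm2)


def gradients_conflict(g1: Sequence[float], g2: Sequence[float]) -> bool:
    """Two task gradients conflict when they point in opposing directions."""
    return cosine_similarity(g1, g2) < 0.0


# Toy gradients on three shared parameters (assumed values).
grad_task_a = [1.0, 2.0, -1.0]
grad_task_b = [-1.0, -1.5, 0.5]   # mostly opposes task A
grad_task_c = [0.5, 1.0, 0.0]     # mostly agrees with task A

conflict_ab = gradients_conflict(grad_task_a, grad_task_b)
conflict_ac = gradients_conflict(grad_task_a, grad_task_c)
```

Tracking this quantity between task pairs during joint training is one way to check whether a drop in multitask accuracy coincides with opposing updates to the shared encoder.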

Section: C.3 GENERALIZABILITY OF VQ-FUSION
To evaluate the generalizability of VQ-Fusion to another task involving multimodal representation learning, we incorporated VQ-Fusion into an existing multimodal learning model, HAMLET (Islam & Iqbal, 2020), for human activity recognition with multimodal sensor data (RGB videos, acceleration, gyroscope, and orientation). We evaluated this model on the MMAct dataset (Kong et al., 2019). In our experimental analyses, we adhered to the original session-based evaluation settings and reported the F1-score. We used eight codebooks to discretize the multimodal representations.
Table 7: We train CLIP models with VQ-Fusion in single-task (ST) and multitask (MT) settings and report the accuracy on these tasks. Tasks trained in an MT setting are grouped together. The results suggest that the performance of these models with multiple tasks degrades compared to models learning these tasks separately. Abbreviations: Existence Prediction (EP), Object Grounding (OG), Perspective-Aware Object Grounding (POG), Object Counting (OC), Object Attribute Query (OAQ), Object Attribute Compare (OAC), Perspective Grounding (PG), Relation Grounding (RG).

The results indicate that the HAMLET model with our proposed VQ-Fusion approach outperformed all existing state-of-the-art multimodal human activity recognition (HAR) approaches in session-based evaluation settings on the MMAct dataset (Table 8). Specifically, the inclusion of VQ-Fusion improved HAMLET's F1-score from 83.89% to 87.69%, a gain of 3.80 points that exceeds the previous best result of 87.50% obtained by MuMu (Table 8). These findings suggest that VQ-Fusion can effectively aid existing models in extracting salient multimodal representations, thereby enhancing the performance of downstream tasks in the field of HAR.

Section: D TRAINING ENVIRONMENT
Table 8: F1-score (%) of multimodal HAR approaches on the MMAct dataset in the session-based evaluation setting.
Method | F1-score (%)
(Ofli et al., 2013) | 46.52
TSN (RGB) (Wang et al., 2016) | 69.20
TSN (Optical-Flow) (Wang et al., 2016) | 72.57
MMAD (Kong et al., 2019) | 74.58
TSN (Fusion) (Wang et al., 2016) | 77.09
MMAD (Fusion) (Kong et al., 2019) | 78.82
Keyless (Long et al., 2018) | 81.11
HAMLET (Islam & Iqbal, 2020) | 83.89
MuMu (Islam & Iqbal, 2022) | 87.50
VQ-Fusion(HAMLET) | 87.69
VQ-Fusion(MuMu) | 87.83

Table 9: Feature comparison of EQA-MX with existing datasets.
PointAt (Schauerte et al., 2010) | ✗ ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
ReferAt (Schauerte & Fink, 2010) | ✓ ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
IPO (Shukla et al., 2015) | ✗ ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
IMHF (Shukla et al., 2016) | ✗ ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
RefIt (Kazemzadeh et al., 2014) | ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
RefCOCO (Yu et al., 2016) | ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
RefCOCO+ (Yu et al., 2016) | ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
RefCOCOg (Mao et al., 2016) | ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
Flickr30k (Plummer et al., 2015) | ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
GuessWhat? (De Vries et al., 2017) | ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
Cops-Ref (Chen et al., 2020b) | ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
CLEVR-Ref+ (Liu et al., 2019) | ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
DAQUAR (Malinowski et al., 2017) | ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
FM-IQA (Gao et al., 2015) | ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
Visual Madlibs (Yu et al., 2015) | ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
Visual Genome (Krishna et al., 2017) | ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
DVQA (Kafle et al., 2018) | ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
VQA (COCO) (Antol et al., 2015) | ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
VQA (Abs.) (Antol et al., 2015) | ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
Visual 7W (Zhu et al., 2016) | ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
KB-VQA (Wang et al., 2015) | ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
FBQA (Wang et al., 2017) | ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
VQA-MED (Hasan et al., 2018) | ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗
DocVQA (Mathew et al., 2021) | ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
YouRefIt (Chen et al., 2021) | ✓ ✓ ✓ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
GRiD-3D (Lee et al., 2022) | ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
EQA † (Das et al., 2018a) | ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
MT-EQA † (Das et al., 2018a) | ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗
CAESAR-L (Islam et al., 2022b) | ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✓ ✓
CAESAR-XL (Islam et al., 2022b) | ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✓ ✓
EQA-MX ‡ | ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓

We developed all the models using the PyTorch (version: 1.12.1+cu113) (Paszke et al., 2019) and PyTorch Lightning (version: 1.7.1) (Falcon, 2019) deep learning frameworks. We also used the HuggingFace library (version: 4.21.1) for pre-trained models (BERT (Devlin et al., 2018), ViT (Dosovitskiy et al., 2020), VisualBERT (Li et al., 2019), Dual Encoder, ViLT (Kim et al., 2021), and CLIP (Radford et al., 2021)). For the Dual-Encoder and CLIP models, we used an embedding size of 512, and for VisualBERT and ViLT, we used an embedding size of 768. We trained models using the Adam optimizer with weight decay regularization (Loshchilov & Hutter, 2017) and cosine annealing warm restarts, with an initial learning rate of 3e-4, a cycle length (T_0) of 4, and a cycle multiplier (T_mult) of 2. We used a batch size of 128 and trained models for 8 epochs. We used the same fixed random seed (33) for all the experiments to ensure reproducibility. Lastly, all models were trained on distributed GPU clusters, where each node contains 8 A100 GPUs.
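To show what the stated schedule implies over the 8 training epochs, the sketch below evaluates the standard cosine-annealing-with-warm-restarts formula using the hyperparameters reported above (initial learning rate 3e-4, T_0 = 4, T_mult = 2). It is a stand-alone illustration, not our training code: it assumes a minimum learning rate of 0 and one scheduler step per epoch, neither of which is stated in the text, and the function name `warm_restart_lr` is hypothetical. In PyTorch, a comparable setup would typically combine `torch.optim.AdamW` with `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`.

```python
import math
from typing import List


def warm_restart_lr(epoch: int, base_lr: float = 3e-4, t_0: int = 4,
                    t_mult: int = 2, eta_min: float = 0.0) -> float:
    """Learning rate at the start of `epoch` under cosine annealing with
    warm restarts: each cycle decays from base_lr towards eta_min, and
    each new cycle is t_mult times longer than the previous one."""
    t_cur, t_i = epoch, t_0
    while t_cur >= t_i:      # locate the cycle that contains this epoch
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))


# Learning rate for each of the 8 training epochs (epochs are 0-indexed).
schedule: List[float] = [warm_restart_lr(epoch) for epoch in range(8)]
```

Under these assumptions, the first cycle covers epochs 0 to 3 and the learning rate restarts at its initial value at epoch 4, so an 8-epoch run ends partway through the second, longer cycle.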
All 8 EQA tasks described in the paper are designed around specific scenarios that would benefit models in real-world human interaction.
Existence Prediction (EP): Naturally, humans are able to determine what objects are present in a given scene. In scenarios where humans are interacting and an actor mistakenly references an object not in the scene, this allows observers to request more information. Created to mimic this situation, the existence prediction task involves determining whether the scene contains a particular object with some specific attributes, such as color.
Object Grounding (OG): Understanding which objects a human refers to using verbal and nonverbal cues is key in successful human-AI interaction. A model successfully able to ground objects has use-cases such as assisting surgeons during a procedure by handing surgeons the correct tools. Thus, we design the object grounding task around this scenario, where models must identify the name of the object being referred to by verbal and nonverbal expressions.
Perspective-Aware Object Grounding (POG): Similar to the object grounding task, the perspective-aware object grounding task involves determining which object is being referred to, but this task includes the verbal perspective (either ego, exo, or neutral). Although real-world human-AI interactions will not always contain the perspective of a given relation, including the perspective allows us to determine whether understanding perspective can help in grounding objects.
Figure: Word cloud and object frequency distributions for the object grounding (OG) and perspective-aware object grounding (POG) tasks. In the word cloud, the size of a word represents how frequently it occurs in the verbal utterances. Therefore, the most frequent words describe general properties of objects or are general words inside questions, such as color, perspective, and spatial relations/locations. In the diagrams of object frequencies for the object grounding and perspective-aware object grounding tasks, the most referred objects all have the same frequencies (these tasks have the same object distributions). Expr.: expression, Dist.: distribution.

Object Counting (OC): As understanding what object a human is referring to in a scene involves interpreting the number of objects inside that scene, understanding the number of objects in a scene can serve as an auxiliary task for the object grounding task. If models are able to create salient multimodal representations that attend to all the objects in a given scene, it is likely they will be able to ground particular objects better. Thus, in the object counting task, the number of objects in a scene is asked based on different spatial relations.
Object Attribute Query (OAQ): It is often important in human-human interactions to identify particular attributes of objects. Additionally, this information can be used as auxiliary information for tasks such as the Object Grounding task, where the goal is to identify objects. We design the Object Attribute Query task around this particular situation, where the color of a given object is queried for.
Object Attribute Compare (OAC): Humans often exchange information throughout conversations through the use of comparison of different object attributes. This exchange of information can assist in understanding the different objects an actor is referring to. Thus, we design the object attribute compare task, where the attributes of two different objects in the scene are compared.
Perspective Grounding (PG): Understanding human verbal perspective is integral to successful human-AI communication, as humans interchangeably describe objects from their perspective as well as the perspective of others. We simulate this in the perspective grounding task using three different perspectives: neutral, egocentric (speaker), and exocentric (observer).
Relation Grounding (RG): As described in Islam et al. (2022b), the relation grounding task involves determining whether the supplied verbal and nonverbal signals align with respect to describing the same object. Understanding whether or not a human is accurately verbally and nonverbally referring to an object can enable the identification of human mistakes. We add complexity to this task through the variation of verbal perspective in the question.
Figure: Output distributions for the object counting (OC) and object attribute compare (OAC) tasks. Both distributions are not completely even due to different observed scene probabilities. For the object counting (OC) task, lower numbers have higher probabilities of occurring because the number of objects in the scene ranges from 4 to 10, hence the imbalance in distributions. Similarly, in the object attribute compare task different object colors are queried for, and since the colors of objects are not completely balanced, the task distribution is imbalanced. Dist. = Distribution.

Section: E.1 EQA TASK TEMPLATES
In this work, we presented 8 EQA tasks. Each of these tasks has multiple sub-templates, which we present in more detail in Table 11. Each sub-template has multiple degrees of freedom along which it can vary, ensuring that the generated embodied questions are diverse. For example, since most sub-templates use the absolute location of an object, this absolute location can often be described from either the observer or the speaker perspective.
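As an illustration of how a sub-template's degrees of freedom translate into diverse questions, the sketch below fills one of the Object Attribute Compare templates listed in Table 11 with different object names. It is a simplified illustration and not the CAESAR generation code: the helper names `fill_template` and `generate_questions` are hypothetical, and the real generator also varies perspective, spatial relations, and nonverbal cues.

```python
from itertools import product
from typing import Dict, List

# Template text taken from the Object Attribute Compare entry of Table 11.
OAC_TEMPLATE = ("Is the color of that <selected object name> "
                "the same color as the <relational object name>?")


def fill_template(template: str, slots: Dict[str, str]) -> str:
    """Replace every <slot> placeholder in `template` with its value."""
    question = template
    for slot, value in slots.items():
        question = question.replace(f"<{slot}>", value)
    if "<" in question:
        raise ValueError(f"unfilled slot in: {question}")
    return question


def generate_questions(selected: List[str], relational: List[str]) -> List[str]:
    """Enumerate questions over all pairs of distinct objects."""
    questions = []
    for sel, rel in product(selected, relational):
        if sel == rel:
            continue  # comparing an object with itself is not informative
        questions.append(fill_template(OAC_TEMPLATE, {
            "selected object name": sel,
            "relational object name": rel,
        }))
    return questions


questions = generate_questions(
    selected=["hand soap dispenser", "cheese"],
    relational=["soda bottle", "cheese"],
)
```

Each additional slot value multiplies the number of distinct questions, which is why even a small set of sub-templates can yield a large and varied question pool.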

Section: E.2 NEW ENVIRONMENTS IN EQA-MX
To increase dataset generalizability, we have added a shelf environment to the CAESAR simulator, and thus to the EQA-MX dataset. We visualize the three views (ego, exo, and top) for this and the table environment in Fig. 5. Because the exo and ego views in the table environment are on different sides of the table, the verbal perspectives differ. However, in the shelf environment, the exo and ego views are aligned, meaning the verbal perspectives are aligned as well. We designed the environment this way to ensure models encounter differing situations with regard to views and perspective. Additionally, since the shelf has objects below or on top of one another, it adds diversity with respect to spatial relations/locations, ensuring models understand these relations/locations in all 3 dimensions.

Section: F ADDITIONAL DATASET ANALYSES
We have thoroughly analyzed the different aspects of data samples in our dataset, EQA-MX. We visualize the output distribution for all EQA tasks, as well as the object locations with respect to different spatial relations/locations, and the most frequent words found in our dataset.

Section: F.1 TASK OUTPUT DISTRIBUTIONS
As shown in Figs. 6 and 7, we balance the outputs of our task distributions where possible to ensure the EQA-MX dataset is not biased. For the OG and POG tasks, the output distribution of all 52 categories is balanced to ensure models are not biased toward a particular object. Additionally, in Fig. 6, all binary tasks (EP, OAC, and PG) contain a 50/50 split between yes and no answers. Because the CAESAR simulator randomly generates scenes populated with objects, the OC and OAC tasks do not have even task distributions. This can be explained by these tasks involving observed characteristics of scenes, where some characteristics are more common than others. For example, since the maximum number of objects that can be generated in a scene is 10, the probability of an object having 9 objects to the left of it is much lower than the probability of an object having 2 objects to the left of it. Similarly, certain colors are more common among objects inside the CAESAR simulator. These distributions are made more apparent in Fig. 8 (we report macro accuracy for models trained on these tasks).
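The reason for reporting macro accuracy on imbalanced tasks can be seen in a small worked example. The sketch below assumes that macro accuracy means the unweighted mean of per-class accuracy, which is the common definition but is not spelled out in the text. The labels are toy values and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def micro_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of all samples that are predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class accuracy, so rare classes count as
    much as frequent ones."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Imbalanced toy labels for a counting task: the answer 2 is common, 9 is rare.
y_true = [2, 2, 2, 2, 2, 2, 2, 2, 9, 9]
y_pred = [2] * 10  # a degenerate model that always predicts the majority answer

micro = micro_accuracy(y_true, y_pred)
macro = macro_accuracy(y_true, y_pred)
```

Here the majority-class predictor reaches a micro accuracy of 0.8 but a macro accuracy of only 0.5, so macro accuracy does not reward a model for exploiting the imbalance in the answer distribution.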

Section: F.2 OBJECT LOCATIONS ANALYSES
We visualize object locations in the EQA-MX dataset to show which spatial relations are biased and which are not (Fig. 9). In particular, since one of our contributions is the shelf environment, we show that, because its visual views are aligned, certain visual cues exhibit bias.

Table 10: Statistics of EQA-MX and existing datasets.
Datasets | No. of Images | No. of Samples | Object Categories | Avg. Words *
PointAt (Schauerte et al., 2010) | 220 | 220 | 28 | -
ReferAt (Schauerte & Fink, 2010) | 242 | 242 | 28 | -
IPO (Shukla et al., 2015) | 278 | 278 | 10 | -
IMHF (Shukla et al., 2016) | 1716 | 1716 | 28 | -
RefIt (Kazemzadeh et al., 2014) | 19,894 | 130,525 | 238 | 3.61
RefCOCO (Yu et al., 2016) | 19,994 | 142,209 | 80 | 3.61
RefCOCO+ (Yu et al., 2016) | 19,992 | 141,564 | 80 | 3.53
RefCOCOg (Mao et al., 2016) | 26,711 | 104,560 | 80 | 8.43
Flickr30k (Plummer et al., 2015) | 31,783 | 158,280 | 44,518 | -
GuessWhat? (De Vries et al., 2017) | 66,537 | 155,280 | - | -
Cops-Ref (Chen et al., 2020b) | 75,299 | 148,712 | 508 | 14.40
CLEVR-Ref+ (Liu et al., 2019) | 99,992 | 998,743 | 3 | 22.40
DAQUAR (Malinowski et al., 2017) | 1449 | 124,68 | 37 | 11.5
FM-IQA (Gao et al., 2015) | 157,392 | 316,193 | - | 7.38
Visual Madlibs (Yu et al., 2015) | 107,38 | 360,001 | - | 6.9
Visual Genome (Krishna et al., 2017) | 108,000 | 1,445,332 | 37 | 5.7
DVQA (Kafle et al., 2018) | 300,000 | 3,487,194 | - | -
VQA (COCO) (Antol et al., 2015) | 204,721 | 614,163 | 80 | 6.2
VQA (Abs.) (Antol et al., 2015) | 50,000 | 150,000 | 100 | 6.2
Visual 7W (Zhu et al., 2016) | 47,300 | 327,939 | 36,579 | 6.9
KB-VQA (Wang et al., 2015) | 700 | 5826 | 23 | 6.8
FBQA (Wang et al., 2017) | 2190 | 5826 | 32 | 9.5
VQA-MED (Hasan et al., 2018) | 2866 | 6413 | - | -
DocVQA (Mathew et al., 2021) | 12,767 | 50,000 | - | -
YouRefIt (Chen et al., 2021) | 497,348 | 4,195 | 395 | 3.73
GRiD-3D (Lee et al., 2022) | 8,000 | 445,000 | 28 | -
EQA † (Das et al., 2018a) | 5,000 | 5,000 | 50 | -
MT-EQA † (Das et al., 2018a) | 19,287 | 19,287 | 61 | -
CAESAR-L (Islam et al., 2022b) | 11,617,626 | 124,412 | 61 | 5.56
CAESAR-XL (Islam et al., 2022b) | 841,620 | | |

Section: Object Attribute Query
Template: What is the color of that object/thing?
Template: What is the color of the <object name>?
Task Example: What is the color of the hand soap dispenser?

Section: Object Attribute Compare


Template: Is the color of that object/thing the same color as the <relational object name>?
Task Example: Is the color of that thing the same color as the cheese?
Template: Is the color of that <selected object name> the same color as the <relational object name>?
Task Example: Is the color of that hand soap dispenser the same color as the soda bottle?

Section: Perspective Grounding


Template: <Referring expressions using the templates from CAESAR>. From which perspective is the object described?
Task Example: The hand soap dispenser above the soda bottle. From which perspective is the object described?

Section: Relation Grounding


Template: <Referring expressions using the templates from CAESAR>, is the object referred to appropriately?
Task Example: The hand soap dispenser above the cucumber, is the object referred to appropriately?
Template: Considering the observer's perspective, <Referring expressions using the templates from CAESAR>, is the object referred to appropriately?
Task Example: Considering the observer's perspective, the hand soap dispenser below the cucumber, is the object referred to appropriately?
Template: Considering the speaker's perspective, <Referring expressions using the templates from CAESAR>, is the object referred to appropriately?
Task Example: Considering the speaker's perspective, the hand soap next to the coffee maker, is the object referred to appropriately?

Section: E EMBODIED QUESTION ANSWERING TASK AND DATASET ADDITIONAL INFORMATION
We include additional information on the EQA-MX dataset compared to previous EQA datasets in Tables 9 and 10.


References:
[b0] Panos Achlioptas; Ahmed Abdelreheem; Fei Xia; Mohamed Elhoseiny; Leonidas J Guibas (2020). ReferIt3D: Neural listeners for fine-grained 3d object identification in real-world scenes. 
[b1] Arjun Akula; Varun Jampani; Soravit Changpinyo; Song-Chun Zhu (2021). Robust visual reasoning via language guided neural module networks. Advances in Neural Information Processing Systems
[b2] Jacob Andreas; Marcus Rohrbach; Trevor Darrell; Dan Klein (2016). Neural module networks. 
[b3] Alan Ansell; Edoardo Maria Ponti; Jonas Pfeiffer; Sebastian Ruder; Goran Glavaš; Ivan Vulić; Anna Korhonen (2021). MAD-G: Multilingual adapter generation for efficient cross-lingual transfer.
[b4] Stanislaw Antol; Aishwarya Agrawal; Jiasen Lu; Margaret Mitchell; Dhruv Batra; C Lawrence Zitnick; Devi Parikh (2015). VQA: Visual Question Answering. 
[b5] Tilman Beck; Bela Bohlender; Christina Viehmann; Vincent Hane; Yanik Adamson; Jaber Khuri; Jonas Brossmann; Jonas Pfeiffer; Iryna Gurevych (2022). Adapterhub playground: Simple and flexible few-shot learning with adapters. 
[b6] George Butterworth; Fabia Franco; B Mckenzie; Lida Graupner; Brenda Todd (2002). Dynamic aspects of visual event perception and the production of pointing by human infants. British Journal of Developmental Psychology
[b7] Yen-Chun Chen; Linjie Li; Licheng Yu; Ahmed El Kholy; Faisal Ahmed; Zhe Gan; Yu Cheng; Jingjing Liu (2020). Uniter: Universal image-text representation learning. Springer
[b8] Yixin Chen; Qing Li; Deqian Kong; Yik Lun Kei; Song-Chun Zhu; Tao Gao; Yixin Zhu; Siyuan Huang (2021). Yourefit: Embodied reference understanding with language and gesture. 
[b9] Zhenfang Chen; Peng Wang; Lin Ma; Kwan-Yee K Wong; Qi Wu (2020). Cops-ref: A new dataset and task on compositional referring expression comprehension.
[b10] Shih-Han Chou; Wei-Lun Chao; Wei-Sheng Lai; Min Sun; Ming-Hsuan Yang (2020). Visual question answering on 360° images.
[b11] Cristina Colonnesi; Geert Jan J M Stams; Irene Koster; Marc J Noom (2010). The relation between pointing and language development: A meta-analysis. Developmental Review
[b12] Valerie Corkum; Chris Moore (1998). The origins of joint visual attention in infants. Developmental psychology
[b13] Abhishek Das; Samyak Datta; Georgia Gkioxari; Stefan Lee; Devi Parikh; Dhruv Batra (2018). Embodied question answering. 
[b14] Abhishek Das; Georgia Gkioxari; Stefan Lee; Devi Parikh; Dhruv Batra (2018). Neural modular control for embodied question answering. PMLR
[b15] Harm De Vries; Florian Strub; Sarath Chandar; Olivier Pietquin; Hugo Larochelle; Aaron Courville (2017). Guesswhat?! visual object discovery through multi-modal dialogue.
[b16] Jacob Devlin; Ming-Wei Chang; Kenton Lee; Kristina Toutanova (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. 
[b17] Alexey Dosovitskiy; Lucas Beyer; Alexander Kolesnikov; Dirk Weissenborn; Xiaohua Zhai; Thomas Unterthiner; Mostafa Dehghani; Matthias Minderer; Georg Heigold; Sylvain Gelly (2020). An image is worth 16x16 words: Transformers for image recognition at scale. 
[b18] W A Falcon (2019). Pytorch lightning.
[b19] Chen Gao; Jinyu Chen; Si Liu; Luting Wang; Qiong Zhang; Qi Wu (2021). Room-and-object aware knowledge reasoning for remote embodied referring expression. 
[b20] Haoyuan Gao; Junhua Mao; Jie Zhou; Zhiheng Huang; Lei Wang; Wei Xu (2015). Are you talking to a machine? dataset and methods for multilingual image question. Advances in neural information processing systems
[b21] Daniel Gordon; Aniruddha Kembhavi; Mohammad Rastegari; Joseph Redmon; Dieter Fox; Ali Farhadi (2018). Iqa: Visual question answering in interactive environments. 
[b22] Yash Goyal; Tejas Khot; Douglas Summers-Stay; Dhruv Batra; Devi Parikh (2017). Making the v in vqa matter: Elevating the role of image understanding in visual question answering. 
[b23] Danna Gurari; Qing Li; Abigale J Stangl; Anhong Guo; Chi Lin; Kristen Grauman; Jiebo Luo; Jeffrey P Bigham (2018). Vizwiz grand challenge: Answering visual questions from blind people. 
[b24] Sadid A Hasan; Yuan Ling; Oladimeji Farri; Joey Liu; Henning Müller; Matthew Lungren (2018-09). Overview of imageclef 2018 medical domain visual question answering task.
[b25] Kaiming He; Xiangyu Zhang; Shaoqing Ren; Jian Sun (2016). Deep residual learning for image recognition. 
[b26] Li-Chi Huang; Kuldeep Kulkarni; Anik Jha; Suhas Lohit; Suren Jayasuriya; Pavan Turaga (2018). Cs-vqa: visual question answering with compressively sensed images. IEEE
[b27] Drew A Hudson; Christopher D Manning (2019). Gqa: A new dataset for real-world visual reasoning and compositional question answering.
[b28] Md Mofijul Islam; Tariq Iqbal (2020). Hamlet: A hierarchical multimodal attention-based human activity recognition algorithm.
[b29] Md Mofijul Islam; Tariq Iqbal (2022-06). Mumu: Cooperative multitask learning-based guided multimodal fusion.
[b30] Md Mofijul Islam; Alexi Gladstone; Tariq Iqbal (2022). PATRON: Perspective-aware multitask model for referring expression grounding using embodied multimodal cues. 
[b31] Md Mofijul Islam; Reza Manuel Mirzaiee; Alexi Gladstone; Haley N Green; Tariq Iqbal (2022). CAESAR: A multimodal simulator for generating embodied relationship grounding dataset.
[b32] Jana M Iverson; Susan Goldin-Meadow (2005). Gesture paves the way for language development. Psychological science
[b33] Yunfan Jiang; Agrim Gupta; Zichen Zhang; Guanzhi Wang; Yongqiang Dou; Yanjun Chen; Li Fei-Fei; Anima Anandkumar; Yuke Zhu; Linxi Fan (2022). Vima: General robot manipulation with multimodal prompts. 
[b34] Kushal Kafle; Brian Price; Scott Cohen; Christopher Kanan (2018). Dvqa: Understanding data visualizations via question answering. 
[b35] Samira Ebrahimi Kahou; Vincent Michalski; Adam Atkinson; Ákos Kádár; Adam Trischler; Yoshua Bengio (2017). Figureqa: An annotated figure dataset for visual reasoning. 
[b36] Aishwarya Kamath; Mannat Singh; Yann Lecun; Gabriel Synnaeve; Ishan Misra; Nicolas Carion (2021). Mdetr-modulated detection for end-to-end multi-modal understanding. 
[b37] Sahar Kazemzadeh; Vicente Ordonez; Mark Matten; Tamara Berg (2014-10). ReferItGame: Referring to objects in photographs of natural scenes. 
[b38] Aniruddha Kembhavi; Mike Salvato; Eric Kolve; Minjoon Seo; Hannaneh Hajishirzi; Ali Farhadi (2016). A diagram is worth a dozen images. Springer
[b39] Wonjae Kim; Bokyung Son; Ildoo Kim (2021). Vilt: Vision-and-language transformer without convolution or region supervision. PMLR
[b40] Sotaro Kita (2003). Pointing: Where language, culture, and cognition meet. Psychology Press
[b41] Quan Kong; Ziming Wu; Ziwei Deng; Martin Klinkigt; Bin Tong; Tomokazu Murakami (2019). MMAct: A large-scale dataset for cross modal human action understanding. 
[b42] Philipp Kratzer; Simon Bihlmaier; Niteesh Balachandra Midlagajni; Rohit Prakash; Marc Toussaint; Jim Mainprice (2020). Mogaze: A dataset of full-body motions that includes workspace geometry and eye-gaze. IEEE Robotics and Automation Letters
[b43] Ranjay Krishna; Yuke Zhu; Oliver Groth; Justin Johnson; Kenji Hata; Joshua Kravitz; Stephanie Chen; Yannis Kalantidis; Li-Jia Li; David A Shamma (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision
[b44] Jae Hee Lee; Matthias Kerzel; Kyra Ahrens; Cornelius Weber; Stefan Wermter (2022). What is right for me is not yet right for you: A dataset for grounding relative directions via multi-task learning. 
[b45] Liunian Harold Li; Mark Yatskar; Da Yin; Cho-Jui Hsieh; Kai-Wei Chang (2019). Visualbert: A simple and performant baseline for vision and language.
[b46] Paul Pu Liang; Amir Zadeh; Louis-Philippe Morency (2022). Foundations and recent trends in multimodal machine learning: Principles, challenges, and open questions. 
[b47] Tsung-Yi Lin; Michael Maire; Serge Belongie; James Hays; Pietro Perona; Deva Ramanan; Piotr Dollár; C Lawrence Zitnick (2014). Microsoft coco: Common objects in context. Springer
[b48] Ulf Liszkowski; Malinda Carpenter; Anne Henning; Tricia Striano; Michael Tomasello (2004). Twelve-month-olds point to share attention and interest. Developmental science
[b49] Dianbo Liu; Alex Lamb; Kenji Kawaguchi; Anirudh Goyal; Chen Sun; Michael Mozer; Yoshua Bengio (2021). Discrete-valued neural communication in structured architectures enhances generalization. 
[b50] Fangyu Liu; Guy Edward Toh Emerson; Nigel Collier (2022). Visual spatial reasoning. 
[b51] Runtao Liu; Chenxi Liu; Yutong Bai; Alan L Yuille (2019). Clevr-ref+: Diagnosing visual reasoning with referring expressions. 
[b52] Xiang Long; Chuang Gan; Gerard De Melo; Xiao Liu; Yandong Li; Fu Li; Shilei Wen (2018). Multimodal keyless attention fusion for video classification. 
[b53] Ilya Loshchilov; Frank Hutter (2017). Decoupled weight decay regularization. 
[b54] Jiasen Lu; Dhruv Batra; Devi Parikh; Stefan Lee (2019). Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. 
[b55] Haonan Luo; Guosheng Lin; Zichuan Liu; Fayao Liu; Zhenmin Tang; Yazhou Yao (2019). Segeqa: Video segmentation based visual attention for embodied question answering. 
[b56] Mateusz Malinowski; Marcus Rohrbach; Mario Fritz (2017). Ask your neurons: A deep learning approach to visual question answering. International Journal of Computer Vision
[b57] Junhua Mao; Jonathan Huang; Alexander Toshev; Oana Camburu; Alan Yuille; Kevin Murphy (2016). Generation and comprehension of unambiguous object descriptions. 
[b58] Minesh Mathew; Dimosthenis Karatzas; C V Jawahar (2021). Docvqa: A dataset for vqa on document images.
[b59] David Mcneill (2012). How language began: Gesture and speech in human evolution. Cambridge University Press
[b60] Anand Mishra; Shashank Shekhar; Ajeet Kumar Singh; Anirban Chakraborty (2019). Ocr-vqa: Visual question answering by reading text in images. IEEE
[b61] Ferda Ofli; Rizwan Chaudhry; Gregorij Kurillo; René Vidal; Ruzena Bajcsy (2013). Berkeley mhad: A comprehensive multimodal human action database. IEEE
[b62] Adam Paszke; Sam Gross; Francisco Massa; Adam Lerer; James Bradbury; Gregory Chanan; Trevor Killeen; Zeming Lin; Natalia Gimelshein; Luca Antiga (2019). Pytorch: An imperative style, high-performance deep learning library.
[b63] Jonas Pfeiffer; Andreas Rücklé; Clifton Poth; Aishwarya Kamath; Ivan Vulić; Sebastian Ruder; Kyunghyun Cho; Iryna Gurevych (2020). Adapterhub: A framework for adapting transformers. 
[b64] Jonas Pfeiffer; Ivan Vulić; Iryna Gurevych; Sebastian Ruder (2020). Mad-x: An adapter-based framework for multi-task cross-lingual transfer. 
[b65] Jonas Pfeiffer; Aishwarya Kamath; Andreas Rücklé; Kyunghyun Cho; Iryna Gurevych (2021). Adapterfusion: Non-destructive task composition for transfer learning. 
[b66] Bryan A Plummer; Liwei Wang; Chris M Cervantes; Juan C Caicedo; Julia Hockenmaier; Svetlana Lazebnik (2015). Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. 
[b67] Alec Radford; Jong Wook Kim; Chris Hallacy; Aditya Ramesh; Gabriel Goh; Sandhini Agarwal; Girish Sastry; Amanda Askell; Pamela Mishkin; Jack Clark (2021). Learning transferable visual models from natural language supervision. PMLR
[b68] Mengye Ren; Ryan Kiros; Richard Zemel (2015). Exploring models and data for image question answering. Advances in neural information processing systems
[b69] Andreas Rücklé; Gregor Geigle; Max Glockner; Tilman Beck; Jonas Pfeiffer; Nils Reimers; Iryna Gurevych (2021). Adapterdrop: On the efficiency of adapters in transformers. 
[b70] Michael Scaife; Jerome S Bruner (1975). The capacity for joint visual attention in the infant. Nature
[b71] Boris Schauerte; Gernot A Fink (2010). Focusing computational visual attention in multi-modal human-robot interaction. Association for Computing Machinery
[b72] Boris Schauerte; Jan Richarz; Gernot A Fink (2010). Saliency-based identification and recognition of pointed-at objects. 
[b73] Dadhichi Shukla; Özgür Erkent; Justus Piater (2015). Probabilistic detection of pointing directions for human-robot interaction.
[b74] Dadhichi Shukla; Özgür Erkent; Justus Piater (2016). A multi-view hand gesture rgb-d dataset for human-robot interaction scenarios. 
[b75] Hao Tan; Mohit Bansal (2019). LXMERT: Learning cross-modality encoder representations from transformers.
[b76] Sinan Tan; Weilai Xiang; Huaping Liu; Di Guo; Fuchun Sun (2020). Multi-agent embodied question answering in interactive environments. Springer
[b77] Aaron Van Den Oord; Oriol Vinyals (2017). Neural discrete representation learning. Advances in neural information processing systems
[b78] Jette Viethen; Robert Dale (2008). The use of spatial relations in referring expression generation. 
[b79] Limin Wang; Yuanjun Xiong; Zhe Wang; Yu Qiao; Dahua Lin; Xiaoou Tang; Luc Van Gool (2016). Temporal segment networks: Towards good practices for deep action recognition. 
[b80] Peng Wang; Qi Wu; Chunhua Shen; Anton Van Den Hengel; Anthony Dick (2015). Explicit knowledge-based reasoning for visual question answering.
[b81] Peng Wang; Qi Wu; Chunhua Shen; Anthony Dick; Anton Van Den Hengel (2017). Fvqa: Fact-based visual question answering. IEEE transactions on pattern analysis and machine intelligence
[b82] Kaiyu Yang; Olga Russakovsky; Jia Deng (2019). Spatialsense: An adversarially crowdsourced benchmark for spatial relation recognition. 
[b83] Sibei Yang; Guanbin Li; Yizhou Yu (2019). Cross-modal relationship inference for grounding referring expressions. 
[b84] Licheng Yu; Eunbyung Park; Alexander C Berg; Tamara L Berg (2015). Visual madlibs: Fill in the blank image generation and question answering. 
[b85] Licheng Yu; Patrick Poirson; Shan Yang; Alexander C Berg; Tamara L Berg (2016). Modeling context in referring expressions. Springer International Publishing
[b86] Licheng Yu; Xinlei Chen; Georgia Gkioxari; Mohit Bansal; Tamara L Berg; Dhruv Batra (2019). Multitarget embodied question answering. 
[b87] Yuke Zhu; Oliver Groth; Michael Bernstein; Li Fei-Fei (2016). Visual7w: Grounded question answering in images. 

Figures:
Figure fig_0: 2
Type: figure
Caption: Figure 2: EQA tasks for a data sample from EQA-MX. Top row: data distribution for each task in EQA-MX (left) and an embodied interaction with multiple visual perspectives (right). Bottom row: name of the task (left), example questions and answers for the given task based on the visual scene above (middle), and the set of possible answers (right).
Data: 

Figure fig_1: 3
Type: figure
Caption: Figure 3: EQA-MX dataset analysis. (a) shows the varied question lengths in the EQA-MX dataset, indicating differing contextual information across EQA tasks. (b) presents the ratios of data samples with different verbal perspectives for the perspective grounding (PG) task. (c) depicts object locations for different spatial relations; because the locations are not separable, the EQA-MX dataset is not biased towards particular verbal or visual perspectives. For detailed analysis, refer to the supplementary materials.
Data: 

Figure fig_2: 4
Type: figure
Caption: Figure 4: VQ-Fusion: Vector Quantization (VQ) based multimodal learning model architecture. VQ-Fusion extracts multiview visual representations using visual encoders, which are then discretized using shared codebooks. The shared codebooks' bottleneck allows the model to learn unified concepts across multiple views. Finally, discretized visual representations are fused with discrete verbal representations to produce a multimodal representation.
Data: 

Figure fig_3: 
Type: figure
Caption: Figure 6: Distributions of task outputs in the existence prediction (EP), object attribute compare (OAC), and relation grounding (RG) tasks. All these tasks have balanced binary outputs.
Data: 

Figure fig_4: 7
Type: figure
Caption: Figure 7: A verbal expression word cloud for the EQA-MX dataset, as well as the output distribution for the object grounding (OG) and perspective-aware object grounding (POG) tasks. In the word cloud, the size of each word represents how frequently it occurs in the verbal utterances. The most frequent words describe general properties of objects or are general question words, such as color, perspective, and spatial relations/locations. In the object frequency diagrams for the object grounding and perspective-aware object grounding tasks, the most referred objects all have the same frequencies (these tasks have the same object distributions). Expr.: expression, Dist.: distribution.
Data: 

Figure fig_5: 
Type: figure
Caption: Figure 8: Distribution of task outputs in the object counting and object attribute compare tasks. Both distributions are not completely even due to different observed scene probabilities. For the object counting (OC) task, lower numbers have higher probabilities of occurring because the number of objects in the scene ranges from 4 to 10, hence the imbalance in distributions. Similarly, in the object attribute compare task, different object colors are queried for, and since the colors of objects are not completely balanced, the task distribution is imbalanced. Dist. = Distribution.
Data: 

Figure fig_6: 9
Type: figure
Caption: Figure 9: Object locations visualized for different spatial relations/locations across the EQA-MX dataset. The object locations are not easily separable based on spatial relations/locations that vary based on perspectives. (a, b) demonstrate that the shelf environment has more non-separable locations/relations because the verbal perspective in the shelf environment does not vary with the visual perspective. (c) is generally linearly separable, as expected, since the center of a given scene is objective. (d) demonstrates that opposing corners (i.e., front left and back right) are non-separable because they vary with verbal perspectives.
Data: 

Figure tab_0: 2
Type: table
Caption: EQA-MX dataset splits for 8 EQA tasks.
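Data:
| Splits | EP | OG | POG | OC | OAQ | OAC | PG | RG |
|---|---|---|---|---|---|---|---|---|
| Train | 1060k | 1060k | 1060k | 1060k | 1060k | 218k | 785k | 349k |
| Valid | 126k | 126k | 126k | 126k | 126k | 27k | 93k | 41k |
| Test | 126k | 126k | 126k | 126k | 126k | 28k | 93k | 42k |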

Figure tab_1: 3
Type: table
Caption: Comparisons of VL models performance for EQA tasks. The results suggest that incorporating VQ-Fusion in VL models can improve the performance of EQA tasks. ✓: VL models with VQ-Fusion, and ✗: VL models without VQ-Fusion.
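Data:
| Models | EP ✗ | EP ✓ | OG ✗ | OG ✓ | POG ✗ | POG ✓ | OC ✗ | OC ✓ |
|---|---|---|---|---|---|---|---|---|
| Dual Encoder | 53.46 | 55.78 | 48.31 | 49.96 | 83.91 | 84.28 | 12.28 | 12.38 |
| CLIP | 53.17 | 54.72 | 54.06 | 65.49 | 70.92 | 82.70 | 09.65 | 13.14 |
| VisualBERT | 50.00 | 54.51 | 53.39 | 54.50 | 86.09 | 87.09 | 14.09 | 14.35 |
| ViLT | 90.24 | 91.50 | 59.74 | 61.04 | 86.10 | 87.42 | 11.14 | 12.54 |

| Models | OAQ ✗ | OAQ ✓ | OAC ✗ | OAC ✓ | PG ✗ | PG ✓ | RG ✗ | RG ✓ |
|---|---|---|---|---|---|---|---|---|
| Dual Encoder | 63.71 | 66.90 | 57.92 | 61.45 | 66.72 | 66.77 | 75.78 | 89.36 |
| CLIP | 70.85 | 74.32 | 58.59 | 70.59 | 66.64 | 66.99 | 85.84 | 89.93 |
| VisualBERT | 51.43 | 54.45 | 58.56 | 59.98 | 66.37 | 79.11 | 89.13 | 89.26 |
| ViLT | 55.96 | 59.47 | 58.93 | 60.16 | 80.36 | 81.23 | 87.36 | 88.68 |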
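
To make the size of the improvements concrete, the gain from VQ-Fusion can be computed per model and task as the difference between the ✓ and ✗ accuracies. The sketch below does this for the values reported in this table; the dictionary layout and the names `RESULTS` and `gains` are illustrative choices, not part of the paper.

```python
TASKS = ["EP", "OG", "POG", "OC", "OAQ", "OAC", "PG", "RG"]

# (without VQ-Fusion, with VQ-Fusion) accuracy per task, copied from the table.
RESULTS = {
    "Dual Encoder": [(53.46, 55.78), (48.31, 49.96), (83.91, 84.28), (12.28, 12.38),
                     (63.71, 66.90), (57.92, 61.45), (66.72, 66.77), (75.78, 89.36)],
    "CLIP": [(53.17, 54.72), (54.06, 65.49), (70.92, 82.70), (9.65, 13.14),
             (70.85, 74.32), (58.59, 70.59), (66.64, 66.99), (85.84, 89.93)],
    "VisualBERT": [(50.00, 54.51), (53.39, 54.50), (86.09, 87.09), (14.09, 14.35),
                   (51.43, 54.45), (58.56, 59.98), (66.37, 79.11), (89.13, 89.26)],
    "ViLT": [(90.24, 91.50), (59.74, 61.04), (86.10, 87.42), (11.14, 12.54),
             (55.96, 59.47), (58.93, 60.16), (80.36, 81.23), (87.36, 88.68)],
}

# Absolute accuracy gain (percentage points) for every model/task pair.
gains = {
    model: {task: round(with_vq - without_vq, 2)
            for task, (without_vq, with_vq) in zip(TASKS, pairs)}
    for model, pairs in RESULTS.items()
}

# Largest single gain across all models and tasks.
best_gain, best_model, best_task = max(
    (gain, model, task)
    for model, per_task in gains.items()
    for task, gain in per_task.items()
)
print(best_model, best_task, best_gain)
```

On these numbers every gain is positive, and the largest is 13.58 points (Dual Encoder on RG), which is in line with the "up to 13%" figure quoted in the abstract.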

Figure tab_2: 4
Type: table
Caption: Impact of gaze (G) and pointing gestures (PG) in learning EQA tasks. The results suggest that incorporating gestures improves EQA task performance. G (✗) and PG (✗) indicate visual scenes that do not include humans.
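Data:
| Gaze (G) | Pointing (PG) | EP | OG | POG | OC | OAQ | OAC | PG | RG |
|---|---|---|---|---|---|---|---|---|---|
| ✗ | ✗ | 51.03 | 26.65 | 52.79 | 09.94 | 24.01 | 51.22 | 48.95 | 56.75 |
| ✗ | ✓ | 53.87 | 60.66 | 71.08 | 11.51 | 64.69 | 60.63 | 66.31 | 90.01 |
| ✓ | ✗ | 53.51 | 63.49 | 70.90 | 12.29 | 69.43 | 61.25 | 66.67 | 87.23 |
| ✓ | ✓ | 54.38 | 68.61 | 79.68 | 11.86 | 72.62 | 60.74 | 66.68 | 89.59 |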

Figure tab_3: 5
Type: table
Caption: Impact of the number of VQ codebooks (VQ CBs) in VQ-Fusion with the CLIP model in learning EQA tasks.
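Data:
| VQ CBs | EP | OG | POG | OC | OAQ | OAC | PG | RG |
|---|---|---|---|---|---|---|---|---|
| 2 | 53.46 | 64.86 | 82.70 | 13.14 | 61.39 | 57.43 | 61.39 | 88.24 |
| 4 | 52.15 | 61.12 | 73.94 | 11.35 | 69.42 | 70.59 | 60.30 | 89.93 |
| 8 | 54.72 | 65.49 | 73.97 | 11.92 | 70.85 | 60.68 | 66.82 | 88.23 |
| 16 | 53.19 | 55.12 | 71.32 | 11.43 | 69.35 | 60.37 | 66.99 | 84.36 |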

Figure tab_5: 6
Type: table
Caption: We trained CLIP models with VQ-Fusion using different combinations of modalities on the 8 tasks described in Figure 2 in the paper. Top table: only verbal questions. Bottom table: different visual modalities and verbal questions. The results suggest that multimodal models outperform those using only verbal data (top table). Additionally, training models with multiview data leads to robust performance, while using a subset of views results in performance degradation if the views change during testing (bottom table). Existence Prediction (EP), Object Grounding (OG), Perspective-Aware Object Grounding (POG), Object Counting (OC), Object Attribute Query (OAQ), Object Attribute Compare (OAC), Perspective Grounding (PG), Relation Grounding (RG).
Data:
| Input | EP | OG | POG | OC | OAQ | OAC | PG | RG |
|---|---|---|---|---|---|---|---|---|
| Only Verbal | 40.64 | 8.90 | 45.46 | 7.45 | 7.69 | 29.49 | 45.23 | 44.82 |

| Train | Test | EP | OG | POG | OC | OAQ | OAC | PG | RG |
|---|---|---|---|---|---|---|---|---|---|
| Ego | Ego | 53.86 | 59.92 | 70.98 | 10.60 | 68.56 | 61.86 | 64.41 | 87.54 |
| Ego | Exo | 52.61 | 17.28 | 62.45 | 8.96 | 15.06 | 56.62 | 63.39 | 82.33 |
| Exo | Exo | 53.67 | 39.46 | 69.96 | 11.24 | 56.76 | 60.20 | 66.39 | 88.58 |
| Exo | Ego | 52.84 | 21.39 | 69.70 | 10.78 | 25.03 | 58.68 | 64.49 | 88.20 |
| ALL | ALL | 54… | … | … | … | … | … | … | 89.93 |
| ALL | Ego | 54.32 | 60.63 | 82.31 | 12.22 | 69.84 | 60.89 | 66.71 | 89.03 |
| ALL | Exo | 54.17 | 59.14 | 78.02 | 12.55 | 61.71 | 62.25 | 66.53 | 89.26 |

Figure tab_6: 
Type: table
Caption: The MMAct dataset comprises 37 common daily life activities, each performed by 20 individuals and repeated five times. The dataset includes seven modalities, ranging from RGB data to acceleration and gyroscope measurements. Our experiments focused on utilizing two available viewpoints of RGB videos, as well as acceleration, gyroscope, and orientation data. Notably, the MMAct dataset also includes visually occluded data samples, providing an opportunity to evaluate the effectiveness of multimodal learning approaches in extracting complementary features for activity recognition.
Data: EP OG POG OC OAQ OAC PG RG
54.72 65.49 82.70 13.14 74.32 70.59 66.99 89.93
MT EP 53.25 40.76 OG EP 52.68 73.90 POG EP 52.62 49.86 PG MT EP 54.24 68.70 55.56 OAQ OG EP 53.17 66.92 66.61 PG OAQ PG 66.80 53.26 69.01 EQ OAQ

Figure tab_7: 8
Type: table
Caption: Cross-session performance comparison (F1-Score) of multimodal learning methods on MMAct dataset.
Data:
| Method | F1-Score (%) |
|---|---|
| SVM+HOG | … |

Figure tab_8: 9
Type: table
Caption: Comparison of the QA datasets. Existing VQA and EQA datasets do not contain nonverbal human gestures (NV), multiple verbal perspectives (MV), contrastive (C) and ambiguous (A) data samples. ‡ Embodied (E) interactions refer to humans interacting with multimodal expressions. † Embodied interactions refer to an agent navigating in an environment. V: Verbal and MT: Multitasks.
Data: Datasets | V | NV | E | EQA | MT | MV | Views: Exo | Ego | Top | C | A
PointAt


Formulas:
Formula formula_0: Rows of a QA dataset comparison table (dataset, ten feature marks, two size columns).
… | ✓ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ | 2k | 6k
DocVQA | ✓ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ | 12k | 50k
GRiD-3D | ✓ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ | 650k | 650k
EQA † (Das et al., 2018a) | ✓ ✗ ✓† ✓† ✗ ✗ ✓† ✗ ✗ ✗ | 5k | 5k
MT-EQA † (Das et al., 2018a) | ✓ ✗ ✓† ✓† ✗ ✗ ✓† ✗ ✗ ✗ | 19k | 19k
EQA-MX ‡ | ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ | 750k | 8,243k

Formula formula_3: $E_m = F_m(X_m), \quad m \in (\text{ego}, \text{exo}, \text{top})$

Formula formula_4: $E_m \in (E_{ego}, E_{exo}, E_{top})$.

Formula formula_5: $e_{(m,o_i)} = F_D(s_{(m,i)}), \quad o_i = \arg\min_{j \in 1 \ldots L} \lVert s_{(m,i)} - c_j \rVert$.

Formula formula_6: $E^D_m = \mathrm{CONCAT}(F_D(s_{(m,1)}), \ldots, F_D(s_{(m,G)}))$

Formula formula_7: $L_{VQ\,align} = \frac{\beta}{G} \sum_{i=1}^{G} \lVert s_i - \mathrm{sg}(c_{o_i}) \rVert_2^2$.
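
As a minimal sketch of the discretization step in formula_5 to formula_7, the snippet below splits one view's embedding into G segments, snaps each segment to its nearest codebook vector, concatenates the selected codes, and computes the forward value of the alignment loss. It is not the authors' implementation: it uses plain Python lists instead of tensors, assumes F_D is a plain codebook lookup, assumes nearest-neighbour assignment under squared Euclidean distance (the usual VQ convention), and omits the stop-gradient sg(·), which only affects backpropagation. The names `quantize` and `alignment_loss`, the toy codebook, and β = 0.25 are illustrative assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(embedding, codebook, num_segments):
    """Split `embedding` into segments and snap each to its nearest code.

    Returns (segments, indices, discretized), where `discretized` is the
    concatenation of the selected codebook vectors (E^D_m in the text).
    """
    seg_len = len(embedding) // num_segments
    if seg_len * num_segments != len(embedding):
        raise ValueError("embedding length must be divisible by num_segments")
    segments = [embedding[i * seg_len:(i + 1) * seg_len]
                for i in range(num_segments)]
    # o_i: index of the closest codebook vector for segment s_i.
    indices = [min(range(len(codebook)),
                   key=lambda j: squared_distance(seg, codebook[j]))
               for seg in segments]
    discretized = [x for o in indices for x in codebook[o]]
    return segments, indices, discretized


def alignment_loss(segments, indices, codebook, beta=0.25):
    """Forward value of (beta / G) * sum_i ||s_i - c_{o_i}||^2.

    The stop-gradient around c_{o_i} changes gradients only, so it is
    omitted in this gradient-free sketch.
    """
    total = sum(squared_distance(s, codebook[o])
                for s, o in zip(segments, indices))
    return beta * total / len(segments)


# Toy example: a 4-d embedding, G = 2 segments, L = 3 codes of size 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
embedding = [0.9, 1.2, -0.8, 0.7]
segments, indices, discretized = quantize(embedding, codebook, num_segments=2)
loss = alignment_loss(segments, indices, codebook)
print(indices, discretized, round(loss, 4))
```

Because the codebook is shared across views, two views whose segments fall near the same code receive the same discrete representation, which is the bottleneck effect described in the Figure 4 caption.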

Formula formula_8: $E_{fused} = \sum_{m \in M} \alpha_m E_m$. Here, $\alpha_m = \frac{\exp(\gamma_m)}{\sum_{m \in M} \exp(\gamma_m)}$ and $\gamma_m = (W)^T E_m$, $m \in M$.
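
The fusion rule in formula_8 is a softmax-weighted sum over modalities. The following sketch illustrates it with plain Python lists, assuming every modality embedding has the same length as the scoring vector W; the function name `fuse`, the modality names, and the toy values are hypothetical and chosen only for illustration.

```python
import math


def fuse(embeddings, w):
    """Attention-weighted sum of per-modality embeddings.

    embeddings: dict mapping modality name -> vector (all the same length)
    w: scoring vector W of that same length
    Returns (alphas, fused).
    """
    # gamma_m = W^T E_m
    gammas = {m: sum(wi * ei for wi, ei in zip(w, e))
              for m, e in embeddings.items()}
    # Subtract the max score before exponentiating for numerical stability;
    # this does not change the softmax result.
    g_max = max(gammas.values())
    exps = {m: math.exp(g - g_max) for m, g in gammas.items()}
    z = sum(exps.values())
    alphas = {m: v / z for m, v in exps.items()}
    # E_fused = sum_m alpha_m * E_m
    fused = [sum(alphas[m] * embeddings[m][k] for m in embeddings)
             for k in range(len(w))]
    return alphas, fused


embeddings = {
    "ego": [1.0, 0.0],
    "exo": [0.0, 1.0],
    "verbal": [1.0, 1.0],
}
alphas, fused = fuse(embeddings, w=[0.5, 0.5])
print(alphas, fused)
```

In this toy input the verbal embedding has the highest score, so it receives the largest weight, while the two visual views score equally and are weighted equally.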

Formula formula_9: $T_k$: $y_{T_k} = F_{T_k}(E_{fused})$.

Formula formula_10: $T_k$: $L_{task,T_k}(y_{T_k}, \hat{y}_{T_k}) = \frac{1}{B} \sum_{i=1}^{B} y_{(T_k,i)} \log \hat{y}_{(T_k,i)}$

Formula formula_11: $L = W_{VQ} L_{VQ\,align} + W_{task} L_{task,T_k}$.
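
The combined objective in formula_11 can be sketched as a weighted sum of the two loss terms. The example below assumes one-hot targets and takes the task loss to be the standard batch-averaged cross-entropy with the conventional leading minus sign, which is an assumption about the intended form of formula_10. The function names, the `eps` guard, and the weights 0.5 and 1.0 are illustrative and not taken from the paper.

```python
import math


def task_loss(targets, predictions, eps=1e-12):
    """Batch-averaged cross-entropy between one-hot targets and probabilities."""
    total = 0.0
    for y, y_hat in zip(targets, predictions):
        # eps guards against log(0) for zero-probability predictions.
        total += -sum(t * math.log(p + eps) for t, p in zip(y, y_hat))
    return total / len(targets)


def total_loss(vq_align, task, w_vq=1.0, w_task=1.0):
    """L = W_VQ * L_VQ_align + W_task * L_task."""
    return w_vq * vq_align + w_task * task


# Toy batch of B = 2 samples for a binary task.
targets = [[1.0, 0.0], [0.0, 1.0]]
predictions = [[0.8, 0.2], [0.4, 0.6]]
l_task = task_loss(targets, predictions)
l_total = total_loss(vq_align=0.0225, task=l_task, w_vq=0.5, w_task=1.0)
print(round(l_task, 4), round(l_total, 4))
```

The two weights let the alignment term act as a regularizer: a small W_VQ keeps the codebook aligned with the encoder outputs without dominating the task objective.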

Formula formula_12: ✗ ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ ReferAt (Schauerte & Fink, 2010) ✓ ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ IPO

Formula formula_13: ✗ ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ RefIt (Kazemzadeh et al., 2014) ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ RefCOCO (Yu et al., 2016) ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ RefCOCO+ (Yu et al., 2016) ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ RefCOCOg (Mao et al., 2016) ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ Flickr30k (Plummer et al., 2015) ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ GuessWhat? (De Vries et al., 2017) ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ Cops-Ref (Chen et al., 2020b) ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ CLEVR-Ref+ (Liu et al., 2019) ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗ DAQUAR (Malinowski et al., 2017) ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ FM-IQA

Formula formula_14: ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗ DVQA (Kafle et al., 2018) ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗ VQA (COCO)

Formula formula_15: ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗ Visual 7W (Zhu et al., 2016) ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗ KB-VQA (Wang et al., 2015) ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗ FBQA (Wang et al., 2017) ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗ VQA-MED (Hasan et al., 2018) ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✗ ✗ DocVQA (Mathew et al., 2021) ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗ YouRefIt (Chen et al., 2021) ✓ ✓ ✓ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗ GRiD-3D (Lee et al., 2022) ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗ EQA † (Das et al., 2018a) ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗ MT-EQA † (Das et al., 2018a) ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✗ CAESAR-L (Islam et al., 2022b) ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✓ ✓ CAESAR-XL (Islam et al., 2022b) ✓ ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✓ ✓ EQA-MX ‡ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
