MM-IGLU-IT: Multi-modal Interactive Grounded Language Understanding in Italian

Federico Borazio, Claudiu Daniel Hromei, Elisa Passone, Danilo Croce, Roberto Basili

Published: 01 Jan 2024, Last Modified: 22 Feb 2025, AI*IA 2024, CC BY-SA 4.0

Abstract: This paper explores Interactive Grounded Language Understanding (IGLU) within Human-Robot Interaction (HRI). Here, a robot interprets user commands related to its environment, determining if a specific command can be executed. When ambiguities or incomplete data arise, the robot asks relevant clarification questions. Current models, trained on English datasets, leverage multi-modal and end-to-end capabilities by fine-tuning architectures like LLaVA. These models combine a Visual Encoder, processing images of the environment, with Large Language Models (LLMs) encoding user requests, enabling agents to discern command executability and seek clarifications when necessary. While many LLMs are inherently multi-lingual, fine-tuning them on English-only datasets limits their application in other languages, such as Italian. To address this, we developed MM-IGLU-IT, a dataset for Multi-Modal Interactive Grounded Language Understanding in Italian. This dataset was created by automatically translating existing large-scale datasets and manually validating them for accuracy, resulting in over 6,800 command examples. Training a model like LLaVA, fine-tuned over a multi-lingual base model such as LLaMA2, allowed us to achieve comparable performance in both English and Italian. This resource is released on a dedicated GitHub page at https://github.com/crux82/MM-IGLU-IT and we hope it will advance multi-modal models in the Italian language, providing a valuable resource for ongoing research.