Abstract: Using Unmanned Aerial Vehicles (UAVs) for visual object search in indoor or urban environments is an active topic in embodied intelligence research. Significant challenges remain, however: agents struggle to understand their surroundings comprehensively, and their task planning and execution are not yet sufficiently intelligent. This study proposes a web-based, Multimodal Large Language Model (MLLM)-driven framework for human-AI collaborative UAV visual object search. Specifically, we develop a web platform whose ease of interaction, collaboration, and accessibility enables online access to MLLM-based agents and facilitates real-time human-AI cooperation. Additionally, we propose an online task planning method for MLLM-based agents and a dialogic human-AI collaboration approach based on web crowdsourcing, both intended to make visual object search task execution more effective. This research emphasizes the crucial bridging role of web engineering in connecting LLMs with human-AI collaborative systems, contributing to the interdisciplinary integration of artificial intelligence, embodied intelligence, and web engineering.
External IDs: dblp:conf/icwe/JiQZJ25