Language meets YOLOv8 for metric monocular SLAM

Published: 01 Jan 2023 · Last Modified: 24 Apr 2024 · J. Real Time Image Process. 2023 · CC BY-SA 4.0
Abstract: We present a new approach that combines spoken language and visual object detection to produce a depth image, enabling metric monocular SLAM in real time without a depth or stereo camera. We propose a methodology in which a compact matrix representation of the language and objects, together with a partitioning algorithm, resolves the association between the objects mentioned in the spoken description and the objects visually detected in the image. The spoken language is processed online using Whisper, a popular automatic speech recognition system, while the YOLOv8 network is used for object detection. Camera pose estimation and mapping of the scene are performed using ORB-SLAM. The full system runs in real time, allowing a user to explore the scene with a handheld camera, observe the objects detected by YOLOv8, and provide the depth of these objects with respect to the camera via a spoken description. We have performed experiments in indoor and outdoor scenarios, comparing the camera trajectory and map obtained with our approach against those obtained when using RGB-D images. Our results are comparable to the RGB-D baseline while preserving real-time performance.
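The core idea of the abstract can be illustrated with a minimal sketch: spoken mentions of objects (each carrying a distance) are associated with visual detections, and the matched distances are painted into an otherwise-empty depth image that a SLAM backend could consume. This is a hypothetical simplification, not the paper's method: the paper uses a compact matrix representation and a partitioning algorithm for the association, whereas here a greedy label match stands in for it, and the data structures (`detections`, `mentions`, bounding boxes) are illustrative assumptions.

```python
import numpy as np

def associate(detections, mentions):
    """Greedily pair spoken mentions with detections by object label.

    A stand-in for the paper's matrix-representation-plus-partitioning
    association step; each detection is consumed at most once.
    """
    pairs, used = [], set()
    for label, depth_m in mentions:
        for i, det in enumerate(detections):
            if i not in used and det["label"] == label:
                pairs.append((det, depth_m))
                used.add(i)
                break
    return pairs

def build_depth_image(shape, pairs):
    """Paint each matched object's spoken distance into its bounding box.

    Pixels outside any matched box stay 0, i.e. unknown depth.
    """
    depth = np.zeros(shape, dtype=np.float32)
    for det, depth_m in pairs:
        x1, y1, x2, y2 = det["bbox"]
        depth[y1:y2, x1:x2] = depth_m
    return depth

# Hypothetical YOLOv8-style detections and mentions parsed from an
# utterance such as "the chair is two meters away, the plant three and a half".
detections = [
    {"label": "chair", "bbox": (10, 20, 60, 90)},
    {"label": "plant", "bbox": (100, 30, 140, 80)},
]
mentions = [("chair", 2.0), ("plant", 3.5)]

depth = build_depth_image((120, 160), associate(detections, mentions))
```

In the full system this sparse depth image, paired with the RGB frame, would play the role of the RGB-D input that ORB-SLAM expects, which is what lets a monocular camera recover metric scale.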