Y-MAP-Net: Learning from Foundation Models for Real-Time, Multi-Task Scene Perception

Published: 02 Jun 2026, Last Modified: 02 Jun 2026, Greeks in AI 2026, CC BY 4.0
Keywords: real-time, monocular, depth estimation, normal estimation, semantic segmentation, image captioning, multi-task learning
Domains: Vision and Learning, Robotics
TL;DR: Y-MAP-Net distills 5 foundation models into one real-time CNN that simultaneously estimates depth, normals, pose, segmentation, and captions from RGB. It is the first method to perform all these tasks in real time. Accepted as an oral presentation at ICRA 2026.
External Link: https://ras.papercept.net/conferences/conferences/ICRA26/program/ICRA26_ContentListWeb_5.html#that3_05
Abstract: We present Y-MAP-Net, a Y-shaped neural network architecture designed for real-time multi-task learning on RGB images. In a single forward pass, Y-MAP-Net simultaneously predicts depth, surface normals, human pose, and semantic segmentation, and generates multi-label captions. To achieve this, we adopt a multi-teacher, single-student training paradigm in which task-specific foundation models supervise the network, allowing it to distill their capabilities into a unified real-time inference architecture. Y-MAP-Net exhibits strong generalization, architectural simplicity, and computational efficiency, making it well-suited for resource-constrained robotic platforms. By providing rich 3D, semantic, and contextual scene understanding from low-cost RGB cameras, Y-MAP-Net supports key robotic capabilities such as object manipulation and human–robot interaction. To encourage future research and reproducibility, we make our code publicly available at https://github.com/FORTH-ICS-CVRL-HCCV/Y-MAP-Net.
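As a rough illustration of the multi-teacher, single-student idea described in the abstract, the sketch below combines per-task losses computed against teacher outputs into one training objective. It is not taken from the Y-MAP-Net code: the task names, the equal weights, and the choice of L1 loss for the dense outputs and binary cross-entropy for the multi-label captions are all assumptions made for this example.

```python
import math

# Hypothetical task names and weights; the abstract does not specify them.
TASK_WEIGHTS = {
    "depth": 1.0,
    "normals": 1.0,
    "pose": 1.0,
    "segmentation": 1.0,
    "caption": 1.0,
}


def l1_loss(pred, target):
    """Mean absolute error between two equal-length flat lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def bce_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy for multi-label targets in {0, 1}."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def distillation_loss(student_out, teacher_out, weights=TASK_WEIGHTS):
    """Weighted sum of per-task losses against teacher pseudo-labels.

    Assumption: dense tasks use L1, the caption task uses multi-label BCE.
    """
    total = 0.0
    for task, weight in weights.items():
        loss_fn = bce_loss if task == "caption" else l1_loss
        total += weight * loss_fn(student_out[task], teacher_out[task])
    return total
```

In this sketch each entry of `teacher_out` would come from a different task-specific teacher model run on the same RGB image, while `student_out` holds the outputs of the single student network for that image.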
Submission Number: 109