OMNINAV: A UNIFIED FRAMEWORK FOR PROSPECTIVE EXPLORATION AND VISUAL-LANGUAGE NAVIGATION

Abstract

Embodied navigation is a foundational challenge for intelligent robots, demanding the ability to comprehend visual environments, follow natural language instructions, and explore autonomously. However, existing models struggle to provide a unified solution across heterogeneous navigation paradigms, often yielding low success rates and limited generalization. We present OmniNav, a unified framework that handles instruct-goal, object-goal, and point-goal navigation, as well as frontier-based exploration, within a single architecture. First, we introduce a lightweight, low-latency policy that predicts continuous-space waypoints (coordinates and orientations) with high accuracy, outperforming action-chunk methods in precision and supporting real-world deployment at control frequencies of up to 5 Hz. Second, at the architectural level, OmniNav adopts a fast-slow system design: a fast module generates waypoints from relatively short-horizon visual context and subtasks, while a slow module performs deliberative planning over long-horizon observations and candidate frontiers to select the next subgoal and subtask. This collaboration improves path efficiency and maintains trajectory coherence in exploration and memory-intensive settings. Notably, we find that the primary bottleneck lies not in navigation policy learning per se, but in robust understanding of general instructions and objects. To enhance generalization, we incorporate large-scale general-purpose training datasets including those used for image captioning and visual into a joint multi-task regimen, which substantially boosts success rates and robustness. Extensive experiments demonstrate state-of-the-art performance across diverse navigation benchmarks, and real-world deployment further validates the approach. OmniNav offers practical insights for embodied navigation and points to a scalable path toward versatile, highly generalizable robotic intelligence.
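To make the fast-slow division of labor concrete, the following is a minimal toy sketch, not OmniNav's implementation. It assumes a 2D world, a waypoint made of coordinates plus a heading, a slow module replaced by a nearest-frontier rule, and a fast module replaced by a bounded straight-line step. All names (`Waypoint`, `slow_select_subgoal`, `fast_next_waypoint`, `run_episode`) and parameters (`replan_every`, `max_step`) are hypothetical. At a 5 Hz control rate, each loop tick would correspond to 0.2 s.

```python
import math
from dataclasses import dataclass


@dataclass
class Waypoint:
    x: float
    y: float
    theta: float  # heading in radians


def slow_select_subgoal(pose, frontiers):
    """Stand-in for the slow module: pick the nearest candidate frontier."""
    return min(frontiers, key=lambda f: math.hypot(f[0] - pose.x, f[1] - pose.y))


def fast_next_waypoint(pose, subgoal, max_step=0.25):
    """Stand-in for the fast module: one bounded step toward the subgoal."""
    dx, dy = subgoal[0] - pose.x, subgoal[1] - pose.y
    dist = math.hypot(dx, dy)
    if dist < 1e-9:
        return pose  # already at the subgoal
    scale = min(1.0, max_step / dist)  # never overshoot the subgoal
    return Waypoint(pose.x + dx * scale, pose.y + dy * scale, math.atan2(dy, dx))


def run_episode(start, frontiers, replan_every=5, max_ticks=100, tol=1e-6):
    """Fast module runs every tick; slow module replans every `replan_every` ticks."""
    pose, subgoal, path = start, None, [start]
    for tick in range(max_ticks):
        if subgoal is None or tick % replan_every == 0:
            subgoal = slow_select_subgoal(pose, frontiers)
        pose = fast_next_waypoint(pose, subgoal)
        path.append(pose)
        if math.hypot(subgoal[0] - pose.x, subgoal[1] - pose.y) < tol:
            break  # subgoal reached
    return path


path = run_episode(Waypoint(0.0, 0.0, 0.0), [(1.0, 0.0), (5.0, 5.0)])
```

The point of the sketch is only the scheduling pattern: the inexpensive step function is called on every tick, while the more expensive subgoal selection is called at a lower rate and its result is reused in between.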

Chain-of-thought (CoT) reasoning by the slow-thinking system for exploration. For the “find the bathtub” task, the model reasons over the frontier set using memory and semantic priors (e.g., bathrooms are more likely near bedrooms and away from dining areas), iteratively generating subgoals for the fast system to execute.
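As a rough illustration of how memory and semantic priors can steer frontier selection, the sketch below replaces the model's chain-of-thought reasoning with a hand-written scoring rule. This is an assumption made purely for illustration, not the method used by OmniNav. The prior values, the `near` room labels, and the function names are all hypothetical.

```python
import math

# Hypothetical prior: how promising a frontier is for finding a bathtub,
# given the room type observed next to it. Values are illustrative only.
ROOM_PRIOR = {"bedroom": 0.8, "hallway": 0.5, "dining area": 0.1}


def score_frontier(frontier, robot_xy, dist_weight=0.05):
    """Semantic prior minus a travel-distance penalty."""
    prior = ROOM_PRIOR.get(frontier["near"], 0.3)  # default for unknown rooms
    dist = math.dist(frontier["xy"], robot_xy)
    return prior - dist_weight * dist


def choose_subgoal(frontiers, robot_xy, visited):
    """Pick the best-scoring frontier that memory has not already explored."""
    candidates = [f for f in frontiers if f["id"] not in visited]
    if not candidates:
        return None  # nothing left to explore
    return max(candidates, key=lambda f: score_frontier(f, robot_xy))


frontiers = [
    {"id": "A", "xy": (2.0, 0.0), "near": "dining area"},
    {"id": "B", "xy": (6.0, 0.0), "near": "bedroom"},
]
first = choose_subgoal(frontiers, (0.0, 0.0), visited=set())
second = choose_subgoal(frontiers, (0.0, 0.0), visited={"B"})
```

In this toy setup the farther frontier next to the bedroom is chosen over the closer one next to the dining area, and the `visited` set plays the role of memory by excluding frontiers that have already been explored.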

Real-World Deployment

multiview video

obj-goal: Find a girl wearing a pink T-shirt

obj-goal_long: Get out of the room and find me a water dispenser.

obj-goal_short: I want to take out the trash (find a trash can)

instruct-goal: Go into the first room on the left and find a chair

obj-goal: Find a trash can

obj-goal: Find a sofa

instruct-goal: Go forward to the first intersection, then turn left and find a trash can and park in front of it

point-goal: Avoid sofas and people

obj-goal: Find a vending machine

obj-goal: Avoid chairs in narrow spaces

instruct-goal: Go into the first room on the right and then find a boy. Stop in front of him.

point-goal: Local obstacle avoidance

Performance of the Slow System in Simulation

Explore the area until you locate a dishwasher. Stop when you've reached its location

Could you help me find a plant? Show me the way

Qualitative comparison between baselines and our approach
