Keywords: Geometric Chain-of-Thought Reasoning, Camera Trajectory Estimation, Vision-Language Models
TL;DR: A VLM trained to estimate camera pose trajectories through step-by-step geometric reasoning, achieving 6× better translation accuracy than Gemini 3.1 Pro.
Abstract: Embodied agents require metric spatial self-localization to act effectively, yet vision-language models (VLMs)---increasingly used as foundation models for such agents---consistently struggle with metric camera geometry.
We present EgoReasoner, a two-stage framework that trains VLMs to estimate camera pose trajectories via geometric chain-of-thought reasoning, combining supervised fine-tuning with GRPO reinforcement learning on geometry-grounded rewards.
Trained on 684,455 frames from RealEstate10K, our model achieves 3.2× and 1.7× higher structured-output parse rates than a base 8B VLM and Gemini 3.1 Pro, respectively, and surpasses Gemini 3.1 Pro in translation accuracy by more than 6×.
These results demonstrate that structured training can instill metric geometric reasoning in VLMs, advancing them toward spatially-aware foundation models for embodied agents.
Submission Number: 6