Recursive Autoregressive Depth Estimation with Continuous Token Modeling

10 Sept 2025 (modified: 21 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Autoregressive, Depth Estimation
Abstract: Monocular depth estimation is a cornerstone of robotic perception and computer vision, yet reconstructing 3-D structure from a single RGB image suffers from severe geometric ambiguity and uncertainty. Motivated by the recent success of autoregressive (AR) models in image generation, we introduce a Fractal Visual AR + Diffusion framework that predicts depth both accurately and efficiently. Conventional pixel-wise AR generation is too slow for robotic applications, so we design a coarse-to-fine, multi-scale autoregressive pipeline: the model first sketches a global depth map at low resolution and then refines it progressively to full pixel fidelity, greatly accelerating inference. To bridge the RGB–Depth modality gap, each scale incorporates a Visual-Conditioned Feature Refinement (VCFR) module that fuses multi-scale image features with the current depth prediction, explicitly injecting geometric and textural cues. Because discretising continuous depth values can cause information loss and unstable training, we adopt a conditional denoising diffusion loss that models depth distributions directly in continuous latent space, fundamentally avoiding quantisation errors. Although the visual AR–diffusion paradigm boosts accuracy, its layer-by-layer generation still introduces latency. To reclaim speed, we abstract the Visual AR unit into a reusable base generator and invoke it recursively, forming a self-similar fractal architecture that preserves modelling power while cutting the inference path.
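The coarse-to-fine recursion described above can be illustrated with a minimal sketch. The `base_generator` below is a hypothetical stand-in for the paper's Visual AR unit, and the feature blending is only a toy placeholder for the VCFR conditioning; the point is the self-similar control flow, where one generator is invoked repeatedly until the target resolution is reached.

```python
import numpy as np

def base_generator(depth_coarse, image_feat):
    # Hypothetical stand-in for the Visual AR unit: upsample the coarse
    # depth 2x (nearest-neighbour) and blend in image features as a toy
    # substitute for VCFR-style geometric/textural conditioning.
    up = depth_coarse.repeat(2, axis=0).repeat(2, axis=1)
    feat = image_feat[: up.shape[0], : up.shape[1]]
    return 0.9 * up + 0.1 * feat

def fractal_depth(image_feat, init_res=4, target_res=32):
    # Self-similar recursion: sketch a low-resolution depth map, then
    # invoke the same base generator at every scale until full fidelity.
    depth = np.zeros((init_res, init_res))
    while depth.shape[0] < target_res:
        depth = base_generator(depth, image_feat)
    return depth

feat = np.random.rand(32, 32)
d = fractal_depth(feat)
print(d.shape)  # (32, 32)
```

Because every scale reuses the same unit, the inference path grows only logarithmically with resolution (4 → 8 → 16 → 32 here), which is the speed argument the abstract makes for the fractal design.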
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3600