Abstract: We introduce CAPA, a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models (FMs) to depth completion using sparse geometric cues. Prior methods train task-specific encoders for auxiliary inputs, which risks degrading the pre-trained prior and limits generalization to new scenes or sensor configurations. CAPA instead freezes the FM backbone to preserve its strong geometric prior and updates only a small set of parameters via Parameter-Efficient Fine-Tuning (e.g., LoRA or VPT), guided directly by gradients computed from the sparse observations available at inference. This grounds the foundation model's geometric prior in scene-specific measurements, correcting distortions and misplaced structures. For videos, CAPA introduces sequence-level parameter sharing, which jointly adapts all frames to exploit temporal correlations, improve robustness, and enforce multi-frame consistency. CAPA is model-agnostic, compatible with ViT-based depth and multi-view FMs, and achieves state-of-the-art results across diverse condition patterns on both indoor and outdoor datasets.
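The core idea in the abstract (freeze the backbone, then fit a small low-rank update to the sparse depth observations at test time) can be illustrated with a toy sketch. This is not CAPA's implementation: the linear "backbone" `W`, the readout `v`, the rank-1 adapter `(b, a)`, the per-pixel features, and all hyperparameters below are hypothetical stand-ins chosen so the example runs with the Python standard library alone.

```python
# Toy sketch of test-time low-rank adaptation guided by sparse depth cues.
# All names, sizes, and hyperparameters are illustrative assumptions, not CAPA's code.

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def matvec(M, x):
    return [dot(row, x) for row in M]

def predict(W, v, b, a, feats):
    """Per-pixel depth v . ((W + b a^T) f). W and v are frozen; (b, a) is the adapter."""
    return [dot(v, matvec(W, f)) + dot(v, b) * dot(a, f) for f in feats]

def sparse_loss(pred, obs):
    """Mean squared error over observed pixels only; obs maps pixel index -> depth."""
    return sum((pred[i] - d) ** 2 for i, d in obs.items()) / len(obs)

def dense_error(pred, target):
    """Mean absolute error over all pixels (used for evaluation only)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def adapt(W, v, feats, obs, steps=500, lr=0.05):
    """Gradient descent on the rank-1 adapter (b, a); W and v are never modified."""
    b = [0.0] * len(v)         # zero init: the adapted model starts equal to the frozen one
    a = [0.1] * len(feats[0])  # small fixed init (deterministic stand-in for a random init)
    n = len(obs)
    for _ in range(steps):
        pred = predict(W, v, b, a, feats)
        vb = dot(v, b)
        grad_b = [0.0] * len(b)
        grad_a = [0.0] * len(a)
        for i, d in obs.items():        # gradients come from observed pixels only
            r = 2.0 * (pred[i] - d) / n
            af = dot(a, feats[i])
            for k in range(len(b)):
                grad_b[k] += r * v[k] * af
            for k in range(len(a)):
                grad_a[k] += r * vb * feats[i][k]
        b = [bk - lr * g for bk, g in zip(b, grad_b)]
        a = [ak - lr * g for ak, g in zip(a, grad_a)]
    return b, a

# Synthetic scene: 20 "pixels" with features [1, x]; the frozen model predicts
# 1 + 2x, while the true depth is 1.5 + 3x (a distorted prior).
N = 20
feats = [[1.0, i / (N - 1)] for i in range(N)]
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen backbone weights
v = [1.0, 2.0]                 # frozen readout
true_depth = [1.5 + 3.0 * f[1] for f in feats]
obs = {i: true_depth[i] for i in (0, 5, 12, 19)}  # sparse measurements

zero = [0.0, 0.0]
pred_before = predict(W, v, zero, zero, feats)
b, a = adapt(W, v, feats, obs)
pred_after = predict(W, v, b, a, feats)

print("sparse loss before/after:", sparse_loss(pred_before, obs), sparse_loss(pred_after, obs))
print("dense error before/after:", dense_error(pred_before, true_depth), dense_error(pred_after, true_depth))
```

Because the adapter is shared across all pixels, fitting it on four observed pixels also changes the prediction at the unobserved ones. In the same spirit, the sequence-level sharing described for videos could be mimicked in this toy by pooling the observations from several frames into a single `obs` set for one shared `(b, a)`; that reading is an assumption, not a detail taken from the abstract.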
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Alex_Wong2
Submission Number: 8001