Keywords: 3D scene representation, Single-view inference, NeRF, Reinforcement Learning
Abstract: Reinforcement learning (RL) has enabled robots to develop complex skills, but its success in image-based tasks often depends on effective representation learning. Prior work has primarily focused on 2D representations, often overlooking the inherent 3D geometric structure of the world, or has attempted to learn 3D representations that require extensive resources, such as synchronized multi-view images, even during deployment. To address these issues, we propose a novel RL framework that extracts 3D-aware representations from single-view RGB input, without requiring camera calibration information or synchronized multi-view images during downstream RL. Our method uses an autoencoder architecture, with a masked ViT as the encoder and a latent-conditioned NeRF as the decoder, trained with cross-view completion to capture fine-grained, 3D-geometry-aware representations. Additionally, we add a time-contrastive loss that further regularizes the learned representations to be consistent across viewpoints. Our method significantly improves the RL agent's performance on complex tasks, outperforming prior 3D-representation-based methods even when using only a single, uncalibrated camera during deployment.
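The exact form of the time-contrastive loss is not given in the abstract; below is a minimal sketch, assuming a standard InfoNCE-style formulation over embeddings of the same timestep seen from two viewpoints. The function name, temperature value, and batching convention are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def time_contrastive_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style time-contrastive loss (illustrative sketch).

    z_a, z_b: (B, D) embeddings of the same B timesteps captured from two
    different viewpoints. Matching rows (same timestep) are positives;
    all other pairs within the batch serve as negatives.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature  # (B, B) cosine-similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetrize over both matching directions (view A -> B and B -> A).
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```

Under this formulation, pulling together embeddings of the same timestep across viewpoints while pushing apart embeddings of different timesteps encourages viewpoint-consistent features, matching the regularization role described in the abstract.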
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8825