- Abstract: Learning object dynamics for model-based control usually involves choosing between two alternatives: (i) engineered 3D state representations composed of 3D object locations and poses, or (ii) learned 2D image representations trained end-to-end for the dynamics prediction task. The former requires laborious human annotation to extract 3D information from 2D images and does not permit end-to-end learning. The latter has not yet been shown to generalize across camera viewpoints or to handle camera motion and cross-object occlusions. We propose neural architectures that learn to disentangle an RGB-D video stream into camera motion and 3D scene appearance, and that capture the latter in 3D feature representations that can be trained end-to-end with 3D object detection and object motion forecasting. We feed object-centric 3D feature maps and the agent's actions into differentiable neural modules and learn to forecast 3D object motion. We empirically demonstrate that the proposed 3D representations learn object dynamics that generalize across camera viewpoints and can handle object occlusions. They do not suffer from error accumulation when unrolled over time, thanks to the permanence of object appearance in 3D. They outperform, by a margin, both learned 2D image representations and engineered 3D ones in forecasting object dynamics.