Keywords: predictive coding, path integration, SLAM, self-supervised learning
TL;DR: A neural network that traverses an environment and predicts visual observations represents its spatial position—without supervision of its own kinematics or the environment's coordinates.
Abstract: Humans navigate complex environments using only visual cues and self-motion. Mapping an environment is an essential task for navigation within a physical space; neuroscientists and cognitive scientists also postulate that mapping algorithms underlie cognition by mapping concepts, memories, and other nonspatial variables. Despite the broad importance of mapping algorithms in neuroscience, it is not clear how neural networks can build spatial maps exclusively from sensor observations without access to the environment’s coordinates through reinforcement learning or supervised learning. Path integration, for example, implicitly needs the environment’s coordinates to predict how past velocities translate into the current position. Here we show that predicting sensory observations—called predictive coding—extends path integration from implicitly requiring the environment’s coordinates. Specifically, a neural network constructs an environmental map in its latent space by predicting visual input. As the network traverses complex environments in Minecraft, spatial proximity between object positions affects distances in the network's latent space. The relationship depends on the uniqueness of the environment’s visual scene as measured by the mutual information between the images and spatial position. Predictive coding extends to any sequential dataset. Observations from paths traversing a manifold can generate such sequential data. We anticipate neural networks that perform predictive coding identify the underlying manifold without requiring the manifold’s coordinates.
In-person Presentation: yes