Keywords: unsupervised visual learning, variational inference, multi-agent RL, representation learning, object-centric learning, imitation learning from observation, slot attention
TL;DR: We learn agent-centric representations from pixels via variational inference over latent actions, achieving strong generalization across novel agents/goals/environments and emergent cognitive properties without supervision.
Abstract: We introduce Variational Agent Discovery (VAD), an unsupervised agent representation learning algorithm that discovers agent-centric representations directly from pixels. We frame agent representation learning as a prediction problem: inferring which latent actions explain the transitions of the latent variables used to model a scene. VAD leverages slot-based attention with a variational objective that jointly learns inverse dynamics (inferring actions from transitions), forward dynamics (predicting states from actions), and agent policies (distributions over actions). Without any supervision, VAD develops representations that generalize to novel agents and goals with minimal performance degradation. Our learned representations enable downstream tasks such as action prediction and goal inference. Notably, VAD exhibits shared action representations across multiple observed agents (feature dimensions that consistently activate for the same action regardless of which agent performs it) and demonstrates teleological reasoning capabilities similar to those reported in 12-month-old infants, suggesting that these cognitive phenomena can emerge from our unsupervised agent representation learning objective.
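The abstract's joint objective (inverse dynamics, forward dynamics, and a policy prior over latent actions) can be sketched as a single ELBO-style loss. The sketch below is illustrative only, under assumed simplifications the paper does not specify: Gaussian latent actions, a squared-error reconstruction term for the forward model, and generic callables standing in for the learned networks (`inverse_model`, `forward_model`, `policy` are hypothetical names, not the authors' API).

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, exp(logvar_q)) || N(mu_p, exp(logvar_p)) ), summed over dims.
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def vad_style_loss(s_t, s_next, inverse_model, forward_model, policy, rng):
    """Toy joint loss: reconstruction via forward dynamics plus a KL term
    pulling inverse-dynamics posteriors toward the policy prior."""
    # Inverse dynamics: q(a | s_t, s_next) as a diagonal Gaussian.
    mu_q, logvar_q = inverse_model(s_t, s_next)
    # Reparameterized sample of the latent action.
    a = mu_q + np.exp(0.5 * logvar_q) * rng.standard_normal(mu_q.shape)
    # Forward dynamics: reconstruct the next latent state from (s_t, a).
    s_pred = forward_model(s_t, a)
    recon = np.sum((s_pred - s_next) ** 2)
    # Policy prior p(a | s_t): KL regularizes the inferred actions.
    mu_p, logvar_p = policy(s_t)
    return recon + gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)

# Toy linear stand-ins for the learned modules (illustrative assumptions):
inverse_model = lambda s, sn: (sn - s, np.zeros_like(s))   # action = state delta
forward_model = lambda s, a: s + a                          # additive dynamics
policy = lambda s: (np.zeros_like(s), np.zeros_like(s))     # standard-normal prior

rng = np.random.default_rng(0)
s_t = rng.standard_normal(4)
loss = vad_style_loss(s_t, s_t + 1.0, inverse_model, forward_model, policy, rng)
```

With these toy modules the loss is a non-negative scalar; in the actual method all three modules would be trained jointly on slot-attention latents rather than raw state vectors.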
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 10823