Keywords: contrastive learning, object recognition, temporal coherence
TL;DR: We investigate a form of contrastive learning that maps temporally close views of an object onto nearby latent representations.
Abstract: Human infants learn to recognize objects largely without supervision. In machine learning, contrastive learning has emerged as a powerful form of unsupervised representation learning. The utility of the learned representations for downstream tasks depends strongly on the chosen augmentation operations. Taking inspiration from biology, we study a framework for unsupervised learning of object representations that we call Contrastive Learning Through Time (CLTT). CLTT simulates viewing sequences as an infant might experience them while interacting with objects, and it avoids arbitrary augmentation operations. Instead, positive pairs are formed from successive views in these unsegmented viewing sequences. Generating viewing sequences procedurally, rather than using natural videos, gives us perfect control over the temporal structure of the input and allows us to ask two questions. First, can CLTT approach the performance of fully supervised learning? Second, if so, what conditions must the temporal structure of the input satisfy? To answer these questions, we develop a new data set using a near-photorealistic training environment based on ThreeDWorld (TDW). We consider several state-of-the-art contrastive learning methods and demonstrate that CLTT yields linear classification performance approaching that of the fully supervised setting, provided that successive views are sufficiently likely to stem from the same object. We also consider the effect of one object being seen systematically before or after another object. We show that this leads to increased representational similarity between these objects, reminiscent of classic neurobiological findings. The data sets and code for this paper can be downloaded at: https://www.github.com/trieschlab/CLTT.
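To make the pairing rule concrete, the following is a minimal sketch of how positive pairs could be drawn from an unsegmented viewing sequence. It is not the code from the repository. The function names, the parameter `p_stay` (the probability that the next view shows the same object), and the uniform choice of the next object are all assumptions made for illustration. Views are represented only by their object index, and no images or encoder are involved.

```python
import random


def simulate_viewing_sequence(num_objects, length, p_stay, seed=0):
    """Return a list of object indices, one per time step.

    Hypothetical generator: at each step the viewer keeps looking at the
    current object with probability p_stay, otherwise it switches to a
    different object chosen uniformly at random (an assumption).
    """
    if num_objects < 2:
        raise ValueError("need at least two objects to allow switches")
    rng = random.Random(seed)
    current = rng.randrange(num_objects)
    sequence = [current]
    for _ in range(length - 1):
        if rng.random() >= p_stay:
            # Switch to any object other than the current one.
            others = [o for o in range(num_objects) if o != current]
            current = rng.choice(others)
        sequence.append(current)
    return sequence


def temporal_positive_pairs(sequence):
    """Pair each time step with its successor, ignoring object boundaries."""
    return [(t, t + 1) for t in range(len(sequence) - 1)]


def same_object_fraction(sequence):
    """Fraction of temporal positive pairs whose two views share an object."""
    pairs = temporal_positive_pairs(sequence)
    if not pairs:
        return 0.0
    matches = sum(sequence[i] == sequence[j] for i, j in pairs)
    return matches / len(pairs)


if __name__ == "__main__":
    for p_stay in (0.5, 0.9, 0.99):
        seq = simulate_viewing_sequence(num_objects=10, length=10_000, p_stay=p_stay)
        print(p_stay, round(same_object_fraction(seq), 3))
```

Because the sequence is unsegmented, some positive pairs straddle a switch between objects and therefore pair views of two different objects. In this sketch, `p_stay` controls how often that happens, which corresponds to the condition that successive views be sufficiently likely to stem from the same object.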