Abstract: Learning robust and effective representations of visual
data is a fundamental task in computer vision. Traditionally, this is achieved by training models with labeled data, which can be expensive to obtain. Self-supervised learning attempts to circumvent the requirement for labeled data
by learning representations from raw unlabeled visual data
alone. However, unlike humans, who obtain rich 3D information through their binocular vision and motion,
the majority of current self-supervised methods are tasked
with learning from monocular 2D image collections. This is
noteworthy, as shape-centric visual processing has been shown to be more robust than texture-biased
automated methods. Inspired by this, we propose a new
approach for strengthening existing self-supervised methods by explicitly enforcing a strong 3D structural prior on the model during training. Through experiments
across a range of datasets, we demonstrate that our 3D-aware
representations are more robust than conventional self-supervised baselines.