Abstract: We propose an approach to automatically extract the 3D pose of dogs from single-view RGB images using only synthetic data for training. Due to the lack of suitable 3D datasets, previous approaches have predominantly relied on weakly supervised methods with 2D supervision. While these approaches demonstrate promising results, depth ambiguities persist, indicating the neural network's limited understanding of the 3D environment. To tackle these depth ambiguities, we generate a synthetic 3D pose dataset (DigiDogs) by modifying the popular video game Grand Theft Auto. Additionally, to address the domain gap between synthetic and real data, we harness the generalisation capability of Meta's foundation model DINOv2 and fine-tune it for the task of 3D pose estimation. Through a combination of qualitative and quantitative analyses, we demonstrate the viability of estimating the 3D pose of dogs from real-world images using synthetic training data.
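To make the fine-tuning setup concrete, below is a minimal sketch (not the authors' code) of how a 3D keypoint regression head might be attached to a DINOv2 backbone and trained end to end. The joint count, head architecture, and the `DogPoseRegressor` class are illustrative assumptions; the `facebookresearch/dinov2` hub entry point is Meta's publicly documented one.

```python
# Hypothetical sketch: 3D pose regression on top of DINOv2.
# Assumptions: 26 dog joints and a single linear head (both illustrative).
import torch
import torch.nn as nn

class DogPoseRegressor(nn.Module):
    def __init__(self, num_joints: int = 26):
        super().__init__()
        # Load the small DINOv2 ViT backbone from Meta's public hub repo.
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        # Regress one (x, y, z) coordinate per joint from the CLS embedding.
        self.head = nn.Linear(self.backbone.embed_dim, num_joints * 3)
        self.num_joints = num_joints

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # DINOv2's forward pass returns the pooled CLS feature (B, embed_dim).
        feats = self.backbone(images)
        return self.head(feats).view(-1, self.num_joints, 3)

model = DogPoseRegressor()
# DINOv2 uses 14x14 patches, so input sides must be multiples of 14.
dummy = torch.randn(1, 3, 224, 224)
print(model(dummy).shape)  # torch.Size([1, 26, 3])
```

Because the backbone is not frozen here, an optimizer over `model.parameters()` would fine-tune DINOv2 jointly with the head, which is consistent with the abstract's description of fine-tuning the foundation model for pose estimation.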