Keywords: Representation Learning, Variational Autoencoder, Unsupervised Learning, Deep-Learning, Registration
TL;DR: We introduce a model-architecture based on conditional VAEs that can learn time-varying features, as for example image-transformations, efficiently.
Abstract: We present an architecture based on the conditional Variational Autoencoder to learn a representation
of transformations in time-sequence data. The model is constructed in a way that allows to identify sub-spaces of features indicating changes between frames without learning features that are constant within a time-sequence. Therefore, the approach disentangles content from transformations. Different model-architectures are applied to affine image-transformations on MNIST as well as a car-racing video-game task.
Results show that the model discovers relevant parameterizations, however, model architecture has a major impact on the feature-space. It turns out, that there is an advantage of only learning features describing change of state between images, over learning the states of the images at each frame. In this case, we do not only achieve higher accuracy but also more interpretable linear features. Our results also uncover the need for model architectures that combine global transformations with convolutional architectures.
Code: https://drive.google.com/drive/folders/1euBPj9DMHlHz6ueuHtiWrqU2KsKXfpVq?usp=sharing
Original Pdf: pdf
8 Replies
Loading