Mitigating Mode Collapse in Sequential Disentanglement via an Architecture Bias

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Unsupervised Learning, Sequential Disentanglement
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: One of the fundamental representation learning tasks is unsupervised sequential disentanglement, where latent representations of the inputs are decomposed into a single static factor and a sequence of dynamic factors. To extract this latent information, existing variational methods condition the static and dynamic codes on the entire input sequence. Unfortunately, these models often suffer from mode collapse, i.e., the dynamic vectors encode both static and dynamic information, leading to a non-meaningful static component. Attempts to alleviate this problem by reducing the dynamic dimension or adding mutual information loss terms achieve only partial success. Often, promoting a certain functionality in a model is better achieved via specific architectural biases than by incorporating additional loss terms. For instance, convolutional networks gain translation invariance through shared kernels, and attention models capture the underlying correspondence between source and target sentences. Inspired by these successes, we propose in this work a novel model that mitigates mode collapse by conditioning the static component on a single sample from the sequence and subtracting the resulting code from the dynamic factors. Remarkably, our variational model has fewer hyper-parameters than existing work, and it facilitates the analysis and visualization of disentangled latent data. We evaluate our work on multiple data-modality benchmarks including general time series, video, and audio, and we show results beyond the state of the art on generation and prediction tasks in comparison to several strong baselines.
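The abstract describes the proposed architectural bias only at a high level; the sketch below is one possible reading of it, not the authors' implementation. It assumes a PyTorch-style encoder in which the static code is produced from a single frame and then subtracted from per-step dynamic codes; all module names, dimensions, and the choice of the first frame are hypothetical.

```python
# Minimal sketch of the described bias (assumptions: PyTorch, a GRU dynamic
# encoder, an MLP static encoder, and matching static/dynamic dimensions so
# the subtraction is well defined -- none of these details come from the paper).
import torch
import torch.nn as nn

class StaticFromSingleFrame(nn.Module):
    """Static code s is conditioned on one sample only; dynamic codes d_1..d_T
    are computed per time step, and s is subtracted from each of them."""
    def __init__(self, x_dim=64, s_dim=16, d_dim=16, h_dim=128):
        super().__init__()
        self.static_enc = nn.Sequential(
            nn.Linear(x_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, s_dim)
        )
        self.dyn_rnn = nn.GRU(x_dim, h_dim, batch_first=True)
        self.dyn_head = nn.Linear(h_dim, d_dim)

    def forward(self, x):                # x: (batch, T, x_dim)
        s = self.static_enc(x[:, 0])     # static code from a single sample (here: first frame)
        h, _ = self.dyn_rnn(x)           # per-step hidden states: (batch, T, h_dim)
        d = self.dyn_head(h)             # raw dynamic codes: (batch, T, d_dim)
        d = d - s.unsqueeze(1)           # subtract the static code from the dynamic factors
        return s, d

# Usage on random data
model = StaticFromSingleFrame()
x = torch.randn(8, 20, 64)
s, d = model(x)
print(s.shape, d.shape)                  # torch.Size([8, 16]) torch.Size([8, 20, 16])
```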
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7066