Keywords: Symbolic autoencoding, self-supervised learning, discrete auto-encoding, discrete representation learning, straight-through gradient estimation
TL;DR: A paradigm based on straight-through gradient estimation for connecting sequence-to-sequence models and training them end-to-end, similar to an autoencoder whose hidden representation is discrete and sequential, i.e., sentences from a language.
Abstract: Traditional language models (LMs) excel at next-token prediction in text sequences but often struggle with transduction
tasks involving distinct symbolic systems, particularly when parallel data is scarce or nonexistent. This issue is
even more pronounced in domains dealing with complex, non-natural language sequences, such as audio signals,
protein structures, or biological sequences, where the strengths of LMs in natural language do not directly translate.
To address this challenge, we introduce symbolic autoencoding ($\Sigma$AE), a self-supervised framework
designed to exploit the wealth of non-parallel data alongside limited parallel data. $\Sigma$AE connects two
generative models through a discrete bottleneck layer and optimizes the entire system end-to-end: an unsupervised
reconstruction loss on all data trains the system so that the sequence generated at the discrete bottleneck can be
read out as the transduced input sequence, while a supervised loss trains the two models separately on the subset
of labeled parallel data. To allow optimization in the presence of discrete symbols, we use a family of straight-through
gradient estimators. We demonstrate the effectiveness of $\Sigma$AE on four sequence-to-sequence
transduction tasks, showing that it significantly outperforms strong baselines in weakly supervised settings.
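
The straight-through estimators mentioned in the abstract can be illustrated with a minimal sketch: the forward pass emits a hard, discrete choice (here an argmax one-hot), while the backward pass routes gradients through the soft distribution as if no discretization had occurred. All names below are illustrative and are not the paper's actual implementation.

```python
import numpy as np

def ste_forward(logits):
    """Forward pass: hard one-hot selection (the discrete symbol),
    plus the cached softmax distribution used for the backward pass."""
    one_hot = np.zeros_like(logits)
    one_hot[np.argmax(logits)] = 1.0
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    return one_hot, probs

def ste_backward(grad_output, probs):
    """Backward pass: pretend the forward was the softmax, i.e. use the
    softmax Jacobian-vector product instead of the (zero) argmax gradient."""
    dot = np.dot(grad_output, probs)
    return probs * (grad_output - dot)
```

In an autograd framework the same trick is usually written as `soft + stop_gradient(hard - soft)`, so the hard symbol flows forward while the soft gradient flows backward.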
Submission Number: 32