Keywords: Identifiability, Representation learning, Language embedding, CEBRA, Contrastive learning
Abstract: This work aims to identify the latent structure of high-dimensional language embeddings by applying a theoretically grounded framework for identifiable representation learning.
Prior studies have shown that linear ICA can transform embeddings into spaces with semantically meaningful axes.
As a natural extension, nonlinear ICA has been proposed to recover latent structure that arises from nonlinear mixing in the data-generating process.
We adopt CEBRA, a contrastive learning framework grounded in nonlinear ICA theory,
which ensures identifiability of the latent structure up to linear transformations by leveraging auxiliary variables.
In our preliminary experiments on an emotion-labeled text dataset, where we use the emotion labels as auxiliary variables,
the resulting CEBRA embeddings form a low-dimensional space in which the emotion classes are linearly separable.
Moreover, across random initializations, the learned embeddings are consistent up to a linear transformation, empirically supporting the practical identifiability of the representation.
We discuss open questions regarding the interpretation of the label-related latent representations and future directions, including their potential alignment with human neural processing.
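A minimal sketch of the kind of pipeline the abstract describes, using the public `cebra` package's scikit-learn-style API: fit CEBRA on precomputed language embeddings with discrete emotion labels as the auxiliary variable, then probe linear separability of the learned space with a linear classifier. The variable names (`sentence_embeddings`, `emotion_labels`), the placeholder random data, and all hyperparameters are illustrative assumptions, not the submission's actual setup.

```python
# Sketch (not the authors' exact pipeline): CEBRA with discrete emotion labels,
# followed by a linear-separability probe on the resulting latents.
import numpy as np
from cebra import CEBRA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder inputs: an (n_samples, d) array of language-model embeddings and
# an (n_samples,) integer array of emotion labels. Replace with real data.
rng = np.random.default_rng(0)
sentence_embeddings = rng.normal(size=(2000, 768)).astype("float32")
emotion_labels = rng.integers(0, 6, size=2000)

# Hyperparameters are illustrative only.
model = CEBRA(
    output_dimension=8,
    batch_size=512,
    max_iterations=5000,
    verbose=False,
)
model.fit(sentence_embeddings, emotion_labels)   # discrete auxiliary variable
latents = model.transform(sentence_embeddings)   # low-dimensional CEBRA space

# Linear-separability probe: a linear classifier on the CEBRA latents.
z_train, z_test, y_train, y_test = train_test_split(
    latents, emotion_labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(z_train, y_train)
print("linear probe accuracy:", probe.score(z_test, y_test))
```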
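The consistency claim (identifiability up to a linear transformation across random initializations) can be checked with a plain least-squares map between two independent fits; the R-squared of that map is one rough consistency measure. Again, data and hyperparameters below are placeholders.

```python
# Sketch: fit CEBRA twice from independent random initializations and test
# whether the two latent spaces agree up to an affine transformation.
import numpy as np
from cebra import CEBRA

# Placeholder data (hypothetical names, not from the paper).
rng = np.random.default_rng(1)
sentence_embeddings = rng.normal(size=(2000, 768)).astype("float32")
emotion_labels = rng.integers(0, 6, size=2000)

def fit_latents():
    """One independent CEBRA fit; weights are re-initialized on every call."""
    m = CEBRA(output_dimension=8, batch_size=512, max_iterations=1000, verbose=False)
    m.fit(sentence_embeddings, emotion_labels)
    return m.transform(sentence_embeddings)

z_a, z_b = fit_latents(), fit_latents()

# Fit an affine map z_a -> z_b and report the fraction of variance it explains.
A = np.hstack([z_a, np.ones((len(z_a), 1))])   # add intercept column
W, *_ = np.linalg.lstsq(A, z_b, rcond=None)
residual = z_b - A @ W
r2 = 1.0 - residual.var() / z_b.var()
print("linear-consistency R^2 between runs:", round(float(r2), 3))
```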
Submission Number: 161