Abstract: We introduce Variational Joint Embedding (VJE), a reconstruction-free latent-variable framework for non-contrastive self-supervised learning in representation space. VJE maximizes a symmetric conditional evidence lower bound (ELBO) on paired encoder embeddings by defining a conditional likelihood directly on target representations, rather than optimizing a pointwise compatibility objective. The likelihood is instantiated as a heavy-tailed Student--\(t\) distribution on a polar representation of the target embedding, where a directional--radial decomposition separates angular agreement from magnitude consistency and mitigates norm-induced pathologies. The directional factor operates on the unit sphere, yielding a valid variational bound for the associated spherical subdensity model. An amortized inference network parameterizes a diagonal Gaussian posterior whose feature-wise variances are shared with the directional likelihood, yielding anisotropic uncertainty without auxiliary projection heads. Across ImageNet-1K, CIFAR-10/100, and STL-10, VJE is competitive with standard non-contrastive baselines under linear and \(k\)-NN evaluation, while providing probabilistic semantics directly in representation space for downstream uncertainty-aware applications. We validate these semantics through out-of-distribution detection, where representation-space likelihoods yield strong empirical performance. These results position the framework as a principled variational formulation of non-contrastive learning, in which structured feature-wise uncertainty is represented directly in the learned embedding space.
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: R1 - Reformulated the directional construction intrinsically on the sphere, with a normalized reference model and a subnormalized directional term used in training; the ELBO is now stated for the associated subprobability model. All carried-over experiments were rerun from scratch under the revised formulation rather than being retained from earlier versions. Additionally:
- Clarified fixed-observation semantics and the use of stop-gradient versus EMA target encoders.
- Replaced the main uncertainty experiment with OOD detection on CIFAR-10 across multiple near- and far-OOD datasets, based on the OpenOOD benchmark.
- Added posterior-geometry analysis, including visualizations and quantitative margin/radius diagnostics.
- Expanded the empirical evaluation across ImageNet-1K, CIFAR-10, CIFAR-100, and STL-10, including stop-gradient and EMA variants.
- Added reproduced VI-SimSiam comparisons and broader OOD benchmark context.
- Added supplementary diagnostics and ablations, including Monte Carlo sampling, isotropic variance failure, factorized vs. standard likelihood analysis, and compute profiling.
- Expanded implementation details, reproducibility notes, and limitations/future directions.
R2 - No material changes; updates are purely editorial:
- Refined Section 5 for clarity and coherence, reducing redundancies and improving presentation.
- Improved the interpretation in the geometry and OOD analyses and clarified baseline comparisons.
R3 - Minor editorial refinements; no material changes to claims, methodology, or results:
- Fixed a misstated intuition for the exp-map Jacobian term in Section 3 and clarified the small-angle limit under which the cosine-alignment reduction holds in Appendix B.
- Corrected an equation reference in Appendix C.2 (the Monte Carlo approximation targets the NLL of Eq. (4), not the ELBO).
- Fixed citation formatting in Appendices A.1 and B.
- Unified the bolding convention across Tables 4, 5, and 9 to indicate the column best and entries within one standard deviation.
Code: https://github.com/aoji/vje
Assigned Action Editor: ~Ole_Winther1
Submission Number: 7339
Loading