Abstract: Learning well-separated features in high-dimensional spaces, such as text or image embeddings,
is crucial for many machine learning applications. Achieving such separation can be effectively
accomplished through the dispersion of embeddings, where unrelated vectors are pushed
apart as much as possible. By constraining features to lie on a hypersphere, we can connect
dispersion to well-studied problems in mathematics and physics, where optimal solutions are
known only in a limited set of low-dimensional cases. However, in representation learning we typically deal
with a large number of features in high-dimensional space, and moreover, dispersion is usually
traded off with some other task-oriented training objective, making existing theoretical and
numerical solutions inapplicable. Therefore, it is common to rely on gradient-based methods
to encourage dispersion, usually by minimizing some function of the pairwise distances.
In this work, we first give an overview of existing methods from disconnected strands of the literature,
making new connections and highlighting similarities. Next, we introduce several new angles.
We propose to reinterpret pairwise dispersion using a maximum mean discrepancy (MMD)
motivation. We then propose an online variant of the celebrated Lloyd’s algorithm, of
K-Means fame, as an effective alternative regularizer for dispersion on generic domains.
Finally, we revise and empirically assess sliced regularizers that directly exploit properties
of the hypersphere, proposing a new, simple but effective one. Our experiments show the
importance of dispersion in image classification and natural language processing tasks, and
how the algorithms exhibit different trade-offs in different regimes.
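
The abstract describes gradient-based dispersion obtained by minimizing some function of the pairwise distances, later reinterpreted through an MMD lens. As a rough, generic illustration of that family of regularizers (not the paper's exact formulation; the Gaussian kernel, the bandwidth `t`, and the function name are assumptions), a kernel-based pairwise penalty on the hypersphere might look like the following sketch:

```python
import torch
import torch.nn.functional as F

def pairwise_dispersion_loss(x: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Generic pairwise dispersion penalty on the unit hypersphere.

    Projects embeddings onto the sphere, then penalizes large pairwise
    similarities via a Gaussian kernel of the Euclidean distances, in the
    spirit of kernel/MMD-motivated uniformity objectives.
    """
    x = F.normalize(x, dim=-1)              # constrain features to the hypersphere
    sq_dists = torch.cdist(x, x).pow(2)     # pairwise squared Euclidean distances
    n = x.shape[0]
    mask = ~torch.eye(n, dtype=torch.bool, device=x.device)
    # Log of the mean kernel value over distinct pairs; smaller = more dispersed.
    return torch.log(torch.exp(-t * sq_dists[mask]).mean())

# Typical usage: add as an auxiliary term to a task-oriented training objective,
#   loss = task_loss + lam * pairwise_dispersion_loss(embeddings)
```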
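The online Lloyd's / K-Means building block the abstract alludes to can be recalled with a minimal MacQueen-style sketch. How the paper turns such centroid updates into a dispersion regularizer is not reproduced here; the class name, learning rate, and spherical re-projection are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class OnlineSphericalKMeans:
    """Minimal online (MacQueen-style) K-Means with centroids kept on the unit sphere."""

    def __init__(self, k: int, dim: int, lr: float = 0.1):
        self.centroids = F.normalize(torch.randn(k, dim), dim=-1)
        self.lr = lr

    @torch.no_grad()
    def update(self, batch: torch.Tensor) -> torch.Tensor:
        batch = F.normalize(batch, dim=-1)
        # Assign each point to its nearest centroid (maximum cosine similarity).
        assign = (batch @ self.centroids.T).argmax(dim=1)
        for j in assign.unique():
            mean_j = batch[assign == j].mean(dim=0)
            # Move the centroid toward the batch mean, then re-project to the sphere.
            self.centroids[j] = F.normalize(
                (1 - self.lr) * self.centroids[j] + self.lr * mean_j, dim=-1
            )
        return assign
```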
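Sliced regularizers exploit the hypersphere by working with one-dimensional projections of the embeddings along random directions. The sketch below shows a generic sliced objective that matches projected embeddings to projections of a fresh uniform sample on the same sphere; it is not the new regularizer proposed in the paper, and the slice count and Wasserstein-style matching of sorted projections are assumptions.

```python
import torch
import torch.nn.functional as F

def sliced_dispersion_loss(x: torch.Tensor, n_slices: int = 64) -> torch.Tensor:
    """Generic sliced dispersion penalty on the unit hypersphere."""
    x = F.normalize(x, dim=-1)              # embeddings on the unit sphere
    # Random projection directions, themselves uniform on the sphere.
    dirs = F.normalize(torch.randn(n_slices, x.shape[1], device=x.device), dim=-1)
    proj = x @ dirs.T                       # (n, n_slices) projected embeddings
    # Reference: projections of a uniform sample of the same size on the sphere.
    ref = F.normalize(torch.randn_like(x), dim=-1) @ dirs.T
    # 1-D Wasserstein-2 per slice via sorted projections, averaged over slices.
    return (proj.sort(dim=0).values - ref.sort(dim=0).values).pow(2).mean()
```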
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/ltl-uva/ledoh-torch
Assigned Action Editor: ~Rémi_Flamary1
Submission Number: 4726