Analyzing the Impact of Learnable Softmax Temperature in Contrastive Visual-Textual Alignment Systems: Benefits, Drawbacks, and Alternative Approaches

TMLR Paper 2663 Authors

10 May 2024 (modified: 31 May 2024) · Under review for TMLR · CC BY-SA 4.0
Abstract: This work does NOT follow the usual “fabricate motivation - propose something - obtain SOTA results” template. Instead, we analyze the learnable softmax temperature parameter in the practical training of contrastive visual-textual alignment models (commonly referred to as “CLIP” models). This parameter is widely considered imperative for good system performance, yet its working mechanism and possible drawbacks have long been neglected. This study addresses that gap and offers a novel solution that leverages the structure of Vision Transformers (ViTs). Our argument centers on the pivotal role of the softmax temperature in handling noisy training data. We show that there exists an equilibrium in the gradient of the contrastive loss, in which the temperature parameter serves as a distance scaling factor; without it, the model has trouble aligning positive pairs due to a numerical problem in the loss term. Conversely, we also show that a large temperature can produce unstable learning dynamics. We then identify alternative approaches that mitigate the problem from a topological view of the contrastive loss. Finally, we capitalize on multiple class tokens embedded within the transformer architecture to offer a concise solution. This configuration significantly boosts zero-shot classification performance, improving baseline CLIP models pretrained on large-scale datasets by an average of 6.1%. The code and learned weights are provided at https://github.com/{Anonymous_authors}.
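For readers unfamiliar with the setup the abstract refers to: in CLIP-style training, the learnable temperature enters the symmetric contrastive (InfoNCE) objective as a scale on cosine similarities between normalized image and text embeddings. A minimal NumPy sketch of this standard loss (function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, log_temp):
    """Symmetric contrastive loss with a learnable log-temperature.

    img_emb, txt_emb: (batch, dim) arrays of paired embeddings.
    log_temp: scalar; the effective logit scale is exp(log_temp).
    """
    # L2-normalize embeddings onto the unit hypersphere
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # Cosine similarities, scaled by the (inverse) temperature
    logits = (img @ txt.T) * np.exp(log_temp)

    # Matched pairs sit on the diagonal
    labels = np.arange(len(img))

    def xent(l):
        # Numerically stable softmax cross-entropy over rows
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # Average of image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With a large logit scale (small temperature), the softmax sharpens and matched pairs can dominate; with scale 1 the cosine similarities are bounded in [-1, 1], which is the numerical difficulty in aligning positive pairs that the abstract alludes to.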
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=zNRGSv7WKm
Changes Since Last Submission:
- We update the abstract and introduction of the manuscript to give a clear overview to readers who are not familiar with this topic.
- We correct a crucial typo in Table 2: our proposed method should be CLIP[Multi(32,16)] (16 [CLS] tokens) instead of (16, 32).
- Clarify our first contribution w.r.t. the role of the learnable softmax temperature.
- Re-design the figures for code and visualization (former Fig. 1 & 3), moving part of them to the Appendix.
- Update the description of the oblique manifold, and add more details of its properties related to this study.
- Add a figure (Figure 1) to illustrate the ratio of alignment loss to uniformity loss under different temperatures.
- Add a figure (Figure 2) to illustrate the bounded negative distance problem.
- Add a figure (Figure 4) to illustrate the impact of temperature on the loss landscape.
- Add a figure (Figure 5) to illustrate the overall system of the vanilla CLIP and the multi-token CLIP.
- Update Section 3.1 (Methodology) to provide more implementation details of the proposed approach.
- Clarify the discussion made in Section 3.2.iii.
- Update the details of hyper-parameters in Section A.1, providing more discussion on the selection of hyper-parameters.
- Add experiment results on ECCV datasets in Section A.5.
- Clarify the discussion of the mixture-of-experts hypothesis in Section A.7.
- Fix typos, the name of the MSCOCO dataset, and other format problems.
- Add a short survey on noisy visual-textual correspondences in Section A.9.
Assigned Action Editor: ~Simon_Kornblith1
Submission Number: 2663