Analyzing the Impact of Learnable Softmax Temperature in Contrastive Visual-Textual Alignment Systems: Benefits, Drawbacks, and Alternative Approaches
Abstract: This work does NOT read like “fabricate motivation - propose something - obtain sota results”. Instead, we provide an in-depth analysis of the learnable softmax temperature parameter in the practical training of contrastive visual-textual alignment models, commonly known as CLIP models. This parameter is critical for optimal system performance, yet its mechanism and potential drawbacks have been largely overlooked. Our study addresses this gap and proposes a novel solution by utilizing the architecture of Vision Transformers (ViTs). We focus on the crucial role of the softmax temperature in managing noisy training data. We demonstrate that there is a balance in the gradient of the contrastive loss, with the temperature parameter acting as a distance scaling factor. If the temperature is too low, the model struggles to align positive pairs due to numerical issues in the loss term; conversely, an overly high temperature can lead to unstable learning dynamics. We explore alternative approaches to mitigate this problem from a topological perspective of the contrastive loss. Ultimately, we leverage multiple class tokens embedded within the transformer architecture to present a concise solution. This configuration significantly enhances zero-shot classification performance, improving baseline CLIP models pretrained on large-scale datasets by an average of 6.1%.
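For readers unfamiliar with where the temperature enters, the sketch below shows the standard symmetric contrastive loss used by CLIP-style models with the learnable temperature exposed as a scalar parameter. It is a generic illustration, not the paper's implementation; names such as `clip_contrastive_loss`, `image_emb`, `text_emb`, and `logit_scale` are placeholders.

```python
# Minimal sketch of the symmetric CLIP contrastive loss with a learnable
# softmax temperature (illustrative only, not the authors' code).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    """image_emb, text_emb: (N, D) L2-normalized embeddings of N matched pairs.
    logit_scale: learnable scalar, typically parameterized as log(1/temperature)."""
    # Cosine similarities scaled by the inverse temperature; this is the
    # "distance scaling" role of the temperature discussed in the paper.
    logits_per_image = logit_scale.exp() * image_emb @ text_emb.t()  # (N, N)
    logits_per_text = logits_per_image.t()
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    # Symmetric cross-entropy: each image should match its own text and vice versa.
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return 0.5 * (loss_i + loss_t)

# The temperature is usually learned jointly with the encoders, e.g.
# logit_scale = torch.nn.Parameter(torch.log(torch.tensor(1.0 / 0.07)))
```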
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=zNRGSv7WKm
Changes Since Last Submission:
- We update the abstract and introduction of the manuscript to give a clear overview to readers who are not familiar with this topic.
- We correct a crucial typo in Table 2: our proposed method should be CLIP[Multi(32,16)] (16 [CLS] tokens) instead of (16, 32).
- We clarify our first contribution w.r.t. the role of the learnable softmax temperature.
- We re-design the figures for code and visualization (former Fig. 1 & 3), moving part of them to the Appendix.
- We update the description of the oblique manifold and add more details on its properties relevant to this study.
- We add a figure (Figure 1) to illustrate the ratio of alignment loss to uniformity loss under different temperatures.
- We add a figure (Figure 2) to illustrate the bounded negative distance problem.
- We add a figure (Figure 4) to illustrate the impact of temperature on the loss landscape.
- We add a figure (Figure 5) to illustrate the overall system of vanilla CLIP and the multi-token CLIP.
- We update Section 3.1 (Methodology) to provide more implementation details of the proposed approach.
- We clarify the discussion in Section 3.2.iii.
- We update the hyper-parameter details in Section A.1, providing more discussion on the selection of hyper-parameters.
- We add experimental results on the ECCV datasets in Section A.5.
- We clarify the discussion of the mixture-of-experts hypothesis in Section A.7.
- We fix typos, the name of the MSCOCO dataset, and other formatting problems.
- We add a brief survey of noisy visual-textual correspondences in Section A.9.
During the rebuttal phase:
- We provide detailed motivation and discussion when we introduce the product sphere configuration in Section 2.
- We provide mathematical background on the product-sphere (oblique) topology and geodesic distance in the Appendix.
- We add more information on the computational complexity and discuss how to choose the optimal number of spheres in the Appendix.
- In the mixture-of-experts part, we give an intuitive discussion of the relationship between the multiple class tokens.
- We rewrite Section 2.2 ('Relaxed' triangular inequality) to address the mentioned problems.
- We add a discussion about the datasets (ours and OpenCLIP's) in Section 4.1.
- We provide more experimental results in the Appendix.
In the camera-ready version:
- We clarify that the proposed conditions are not necessary and sufficient for training a CLIP system; rather, they are used to analyze the temperature parameter. We update the corresponding sections to make this distinction clearer.
- We include a brief explanation of the inspiration behind the equilibrium conditions before Section 2.2.
- We update the explanation of the toy experiment design to better focus on the temperature parameter.
- We correct the citation format.
- We revise the writing to make the language sound more natural to the best of our ability.
Code: https://github.com/minogame/clip-mtob
Assigned Action Editor: ~Simon_Kornblith1
Submission Number: 2663