When Does Margin Clamping Affect Training Variance? Dataset-Dependent Effects in Contrastive Forward-Forward Learning

TMLR Paper 7400 Authors

08 Feb 2026 (modified: 23 Feb 2026) · Withdrawn by Authors · CC BY 4.0
Abstract: Contrastive Forward-Forward (CFF) learning is a layer-local alternative to backpropagation that trains Vision Transformers using supervised contrastive objectives at each layer independently. In practice, CFF can exhibit substantial seed-to-seed variability, complicating reproducibility and hyperparameter selection. We audit one implementation detail inside the supervised contrastive loss: applying the positive-pair margin via saturating similarity clamping, $\min(s + m, 1)$. We compare this against a post-log-probability subtraction reference that we prove is gradient-neutral under the mean-over-positives reduction (Proposition 4.1), thereby isolating the effect of saturation itself. On CIFAR-10, in a $2 \times 2$ factorial ablation ($n = 7$ seeds per cell), the clamped variant exhibits $5.90\times$ higher pooled test-accuracy variance ($p = 0.003$, bootstrap 95% CI $[1.62, 15.80]$) with no detectable difference in mean accuracy. Clamp activation rates (CAR), layerwise gradient norms, and a reduced-margin dose-response probe jointly indicate that this variance increase is associated with gradient truncation in early transformer layers. However, the effect is dataset-dependent: replication on CIFAR-100 ($\mathrm{VR} = 0.39\times$), SVHN ($\mathrm{VR} = 0.25\times$), and Fashion-MNIST ($\mathrm{VR} = 0.08\times$, $p = 0.029$) reveals inverted variance ratios in all three cases. Cross-dataset analysis identifies the layer-0 clamp activation rate as a necessary but insufficient condition for variance inflation: CIFAR-10's high L0 CAR (60.7%) co-occurs with the only elevated variance ratio, while CIFAR-100's low L0 CAR (29.0%) and SVHN/Fashion-MNIST's high task accuracy ($>92\%$) each independently suppress the effect.
An SVHN difficulty sweep confirms this interaction: increasing augmentation difficulty on the same dataset drives the variance ratio from $0.25\times$ to $16.73\times$. These results characterize the conditions under which margin clamping destabilizes CFF training and offer practical guidance for practitioners.
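The two margin placements contrasted in the abstract can be sketched as follows. This is a minimal PyTorch illustration, not the authors' implementation: the function name `supcon_margin_loss`, the default hyperparameters, and the assumption that the clamp $\min(s + m, 1)$ acts on raw cosine similarities before temperature scaling are all ours. The `clamped=False` branch realizes the post-log-probability reference, which only shifts the positive log-probabilities by a constant and therefore leaves gradients unchanged (the gradient-neutrality the paper proves in Proposition 4.1); `clamped=True` saturates positive similarities at 1, truncating their gradients wherever $s + m > 1$.

```python
import torch
import torch.nn.functional as F

def supcon_margin_loss(features, labels, margin=0.2, temperature=0.1,
                       clamped=True):
    """Supervised contrastive loss with mean-over-positives reduction.

    clamped=True  : saturating clamp min(s + m, 1) on positive-pair cosine
                    similarities before temperature scaling (assumed placement).
    clamped=False : post-log-probability reference; subtracts the constant
                    margin / temperature from the log-probabilities, which
                    does not depend on the parameters and so is gradient-neutral.
    """
    n = features.size(0)
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.T                          # cosine similarities in [-1, 1]

    eye = torch.eye(n, dtype=torch.bool)
    pos_mask = (labels[:, None] == labels[None, :]) & ~eye

    if clamped:
        # saturating margin: gradient through sim is zeroed where s + m > 1
        sim = torch.where(pos_mask, (sim + margin).clamp(max=1.0), sim)

    logits = sim / temperature
    logits = logits.masked_fill(eye, float('-inf'))   # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    if not clamped:
        # constant shift of every log-probability: gradient-neutral
        log_prob = log_prob - margin / temperature

    # mean over positives, then over anchors with at least one positive
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(1)
    return (-pos_log_prob[valid] / pos_counts[valid]).mean()
```

Because the reference variant adds the same constant to every positive's log-probability, its loss differs from the margin-free loss by exactly `margin / temperature` while its gradient is identical, which is what isolates the saturation effect in the clamped variant.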
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Martin_Mundt1
Submission Number: 7400