Contrastive Learning for Test-Time Training Layers

TMLR Paper 6157 Authors

09 Oct 2025 (modified: 15 Oct 2025) · Under review for TMLR · CC BY 4.0
Abstract: Transformers have become the predominant architecture for sequence modeling, achieving state-of-the-art performance across natural language processing and other sequential domains. Despite their success, the quadratic complexity of self-attention imposes substantial computational and memory costs on long-context tasks. Alternative approaches, such as State-Space Models and linear attention, offer improved efficiency but remain limited in expressiveness and in modeling long-range dependencies. Test-Time Training (TTT) layers provide a more flexible framework by parameterizing hidden states with nonlinear, input-dependent updates; however, prior approaches have relied on reconstruction-based objectives with little justification, and alternative learning objectives have received scant consideration. In this work, we propose Contrastive Test-Time Training (CTT), which integrates a contrastive learning objective into the TTT framework to explicitly align relevant query–value pairs while suppressing irrelevant features. On language modeling tasks at 140M parameters, CTT matches the performance of existing TTT models, indicating that the contrastive objective is neither detrimental nor inferior at smaller scales. While not beneficial on its own at this scale, our evidence shows that CTT amplifies properties observed in models trained with Muon-based optimizers, which are state-of-the-art for training larger models. This suggests that CTT has the potential to surpass existing TTT approaches once scaled to larger model sizes.
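To make the abstract's description of the contrastive objective concrete, below is a minimal sketch of an InfoNCE-style loss that aligns each query with its corresponding value while treating other values in the sequence as negatives. This is an illustration of the general idea only, assuming a standard InfoNCE formulation; the function name, tensor shapes, and temperature are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: an InfoNCE-style contrastive loss over query-value pairs.
# All names, shapes, and hyperparameters here are assumptions for illustration;
# the paper's exact CTT objective is not reproduced.
import torch
import torch.nn.functional as F


def contrastive_qv_loss(queries: torch.Tensor,
                        values: torch.Tensor,
                        temperature: float = 0.1) -> torch.Tensor:
    """Align each query with its matching value (positive pair) while
    pushing it away from all other values in the sequence (negatives).

    queries, values: (T, d) tensors for a sequence of length T.
    """
    q = F.normalize(queries, dim=-1)
    v = F.normalize(values, dim=-1)
    logits = q @ v.T / temperature                        # (T, T) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)    # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```

In a TTT-style layer, a loss of this form could in principle replace the reconstruction objective used to update the fast (hidden-state) weights during the inner loop, which is the substitution the abstract describes at a high level.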
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Razvan_Pascanu1
Submission Number: 6157