Keywords: optimization; transformer training; attention prior; fuzzy systems; entropic transport; Sinkhorn; selective SWA; label smoothing; language modeling; time-series forecasting
TL;DR: We add a zero-cost, length-aware attention prior (RPA) and a tiny validation-driven controller that preserve late-phase gains in small Transformers. On WikiText-2, they lower validation cross-entropy without adding inference cost; results on time-series and equities are deferred.
Abstract: Small/medium Transformers often stall late in training as low learning rates and averaging wash out genuine incremental gains. We introduce two minimal, training-time additions: (1) a zero-cost attention prior built from fuzzy token-to-regime memberships aligned to a length-aware positional basis via entropic transport (RPA), and (2) a tiny “gain-aware” controller that sharpens attention only when validation improvements justify it. We also use a simple optimization recipe (non-zero LR floor, selective SWA) to preserve late-phase progress. The RPA prior is standardized to remain commensurate with content logits and to play nicely with softmax’s row-shift invariance. Under compute parity on WikiText-2 (raw-v1, GPT-2 BPE), our recipe reduces validation cross-entropy without increasing inference cost. We position this as a practical, proof-of-work step: the fuzzy inductive bias is the key lever; alignment and the controller are small auxiliary aids. To support reproducibility, the appendix includes the algorithmic listings (the majority of the code) required to recreate our runs. We provide loaders/configs for time-series/equities to illustrate potential future applications of this setup, but we do not report results on those targets in this submission.
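Illustrative sketch (not the authors' listing): the abstract's description of RPA can be made concrete with a small NumPy example. Everything below is an assumption made for illustration: the cosine positional basis, the squared-distance cost, the uniform marginals, the shapes, and all names (`sinkhorn`, `standardize_rows`, `memberships`, `basis`) are hypothetical, and no causal mask is applied. It shows only the three ingredients the abstract names: an entropic-transport (Sinkhorn) alignment between fuzzy regime memberships and a length-normalized basis, row-wise standardization of the resulting prior, and the fact that softmax ignores a constant added to every entry of a row.

```python
import numpy as np

def sinkhorn(cost, eps=0.5, n_iters=200):
    """Entropic transport plan between two uniform marginals (plain Sinkhorn)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)           # assumed uniform source marginal
    b = np.full(m, 1.0 / m)           # assumed uniform target marginal
    kernel = np.exp(-cost / eps)      # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)          # rescale rows toward a
        v = b / (kernel.T @ u)        # rescale columns toward b
    return u[:, None] * kernel * v[None, :]

def standardize_rows(x, tiny=1e-6):
    """Zero mean, unit standard deviation along each row."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + tiny)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize; does not change the output
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, K, d = 8, 4, 16                    # hypothetical: tokens, regimes, head dim

# Fuzzy token-to-regime memberships: each row is a distribution over K regimes.
memberships = softmax(rng.normal(size=(T, K)))

# Length-aware basis (assumed form): cosines of position divided by length T.
pos = np.arange(T) / T
basis = np.cos(np.pi * np.outer(pos, np.arange(K)))        # shape (T, K)

# Cost between regime k and basis function j, scaled to [0, 1].
diff = memberships[:, :, None] - basis[:, None, :]         # shape (T, K, K)
cost = (diff ** 2).sum(axis=0)
cost = cost / cost.max()
plan = sinkhorn(cost)                                      # shape (K, K)

# Prior over (query, key) positions, standardized per row so its scale is
# comparable to scaled dot-product logits.
prior = standardize_rows(memberships @ (K * plan) @ basis.T)   # shape (T, T)

# Content logits from random queries/keys, then attention with the prior added.
q, k = rng.normal(size=(T, d)), rng.normal(size=(T, d))
content = q @ k.T / np.sqrt(d)
attn = softmax(content + prior)

# Row-shift invariance: a per-row constant leaves the attention unchanged.
row_shift = rng.normal(size=(T, 1))
attn_shifted = softmax(content + prior + row_shift)
print(np.allclose(attn, attn_shifted))
```

Because the prior in this sketch is computed once from memberships and positions and simply added to the logits, it introduces no extra parameters at the attention step; whether and how the authors cache or drop it at inference is not stated in the abstract.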
Primary Area: optimization
Submission Number: 23470