Abstract: Gated Linear Unit (GLU) variants such as SwiGLU are now widely used in modern Transformers. However, the GLU functions explored in the recent literature represent only a small fraction of the possible GLU design space. Starting from a mathematically complete enumeration of all zeroth-, first-, and second-order GLU formulas, we conduct a controlled study on ViT-Tiny across CIFAR-10, CIFAR-100, SVHN, and ImageNet-64, instantiating each GLU formula with Sigmoid, Tanh, and Sin activations. Under identical training recipes and equal parameter counts, our proposed first-order GLU variant \textbf{SinGLU} consistently outperforms SwiGLU, the de facto standard in contemporary Transformers. Inference latency differs by less than 0.1\% on an NVIDIA A100 GPU, confirming cost parity. All code and model weights will be released upon publication.
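For readers unfamiliar with the block being varied, below is a minimal PyTorch sketch of a gated feed-forward layer. The SwiGLU branch follows the standard form from Shazeer (2020), FFN(x) = (Swish(xW) ⊙ xV) W2. The `sin` gate is an assumed reading of SinGLU (the abstract does not spell out its formula), and the class name `GLUFeedForward` and the hidden width are illustrative choices, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GLUFeedForward(nn.Module):
    """Transformer feed-forward block with a gated linear unit.

    SwiGLU (Shazeer, 2020): FFN(x) = (Swish(x W) * (x V)) W2.
    The 'sin' gate is an *assumed* interpretation of the paper's SinGLU,
    replacing the Swish gate with sin; the exact formula is not given
    in the abstract.
    """

    def __init__(self, d_model: int, d_hidden: int, gate: str = "swish"):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate branch W
        self.w_val = nn.Linear(d_model, d_hidden, bias=False)   # value branch V
        self.w_out = nn.Linear(d_hidden, d_model, bias=False)   # output projection W2
        self.gate = gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.w_gate(x)
        if self.gate == "swish":   # SwiGLU: Swish(xW) ⊙ xV
            g = F.silu(g)
        elif self.gate == "sin":   # assumed SinGLU: sin(xW) ⊙ xV
            g = torch.sin(g)
        else:
            raise ValueError(f"unknown gate: {self.gate}")
        return self.w_out(g * self.w_val(x))


# Both variants have identical parameter counts, matching the paper's
# controlled-comparison setup. d_model=192 is ViT-Tiny's embedding width;
# d_hidden=512 is an illustrative hidden size.
ffn_swi = GLUFeedForward(d_model=192, d_hidden=512, gate="swish")
ffn_sin = GLUFeedForward(d_model=192, d_hidden=512, gate="sin")
y = ffn_sin(torch.randn(8, 197, 192))  # (batch, tokens, d_model)
```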
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Wuyang_Chen1
Submission Number: 7814