Abstract: Gated Linear Unit (GLU) variants such as SwiGLU are now widely used in modern Transformers. However, the GLU functions explored in the recent literature represent only a small fraction of the possible GLU design space. Starting from a mathematically complete enumeration of all zeroth-, first-, and second-order GLU formulas, we conduct a controlled study on ViT-Tiny across CIFAR-10, CIFAR-100, SVHN, and ImageNet-64, instantiating each GLU formula with Sigmoid, Tanh, and Sin activations. Under identical training recipes and equal parameter counts, our proposed first-order GLU variant \textbf{SinGLU} consistently outperforms SwiGLU, the de facto standard in contemporary Transformers. Inference latency differs by less than 0.1\% on an NVIDIA A100 GPU, confirming cost parity. All code and model weights will be released upon publication.
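For readers unfamiliar with the block being varied, below is a minimal PyTorch sketch of a gated feed-forward layer. The SwiGLU branch follows the standard form from Shazeer (2020), FFN(x) = (Swish(xW) ⊙ xV) W2. The `sin` gate is an assumed reading of SinGLU (the abstract does not spell out its formula), and the class name `GLUFeedForward` and the hidden width are illustrative choices, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GLUFeedForward(nn.Module):
    """Transformer feed-forward block with a gated linear unit.

    SwiGLU (Shazeer, 2020): FFN(x) = (Swish(x W) * (x V)) W2.
    The 'sin' gate is an *assumed* interpretation of the paper's SinGLU,
    replacing the Swish gate with sin; the exact formula is not given
    in the abstract.
    """

    def __init__(self, d_model: int, d_hidden: int, gate: str = "swish"):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate branch W
        self.w_val = nn.Linear(d_model, d_hidden, bias=False)   # value branch V
        self.w_out = nn.Linear(d_hidden, d_model, bias=False)   # output projection W2
        self.gate = gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.w_gate(x)
        if self.gate == "swish":   # SwiGLU: Swish(xW) ⊙ xV
            g = F.silu(g)
        elif self.gate == "sin":   # assumed SinGLU: sin(xW) ⊙ xV
            g = torch.sin(g)
        else:
            raise ValueError(f"unknown gate: {self.gate}")
        return self.w_out(g * self.w_val(x))


# Both variants have identical parameter counts, matching the paper's
# controlled-comparison setup. d_model=192 is ViT-Tiny's embedding width;
# d_hidden=512 is an illustrative hidden size.
ffn_swi = GLUFeedForward(d_model=192, d_hidden=512, gate="swish")
ffn_sin = GLUFeedForward(d_model=192, d_hidden=512, gate="sin")
y = ffn_sin(torch.randn(8, 197, 192))  # (batch, tokens, d_model)
```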
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Wuyang_Chen1
Submission Number: 7814