Transformers on Consumer Hardware: A Critical Perspective of TinyML Optimization Techniques and Open Problems

TMLR Paper8979 Authors

16 May 2026 (modified: 25 May 2026) | Under review for TMLR | CC BY 4.0
Abstract: Deep learning models based on the transformer architecture have a reputation for requiring large compute resources for training and inference. This requirement has placed transformer-based models, such as large language models and other generative AI models, beyond the reach of the low-resource devices that make up most of the world's computer systems. Meanwhile, machine learning is experiencing a revolution of a smaller sort, in which techniques under the umbrella of TinyML optimize feed-forward and convolutional models to run successfully on these low-resource devices. Gated access to high-performance compute clusters, rising compute costs, and a general lack of access to such resources have driven research that combines these two fields. Today, TinyML techniques such as pruning, quantization, and software-hardware co-design are being applied to transformer-based models in order to deploy them on low-resource and edge devices. Analysis of the surveyed works reveals that the techniques applied are largely orthogonal to one another, that knowledge distillation is significantly underrepresented, and that edge training remains rare. The most accessible path toward progress lies in combining the independently developed contributions already present in the literature.
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Yoshitomo_Matsubara1
Submission Number: 8979