Scaling of Optical Transformers

Published: 01 Nov 2023, Last Modified: 22 Dec 2023, Venue: MLNCP Oral
Keywords: optics, accelerators, neuromorphic, energy, power consumption, transformers, language models, large language models, deep learning, neural networks, quantization, model compression, scaling laws
TL;DR: Transformer models' design and scale allow for large energy advantages if run on optical neural-network accelerator hardware.
Abstract: The rapidly increasing size of deep-learning models has renewed interest in alternatives to digital-electronic computers as a means to dramatically reduce the inference energy cost of running state-of-the-art neural networks. Optical matrix-vector multipliers are best suited to performing computations with very large operands, which suggests that large Transformer models could be a good target for them. However, the ability of optical accelerators to run efficiently depends on the model being run, and on whether the model can be run at all when subject to the noise, error, and low precision of analog-optical hardware. Here we investigate whether Transformers meet the criteria to be efficient when running optically, what benefits can be had from doing so, and how worthwhile it is at scale. Using small-scale experiments on, and simulations of, a prototype hardware accelerator, we found that Transformers may run on optical hardware, and that elements of their design (the ability to process data in parallel using the same weights, and trends in scaling them to enormous widths) allow them to achieve an asymptotic energy-efficiency advantage when running optically rather than on digital hardware. Based on a model of a full optical accelerator system, we predict that well-engineered, large-scale optical hardware should be able to achieve a 100× energy-efficiency advantage over current digital-electronic processors in running some of the largest current Transformer models, and that if both the models and the optical hardware are scaled to the quadrillion-parameter regime, optical accelerators could have a >8,000× energy-efficiency advantage.
Submission Number: 29