Optical Transformers

Maxwell Anderson; Shi-Yuan Ma; Tianyu Wang; Logan Wright; Peter McMahon

Published: 25 Mar 2024, Last Modified: 17 Sept 2024. Accepted by TMLR. License: CC BY 4.0.

Abstract: The rapidly increasing size of deep-learning models has renewed interest in alternatives to digital-electronic computers as a means to dramatically reduce the energy cost of running state-of-the-art neural networks. Optical matrix-vector multipliers are best suited to performing computations with very large operands, which suggests that large Transformer models could be a good target for them. In this paper, we investigate, through a combination of simulations and experiments on prototype optical hardware, the feasibility and potential energy benefits of running Transformer models on future optical accelerators that perform matrix-vector multiplication. We use simulations, with noise models validated by small-scale optical experiments, to show that optical accelerators for matrix-vector multiplication should be able to accurately run a typical Transformer architecture model for language processing. We demonstrate that optical accelerators can achieve the same (or better) perplexity as digital-electronic processors at 8-bit precision, provided that the optical hardware uses sufficiently many photons per inference, which translates directly to a requirement on optical energy per inference. We study numerically how the requirement on optical energy per inference changes as a function of the Transformer width $d$ and find that the optical energy per multiply--accumulate (MAC) scales approximately as $1/d$, giving an asymptotic advantage over digital systems. We also analyze the total system energy costs for optical accelerators running Transformers, including both optical and electronic costs, as a function of model size. We predict that well-engineered, large-scale optical hardware should be able to achieve a $100\times$ energy-efficiency advantage over current digital-electronic processors in running some of the largest current Transformer models, and if both the models and the optical hardware are scaled to the quadrillion-parameter regime, optical accelerators could have a $>8{,}000\times$ energy-efficiency advantage. Under plausible assumptions about future improvements to electronics and Transformer quantization techniques ($5\times$ cheaper memory access, double the digital--analog conversion efficiency, and 4-bit precision), we estimate that the energy advantage for optical processors versus electronic processors operating at 300 fJ/MAC could grow to $>100{,}000\times$.
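
The two quantitative claims in the abstract can be made concrete with a small arithmetic sketch. It is illustrative only and not the authors' energy model. The function names, the reference width `d_ref`, and the reference energy `e_ref_fj` are hypothetical placeholders. The first function assumes exact $1/d$ scaling, whereas the abstract says the scaling holds only approximately. The second converts a quoted advantage factor into an energy per MAC against the 300 fJ/MAC electronic baseline named in the abstract. The first function covers optical energy only, while the advantage factors refer to energy comparisons that, per the abstract, include electronic costs as well.

```python
# Electronic baseline quoted in the abstract, in femtojoules per MAC.
ELECTRONIC_FJ_PER_MAC = 300.0


def optical_energy_per_mac(d: int, d_ref: int, e_ref_fj: float) -> float:
    """Extrapolate optical energy per MAC (fJ) from a reference point.

    Assumes exact 1/d scaling in the Transformer width d. The reference
    point (d_ref, e_ref_fj) is a hypothetical input, not a value from the paper.
    """
    if d <= 0 or d_ref <= 0:
        raise ValueError("widths must be positive")
    return e_ref_fj * d_ref / d


def implied_energy_per_mac(advantage: float,
                           electronic_fj_per_mac: float = ELECTRONIC_FJ_PER_MAC) -> float:
    """Energy per MAC (fJ) implied by an advantage factor over the baseline."""
    if advantage <= 0:
        raise ValueError("advantage must be positive")
    return electronic_fj_per_mac / advantage


if __name__ == "__main__":
    # Under 1/d scaling, each doubling of the width halves the optical energy per MAC.
    for d in (1024, 2048, 4096):
        print(d, optical_energy_per_mac(d, d_ref=1024, e_ref_fj=1.0))

    # Energy per MAC implied by the >100,000x figure against 300 fJ/MAC.
    print(implied_energy_per_mac(100_000))
```

Read literally, an advantage above $100{,}000\times$ over a 300 fJ/MAC baseline corresponds to less than 0.003 fJ (3 aJ) per MAC.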

Submission Length: Regular submission (no more than 12 pages of main content)

Supplementary Material: zip

Assigned Action Editor: ~Jaehoon_Lee2

Submission Number: 1643
