Optical Transformers

11 May 2023 (modified: 12 Dec 2023) · Submitted to NeurIPS 2023
Keywords: Optics, Transformers, Accelerator, Energy Efficiency, Power Consumption, LLM, Large Language Models, Hardware, Optical Neural Networks, Scaling, Scaling Laws, Quantization
TL;DR: Running Transformers on optical accelerator hardware could yield orders-of-magnitude energy-efficiency advantages.
Abstract: The rapidly increasing size of deep-learning models has caused renewed and growing interest in alternatives to digital computers to dramatically reduce the energy cost of running state-of-the-art neural networks. Optical matrix-vector multipliers are best suited to performing computations with very large operands, which leads us to hypothesize that large Transformer models might achieve asymptotic energy advantages when run optically rather than digitally. To test this idea, we performed small-scale optical experiments with a prototype accelerator to demonstrate that Transformer operations can run on optical hardware despite noise and errors. Using experiment-calibrated simulations of our hardware, we studied the behavior of running Transformers optically, identifying scaling laws for model performance with respect to optical energy usage and estimating total system power consumption. We found that the optical energy per multiply-accumulate (MAC) scales as $\frac{1}{d}$, where $d$ is the Transformer width, an asymptotic advantage over digital systems. Should well-engineered, large-scale optical hardware be developed, it might achieve a $100\times$ energy-efficiency advantage for running some of the largest current Transformer models, and if both the models and the optical hardware are scaled to the quadrillion-parameter regime, optical computers could have a $>8,000\times$ energy-efficiency advantage over state-of-the-art digital-electronic processors (300 fJ/MAC). We discussed how these results motivate and inform the construction of future optical accelerators and optics-amenable deep-learning approaches. With assumptions about future improvements to electronics and Transformer quantization techniques (5× cheaper memory access, double the digital–analog conversion efficiency, and 4-bit precision), we estimated that optical computers' advantage over these digital processors could grow to $>100,000\times$.
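The claimed asymptotic advantage follows directly from the scaling law: if optical energy per MAC falls as $1/d$ while the digital cost per MAC stays fixed, the efficiency ratio grows linearly in the model width $d$. A minimal sketch of this arithmetic, where the optical constant `K_FJ` is a hypothetical illustrative value (not a figure from the paper; only the 300 fJ/MAC digital baseline is quoted in the abstract):

```python
# Sketch of the 1/d scaling argument. K_FJ is a made-up illustrative
# constant; the paper's calibrated values would differ.
DIGITAL_FJ_PER_MAC = 300.0  # digital baseline quoted in the abstract
K_FJ = 3.0e5                # hypothetical optical energy constant (fJ)

def optical_fj_per_mac(d):
    """Optical energy per multiply-accumulate for Transformer width d,
    following the paper's reported ~1/d scaling."""
    return K_FJ / d

def advantage(d):
    """Energy-efficiency ratio (digital cost / optical cost) at width d.
    Grows linearly in d, which is the asymptotic-advantage claim."""
    return DIGITAL_FJ_PER_MAC / optical_fj_per_mac(d)

for d in (1_024, 16_384, 262_144):
    print(f"d = {d:>7}: advantage = {advantage(d):.1f}x")
```

Whatever the true constant, doubling the width doubles the ratio, which is why the advantage compounds as both models and hardware scale toward the quadrillion-parameter regime.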
Supplementary Material: zip
Submission Number: 10272