Can Neural Architecture Search Help us Find Faster LLM Architectures? Experiments with GPT-2 based Text Predictor
Abstract: Inference with Large Language Models (LLMs) is costly and often dominates the life-cycle cost of LLM-based services. Neural Architecture Search (NAS) can automatically find architectures that optimize the trade-off between accuracy and inference cost. However, NAS for LLM architectures is computationally prohibitive. We apply the recently proposed LiteTransformerSearch algorithm to reduce the inference latency of a GPT-2-based text prediction system by 25% without compromising its accuracy. In the process, we discover new constraints that the optimal neural architectures satisfy, which are therefore useful in practice for further reducing the computational cost of NAS.
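To make the approach described in the abstract concrete, below is a minimal, hypothetical sketch of searching GPT-2 configurations for a good latency-accuracy trade-off. It is not the paper's actual pipeline: the search-space bounds, the simple random-search loop, and the helper names (`sample_config`, `median_latency`, `decoder_params`) are illustrative assumptions; the use of decoder (non-embedding) parameter count as a training-free quality proxy follows the idea behind LiteTransformerSearch.

```python
# Hypothetical sketch: random search over GPT-2 configurations, keeping the
# Pareto front over (measured latency, decoder-parameter-count proxy).
# Assumptions are labeled inline; this is not the authors' implementation.
import random
import time

import torch
from transformers import GPT2Config, GPT2LMHeadModel


def sample_config() -> GPT2Config:
    """Sample a GPT-2 variant from a hypothetical search space (head dim fixed at 64)."""
    n_layer = random.choice([6, 8, 10, 12])
    n_head = random.choice([8, 10, 12])
    return GPT2Config(n_layer=n_layer, n_head=n_head, n_embd=64 * n_head, n_positions=256)


def median_latency(model: GPT2LMHeadModel, seq_len: int = 64, n_runs: int = 5) -> float:
    """Median wall-clock latency (seconds) of one forward pass on random token ids."""
    model.eval()
    ids = torch.randint(0, model.config.vocab_size, (1, seq_len))
    times = []
    with torch.no_grad():
        for _ in range(n_runs):
            start = time.perf_counter()
            model(ids)
            times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]


def decoder_params(model: GPT2LMHeadModel) -> int:
    """Non-embedding (decoder) parameter count, used here as a proxy for model quality."""
    total = sum(p.numel() for p in model.parameters())
    emb = sum(p.numel() for p in model.transformer.wte.parameters())
    emb += sum(p.numel() for p in model.transformer.wpe.parameters())
    return total - emb


# Evaluate candidates without any training, then keep the Pareto-optimal ones:
# no other candidate is both faster and has a higher quality proxy.
candidates = []
for _ in range(10):
    model = GPT2LMHeadModel(sample_config())
    candidates.append((median_latency(model), decoder_params(model), model.config))

pareto = [
    c for c in candidates
    if not any(o[0] <= c[0] and o[1] >= c[1] and (o[0] < c[0] or o[1] > c[1])
               for o in candidates)
]
for latency, params, cfg in sorted(pareto, key=lambda c: c[0]):
    print(f"{latency * 1000:6.1f} ms  {params / 1e6:6.1f}M decoder params  "
          f"layers={cfg.n_layer} heads={cfg.n_head} d_model={cfg.n_embd}")
```

The key design choice this sketch illustrates is the training-free proxy: because candidates are ranked without training each one, the search cost stays far below that of conventional NAS, which is what makes the approach feasible for LLM-scale architectures.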
Paper Type: short
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: Approaches for low-compute settings (efficiency)
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.