Abstract: The remarkable capabilities of Large Language Models (LLMs) are overshadowed by their immense
computational cost. While recent work has shown that many LLM layers can be reordered or even
removed with minimal impact on accuracy, these insights have not been translated into significant
inference speedups. To bridge this gap, we introduce a novel method that restructures the
computational graph by grouping and evaluating consecutive layer pairs in parallel. This approach,
requiring no retraining, yields a 1.19x throughput gain on Llama 2 7B while reducing the average
benchmark accuracy by only 1.5%. We demonstrate the practical value of this method for large-scale
LLM deployment and show that some of the lost accuracy can be recovered with lightweight
fine-tuning of the parallelized layers.
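The core idea can be illustrated with a minimal, hypothetical sketch (not the authors' exact implementation): instead of applying two consecutive residual layers sequentially, both layers are evaluated on the same input and their updates are summed, which removes the data dependency between them and lets a runtime execute the pair concurrently. The `ToyBlock`, `sequential_pair`, and `parallel_pair` names below are illustrative assumptions, not identifiers from the paper.

```python
# Hypothetical sketch of pairwise layer parallelism, assuming a standard
# pre-residual formulation x <- x + f(x). Not the authors' implementation.
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Stand-in for a transformer layer: a small feed-forward block."""

    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection is added by the caller.
        return self.ff(x)


def sequential_pair(x, block_a, block_b):
    # Standard execution: the second layer depends on the first layer's output.
    x = x + block_a(x)
    x = x + block_b(x)
    return x


def parallel_pair(x, block_a, block_b):
    # Parallelized pair: both layers see the same input and their updates are
    # summed, so a runtime can evaluate them concurrently (e.g., on separate
    # CUDA streams). This is an approximation of the sequential computation.
    return x + block_a(x) + block_b(x)


if __name__ == "__main__":
    torch.manual_seed(0)
    dim = 64
    x = torch.randn(2, 8, dim)  # (batch, sequence, hidden)
    a, b = ToyBlock(dim), ToyBlock(dim)
    print("max |sequential - parallel|:",
          (sequential_pair(x, a, b) - parallel_pair(x, a, b)).abs().max().item())
```

The printed difference shows that the parallel pairing changes the network's output, which is consistent with the small accuracy drop reported in the abstract and with the option of recovering accuracy via lightweight fine-tuning of the paired layers.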
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Changyou_Chen1
Submission Number: 5853