Abstract: The remarkable capabilities of Large Language Models (LLMs) are shadowed by their
immense computational cost. While recent work has shown that many LLM layers can be
reordered or even removed with minimal impact on accuracy, these insights have not been
translated into significant inference speedups. To bridge this gap, we introduce a novel
method that restructures the computational graph by grouping and evaluating consecutive
layer pairs in parallel. This approach requires no retraining and increases inference
throughput by 1.05x–1.20x while preserving 95–99% of the original model's accuracy on
standard benchmarks. We demonstrate the practical value of this method for
large-scale LLM deployment and show that part of the accuracy loss can be
recovered with lightweight fine-tuning of the parallelized layers.
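To make the layer-pairing idea concrete, here is a minimal PyTorch-style sketch of evaluating two consecutive residual layers on the same input so that their computations have no sequential dependency. The names `PairParallelBlock` and `pair_consecutive_layers`, and the assumption that each layer maps x to x + f(x) and that the paired residual updates are simply summed, are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class PairParallelBlock(nn.Module):
    """Evaluate two consecutive residual layers on the same input.

    Sketch under an assumption: each wrapped layer maps x to x + f(x)
    (a standard pre-norm residual block), so applying both layers to the
    same input and subtracting one copy of x yields x + f_a(x) + f_b(x).
    The two layer calls share no sequential data dependency and could be
    dispatched concurrently, e.g. on separate CUDA streams.
    """

    def __init__(self, layer_a: nn.Module, layer_b: nn.Module):
        super().__init__()
        self.layer_a = layer_a
        self.layer_b = layer_b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (x + f_a(x)) + (x + f_b(x)) - x == x + f_a(x) + f_b(x)
        return self.layer_a(x) + self.layer_b(x) - x


def pair_consecutive_layers(layers: nn.ModuleList) -> nn.ModuleList:
    """Greedily group consecutive layers into parallel pairs.

    An odd trailing layer is left sequential; choosing which layers are
    safe to pair is a modeling question this sketch does not address.
    """
    paired, i = [], 0
    while i + 1 < len(layers):
        paired.append(PairParallelBlock(layers[i], layers[i + 1]))
        i += 2
    paired.extend(layers[i:])  # keep any leftover layer unchanged
    return nn.ModuleList(paired)
```

In use, one would replace the decoder stack of a pretrained model with the paired version and benchmark throughput and accuracy before and after; whether the residual updates are summed, averaged, or recombined differently is exactly the kind of design choice the paper itself specifies.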
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Changyou_Chen1
Submission Number: 5853