Keywords: Large Language Models, Parallel Computing, Neural Networks, Transformer Architecture, Model Serving
Abstract: Large Language Models demonstrate remarkable capabilities at the cost of high compute requirements. While recent research has shown that intermediate layers can be removed or have their order shuffled without significantly impacting performance, these findings have not been employed to reduce the computational cost of inference. We investigate several potential ways to reduce the depth of pre-trained LLMs without significantly affecting performance. Leveraging our insights, we present a novel approach that exploits this decoupling between layers by grouping some of them into pairs that can be evaluated in parallel. This modification of the computational graph, through better parallelism, yields an average speedup of around 1.20x in tokens generated per second, without retraining or fine-tuning, while retaining 95%-99% of the original accuracy. Empirical evaluation demonstrates that this approach significantly improves serving efficiency while maintaining model performance, offering a practical improvement for large-scale LLM deployment.
Submission Number: 32
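To make the idea of "grouping layers into pairs that can be evaluated in parallel" concrete, below is a minimal, hypothetical PyTorch sketch. It is not the authors' code: it assumes each transformer block applies a residual update x + f(x), so a sequential pair x + f_a(x) + f_b(x + f_a(x)) is approximated by the parallel form x + f_a(x) + f_b(x). The `Block` and `ParallelPair` classes are illustrative placeholders for how the computational graph could be restructured.

```python
# Hypothetical sketch of pairwise layer parallelism (not the paper's implementation).
import torch
import torch.nn as nn


class Block(nn.Module):
    """Stand-in for a pre-trained transformer block with a residual connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def residual_branch(self, x: torch.Tensor) -> torch.Tensor:
        # The non-identity part f(x) of the block.
        return self.mlp(self.norm(x))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.residual_branch(x)


class ParallelPair(nn.Module):
    """Evaluates two blocks on the same input and sums their residual updates."""

    def __init__(self, block_a: Block, block_b: Block):
        super().__init__()
        self.block_a = block_a
        self.block_b = block_b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both residual branches depend only on x, so they can be launched
        # concurrently (e.g. on separate CUDA streams or via kernel fusion).
        return x + self.block_a.residual_branch(x) + self.block_b.residual_branch(x)


if __name__ == "__main__":
    dim = 64
    x = torch.randn(2, 8, dim)
    a, b = Block(dim), Block(dim)
    sequential = b(a(x))               # original depth-2 computation
    parallel = ParallelPair(a, b)(x)   # shallower, parallelizable approximation
    print((sequential - parallel).abs().mean())  # approximation gap
```

The sketch only restructures the dependency graph; the reported throughput gain would come from actually executing the two independent branches concurrently (e.g. on separate CUDA streams or by batching their matrix multiplications), which is left out here for brevity.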