Abstract: This paper proposes GreenLLM, a framework for deploying generative Large Language Models (LLMs) on resource-limited edge devices that meets memory and timing constraints while minimizing energy consumption. Specifically, GreenLLM employs an energy estimation scheme based on physical hardware to guide a pruning-ratio generator that incorporates space, weight, and power (SWaP) constraints to produce optimal pruning ratios. For each layer, a dependency-aware, energy-efficient pruner operates in a task-agnostic manner, preserving as much of the LLM's functionality as possible. Finally, the pruned model is fine-tuned on downstream datasets to recover performance.
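As a rough illustration of the layer-wise idea in the abstract (energy estimates guiding per-layer pruning ratios, followed by task-agnostic magnitude pruning), the sketch below is a minimal stand-in: the function names (`allocate_ratios`, `prune_layer`) and the energy-proportional allocation rule are assumptions for illustration, not the paper's actual SWaP-constrained generator or dependency-aware pruner.

```python
# Hypothetical sketch: per-layer pruning ratios scaled by estimated energy cost,
# then magnitude-based pruning within each layer. Names and heuristics here are
# illustrative assumptions, not GreenLLM's published algorithm.

def allocate_ratios(layer_energies, target_fraction):
    """Assign higher pruning ratios to layers with higher estimated energy,
    scaled so the mean ratio equals target_fraction."""
    total = sum(layer_energies)
    raw = [e / total * len(layer_energies) * target_fraction
           for e in layer_energies]
    return [min(r, 0.9) for r in raw]  # cap so no layer is pruned away entirely

def prune_layer(weights, ratio):
    """Zero out the smallest-magnitude fraction `ratio` of a layer's weights
    (task-agnostic: no labels or downstream data needed)."""
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * ratio)
    threshold = flat[k] if k < len(flat) else float("inf")
    return [w if abs(w) >= threshold else 0.0 for w in weights]

# Assumed per-layer energy estimates and a 25% average pruning target.
layer_energies = [1.0, 2.0, 1.0]
ratios = allocate_ratios(layer_energies, 0.25)  # energy-hungry layer pruned more
pruned = [prune_layer([0.1, -0.5, 0.02, 0.8], r) for r in ratios]
```

In a real pipeline, the pruned model would then be fine-tuned on downstream datasets to recover any lost accuracy, as the abstract describes.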