LOGCA: Layer-Optimized GPU-CPU Allocation for Efficient Resource Management in Large-Scale Models

Published: 16 Oct 2025, Last Modified: 10 Nov 2025
Venue: NeurIPS 2025 ER Workshop
License: CC BY 4.0
Keywords: Large-Scale Models; Efficient Resource Management
Abstract: Efficient deployment of large-scale models in resource-limited environments requires intelligent resource management. While prior methods like PowerInfer offload less important neurons to CPUs, they overlook the varying importance of model layers. We propose LOGCA (Layer-Optimized GPU-CPU Allocation), which dynamically assigns layers to GPU or CPU based on importance, measured via a weighted angular distance incorporating neuron activation strength. Critical layers are executed on GPU for efficiency, while less important ones are offloaded to CPU to save memory. LOGCA further introduces an adaptive thresholding mechanism that adjusts in real-time based on system load, improving scalability. Our method boosts computational speed and memory efficiency, making it well-suited for large-scale models in constrained settings.
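The allocation idea from the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the exact form of the weighted angular distance, the activation-strength weighting, and the load-adaptive threshold are assumptions. Here the importance of a layer is taken as the angular distance between its input and output hidden states, scaled by an activation-strength weight, and a simple threshold splits layers between GPU and CPU.

```python
import numpy as np

def weighted_angular_distance(h_in, h_out, activation_strength):
    """Hypothetical layer-importance score: angular distance between the
    layer's input and output hidden states, normalized to [0, 1] and
    weighted by a neuron activation-strength term."""
    denom = np.linalg.norm(h_in) * np.linalg.norm(h_out) + 1e-12
    cos_sim = float(np.dot(h_in, h_out)) / denom
    angle = np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi
    return activation_strength * angle

def allocate_layers(importances, threshold):
    """Layers at or above the threshold run on GPU; the rest are
    offloaded to CPU. In LOGCA the threshold would adapt to system
    load; here it is a fixed value for illustration."""
    return ["GPU" if s >= threshold else "CPU" for s in importances]

# Toy example: hidden states before/after three layers.
rng = np.random.default_rng(0)
h = [rng.standard_normal(16) for _ in range(4)]
scores = [weighted_angular_distance(h[i], h[i + 1], 1.0) for i in range(3)]
plan = allocate_layers(scores, threshold=float(np.median(scores)))
```

A layer that barely changes its input has a small angular distance and is a natural candidate for CPU offloading, which matches the abstract's intuition that less important layers can tolerate slower execution.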
Submission Number: 5