Keywords: Tensor Program Tuning, LLM, Hardware
Abstract: Tensor program tuning is critical for inference acceleration of deep neural networks (DNNs), especially Large Language Models (LLMs). Yet its effectiveness hinges on cost models for accurate performance estimation.
Existing cost models rely on manually designed, hardware-specific features and extensive profiling data. They therefore suffer from high development costs, poor efficiency, and limited generalization, and have become a significant bottleneck in the face of rapidly evolving models and hardware.
In this paper, we propose LLMTuner, a novel framework enabling LLMs to analyze tensor program execution behaviors and accurately estimate tensor program performance across diverse hardware. LLMTuner introduces a coarse-to-fine process: a lightweight LLM-based classifier first filters out suboptimal programs, then a finetuned LLM infers multi-dimensional execution behavior scores to predict latency across different hardware.
Experiments demonstrate that LLMTuner significantly improves estimation accuracy by up to 64.8\%, compared with general-purpose LLMs and other cost models on benchmark datasets across 6 CPU and 5 GPU platforms.
It can even accurately estimate performance on unseen hardware, achieving a 49.2\% accuracy improvement over other cost models.
In practical DNN and LLM tuning tasks, LLMTuner discovers programs with up to 1.47$\times$ better performance and 3.27$\times$ higher tuning efficiency than other cost models.
Moreover, LLMTuner with finetuned lightweight LLMs reduces the estimation time by over 30$\times$ compared to DeepSeek R1.
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 13627