Keywords: Tensor Program Tuning, LLM, Hardware
Abstract: Tensor program tuning is critical for inference acceleration of deep neural networks (DNNs), especially Large Language Models (LLMs). Yet its effectiveness hinges on cost models for accurate performance estimation.
Existing cost models rely on manually designed, hardware-specific features and extensive profiling data. They therefore suffer from high development costs, poor efficiency, and limited generalization, and have become a significant bottleneck in the face of rapidly evolving models and hardware.
In this paper, we propose LLMTuner, a novel framework enabling LLMs to analyze tensor program execution behaviors and accurately estimate tensor program performance across diverse hardware. LLMTuner introduces a coarse-to-fine process: a lightweight LLM-based classifier first filters out suboptimal programs, then a finetuned LLM infers multi-dimensional execution behavior scores to predict latency across different hardware.
Experiments demonstrate that LLMTuner significantly improves estimation accuracy by up to 64.8\%, compared with general-purpose LLMs and other cost models on benchmark datasets across 6 CPU and 5 GPU platforms.
It can even accurately estimate performance on unseen hardware, achieving a 49.2\% accuracy improvement over other cost models.
In practical DNN and LLM tuning tasks, LLMTuner discovers programs with up to 1.47$\times$ better performance and 3.27$\times$ higher tuning efficiency than other cost models.
Moreover, LLMTuner with finetuned lightweight LLMs reduces the estimation time by over 30$\times$ compared to DeepSeek R1.
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 13627