Keywords: model compression, tensor programs, program synthesis, neural network optimization, efficient inference, hardware-aware compression, beyond quantization, automated ML
Abstract: Traditional model compression techniques such as quantization and pruning achieve significant efficiency gains, but they often degrade accuracy in complex models and fail to exploit hardware-specific optimizations. We present TensorCompress, a framework that uses tensor program synthesis to generate optimized computational graphs that go beyond what these conventional methods produce. Our approach combines automated program search with hardware-aware rewriting rules to produce compressed models that maintain accuracy while reducing inference time and memory footprint. We prove optimality bounds for the synthesized programs, and experiments on large-scale models show 50% better compression ratios than state-of-the-art quantization, with negligible accuracy loss across vision and language tasks. The framework also achieves a 3x speedup on edge devices and 70% energy savings in deployment scenarios.
Submission Number: 279