Track: Type A (Regular Papers)
Keywords: Dense Prediction, Model Compression, Structured Pruning, Quantization, Efficient Vision Models
Abstract: Dense prediction tasks such as semantic segmentation and optical flow estimation, as well as image classification, require models that deliver high accuracy while sustaining the throughput needed for practical applications on mobile or portable computing devices. However, most state-of-the-art architectures rely on deep sequential operations that are computationally expensive and challenging to execute on consumer-grade parallel hardware; this often leads to reduced inference speed or degraded accuracy, limiting their applicability in real-time and edge scenarios. To address this challenge, we propose a novel self-compressing vision architecture that applies structured pruning and quantization to key modules (convolutional layers, transposed convolutions, and linear attention) in proportion to their parallel-time computational cost. By selectively reducing precision and pruning tensors in less critical layers, our approach achieves significant model compression. We evaluate our method on fine-grained classification (CUB-200-2011, Country211), semantic segmentation (ADE20K), and optical flow (HD1K). Our model matches the accuracy of state-of-the-art baselines (EfficientViT) at full precision (FP32) and surpasses them under lower-precision settings, while reducing storage, increasing throughput, and keeping training time comparable. Finally, we highlight that compression serves not only as a mechanism for reducing model size but also as a basis for investigating how model depth relates to inference-time performance.
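To make the self-compression idea in the abstract concrete, the following is a minimal PyTorch sketch of one way such a mechanism can be realized: a convolution whose per-output-channel bit-width is a learnable parameter, penalized by a differentiable size term. Channels whose bit-width collapses toward zero can be pruned after training (structured pruning), and the penalty can be weighted per layer by its parallel-time cost. The class name SelfCompressingConv2d, the size_penalty method, and the gamma/cost_weight terms are illustrative assumptions, not the authors' implementation from the linked repository.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfCompressingConv2d(nn.Module):
    # Convolution with a learnable per-output-channel quantization bit-width.
    # Channels whose bit-width falls to ~0 carry no information and can be
    # pruned after training; surviving weights are stored at low precision.
    def __init__(self, in_ch, out_ch, kernel_size, **kw):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, **kw)
        # b: learnable bit-depth, e: learnable scale exponent (per channel).
        self.b = nn.Parameter(torch.full((out_ch, 1, 1, 1), 8.0))
        self.e = nn.Parameter(torch.zeros(out_ch, 1, 1, 1))

    def _quantize(self, w):
        b = torch.relu(self.b)                 # bit-width must stay >= 0
        scale = 2.0 ** self.e
        q = torch.clamp(w / scale, -(2.0 ** (b - 1)), 2.0 ** (b - 1) - 1)
        q = q + (torch.round(q) - q).detach()  # straight-through estimator
        return q * scale

    def forward(self, x):
        return F.conv2d(x, self._quantize(self.conv.weight), self.conv.bias,
                        self.conv.stride, self.conv.padding,
                        self.conv.dilation, self.conv.groups)

    def size_penalty(self):
        # Differentiable proxy for the bits needed to store this layer.
        return torch.relu(self.b).sum() * self.conv.weight[0].numel()

A hypothetical training step would add the penalty to the task loss, with cost_weight chosen per layer in proportion to its parallel-time computational cost and gamma trading accuracy against compression:

layer = SelfCompressingConv2d(64, 128, 3, padding=1)
x = torch.randn(2, 64, 32, 32)
out = layer(x)
gamma, cost_weight = 1e-7, 1.0     # hypothetical values
loss = out.pow(2).mean() + gamma * cost_weight * layer.size_penalty()
loss.backward()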
Project at:
\url{https://github.com/adishourya/SelfcompressingDepthWiseAttn}
Serve As Reviewer: ~Guangzhi_Tang1, ~Chang_Sun1
Submission Number: 47