Abstract: This paper addresses the challenges of efficient image acquisition and processing in resource-constrained environments by introducing a sparsity-driven CMOS Image Sensor (CIS) architecture coupled with Vision Transformers (ViTs). The proposed approach incorporates a sensor-level dimensionality reduction technique that captures sparse, high-relevance features to improve power and computational efficiency. Key proposals for the pipeline include a re-designed CIS architecture that enables selective feature acquisition through on-sensor edge detection, an adaptive threshold mechanism that reduces ADC operations (> 50% for τ = 0.9) through edge-based pixel selection, and a co-designed memory management strategy focused on patch-wise data retention. Experimental evaluations on CIFAR-10, STL-10, Food-101, and Caltech-256 show data-volume reductions of ≈ 89%, 78%, 76%, and 81% for 100 selected patches while maintaining accuracies of ≈ 94%, 93%, 74%, and 82%, respectively. These improvements in the acquisition architecture demonstrate scalable, real-time processing potential on edge devices, making the proposed architecture a robust solution for low-power applications requiring efficient data acquisition and processing, particularly with ViTs.
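To make the selection idea in the abstract concrete, below is a minimal NumPy sketch of edge-based patch selection with a threshold τ and patch-wise retention. It is a software analogue only, not the sensor-level implementation: the gradient-magnitude edge map, the quantile-based use of τ, the 16×16 patch size, and the `select_patches` helper are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def select_patches(image, patch=16, tau=0.9, num_patches=100):
    """Rank image patches by edge density and keep only the most salient ones.

    Illustrative stand-in for the described pipeline: a simple gradient
    magnitude plays the role of on-sensor edge detection, and `tau` acts as
    the adaptive threshold that gates which pixels count as edges.
    """
    # Gradient-magnitude edge map (software stand-in for on-sensor edge detection).
    gy, gx = np.gradient(image.astype(np.float32))
    edges = np.hypot(gx, gy)
    # Adaptive threshold: keep pixels above the tau-quantile of edge strength.
    mask = edges > np.quantile(edges, tau)

    h, w = image.shape
    scores, coords = [], []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            scores.append(mask[r:r + patch, c:c + patch].mean())
            coords.append((r, c))

    # Retain only the top-ranked patches (patch-wise data retention).
    keep = np.argsort(scores)[::-1][:num_patches]
    return [(coords[i],
             image[coords[i][0]:coords[i][0] + patch,
                   coords[i][1]:coords[i][1] + patch])
            for i in keep]

# Example: keep 100 of the 256 patches in a random 256x256 grayscale frame.
frame = np.random.randint(0, 256, (256, 256), dtype=np.uint8)
selected = select_patches(frame, patch=16, tau=0.9, num_patches=100)
print(f"kept {len(selected)} patches out of {(256 // 16) ** 2}")
```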