Abstract: Compressive learning (CL) has proven highly successful at joint signal sampling and inference for intricate vision tasks on resource-limited Internet of Things (IoT) devices. Recent studies have turned to deep neural network (DNN) methodology, also known as DeepCL, to enhance performance on unimodal vision tasks; this approach incorporates learnable compressed sensing in a comprehensive, end-to-end manner. Current DeepCL techniques typically perform an initial signal reconstruction, whose output feeds subsequent DNNs for inference. However, this practice carries potential risks, such as privacy breaches and reduced performance due to the data processing inequality. To address these issues, this article introduces the first cross-modal CL (CMCL) approach, which performs image captioning directly on compressed measurements. Compared with previous DeepCL strategies, the proposed CMCL offers significant improvements in computational efficiency and privacy protection. Extensive experiments demonstrate that CMCL performs nearly on par with leading image captioning methods, with a metric value merely 2.75% lower than that of the uncompressed method when the data is compressed eightfold.
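As background for the sensing step the abstract describes, compressed sensing acquires a signal x through a linear map y = Φx with far fewer measurements than signal entries. The following NumPy sketch illustrates eightfold compression with a random Gaussian matrix; the shapes, the fixed random Φ, and all variable names are illustrative assumptions, not the paper's learned sensing operator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative signal: a flattened 32x32 image patch (n = 1024 entries).
n = 32 * 32
x = rng.standard_normal(n)

# Eightfold compression: m = n / 8 random Gaussian measurements.
# (A DeepCL system would instead learn this matrix end-to-end.)
m = n // 8
Phi = rng.standard_normal((m, n)) / np.sqrt(m)

# Compressed measurements y = Phi @ x. In the CMCL setting described
# above, downstream inference (image captioning) would operate on y
# directly, without reconstructing x first.
y = Phi @ x

print(x.shape, y.shape)  # (1024,) (128,)
```

Skipping the reconstruction step is what yields the efficiency and privacy benefits claimed in the abstract: the raw image x never needs to be recovered on (or transmitted from) the device.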