Transformer-Based Nonautoregressive Image Captioning via Guided Keyword Generation and Learnable Positional Encoding for IoT Devices
Abstract: The emergence of the Intelligent Internet of Things (IIoT) has brought data processing closer to data sources, especially for real-time surveillance video processing and image analysis. Image captioning plays a crucial role in understanding images. However, the Transformer architecture, which has become prevalent in recent applications, increases the computational resources that image captioning models require. Most existing methods also follow the autoregressive paradigm, which reduces computational efficiency on edge devices and causes significant inference delays. In this article, we adopt the nonautoregressive paradigm to improve the inference speed and efficiency of image captioning models. Nevertheless, the lack of effective decoder inputs leaves a performance gap between nonautoregressive and autoregressive models. To bridge this gap, we propose learnable positional encoding and keyword-guided nonautoregressive image captioning. First, a diffusion model guided by image features generates keywords that accurately reflect the image content, thereby supplying the nonautoregressive decoder with a substantial amount of semantic information. Second, learnable positional encoding guides the decoder to generate appropriate words at the correct positions within the caption. Extensive experiments on widely used benchmarks demonstrate that our model achieves state-of-the-art performance in nonautoregressive image captioning while maintaining a competitive inference speed.
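The difference in inference cost between the two paradigms can be illustrated with a toy sketch. This is not the model described in the abstract: `TARGET`, `toy_decoder_pass`, and the two decode functions are hypothetical names introduced for illustration only, and the "decoder" is a lookup table rather than a Transformer. The sketch only counts how many sequential decoder passes each paradigm needs for a caption of a given length.

```python
from typing import List, Sequence, Tuple

# Toy caption; a real model would predict words from image features.
TARGET = ["a", "dog", "runs", "on", "grass"]


def toy_decoder_pass(position: int, context: Sequence[str]) -> str:
    """Hypothetical stand-in for one decoder prediction at `position`.

    `context` represents what the decoder conditions on (a word prefix or
    a keyword list); this toy ignores it and looks the word up directly.
    """
    return TARGET[position]


def autoregressive_decode(length: int) -> Tuple[List[str], int]:
    caption: List[str] = []
    sequential_passes = 0
    for position in range(length):
        # Each word conditions on the words generated so far, so the
        # steps must run one after another.
        caption.append(toy_decoder_pass(position, caption))
        sequential_passes += 1
    return caption, sequential_passes


def nonautoregressive_decode(
    keywords: Sequence[str], length: int
) -> Tuple[List[str], int]:
    # Every position conditions only on the keywords and its own index,
    # so the predictions are independent of each other. A real model
    # would compute them in a single batched pass; the list comprehension
    # here merely stands in for that one pass.
    caption = [toy_decoder_pass(position, keywords) for position in range(length)]
    return caption, 1


ar_caption, ar_passes = autoregressive_decode(len(TARGET))
nar_caption, nar_passes = nonautoregressive_decode(["dog", "grass"], len(TARGET))
print(ar_passes, nar_passes)  # prints "5 1"
```

In this toy setting both paradigms return the same caption, but the autoregressive loop needs one sequential pass per word while the nonautoregressive version needs one pass in total. In a real model, the position index would enter through the positional encoding and the keywords would provide the semantic input.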
External IDs: dblp:journals/iotj/LiuYLHZL25