Abstract: Video captioning is a computer vision task that aims to generate a description of video content. This can be achieved with deep learning approaches that leverage image and audio data. In this work, we have developed two strategies to tackle this task in the context of resource-constrained devices: (i) generating one caption per frame combined with audio classification, and (ii) generating one caption for a set of frames combined with audio classification. In both strategies, we have used one architecture for the image data and another for the audio data. We have developed an application tailored to resource-constrained devices, in which the image sensor captures images at a specific frame rate and the microphone records audio for a predefined duration at a time. The application combines the results from both modalities into a comprehensive description. The main contribution of this work is a new end-to-end application that can use either of the developed strategies and can be beneficial for environment monitoring. Our method has been implemented on a low-resource computer, which poses a significant challenge.
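To illustrate the kind of fusion the abstract describes, the following is a minimal sketch of strategy (i): one caption per frame combined with an audio classification result. All function names and the fusion rule are hypothetical placeholders for illustration only, not the paper's actual models or implementation.

```python
# Hypothetical sketch of per-frame captioning fused with audio classification.
# The captioner and audio classifier are stand-in stubs; in the real system
# they would be lightweight models suited to a resource-constrained device.

from typing import List, Any


def caption_frame(frame: Any) -> str:
    # Placeholder: a lightweight image-captioning model would run here.
    return "a person walking near a road"


def classify_audio(audio_clip: Any) -> str:
    # Placeholder: a lightweight audio-classification model would run here.
    return "traffic noise"


def describe(frames: List[Any], audio_clip: Any) -> str:
    """Combine per-frame captions with an audio label into one description."""
    captions = [caption_frame(f) for f in frames]
    audio_label = classify_audio(audio_clip)
    # Naive fusion (assumed): keep unique captions in order, append audio label.
    unique_captions = list(dict.fromkeys(captions))
    return "; ".join(unique_captions) + f" (audio: {audio_label})"


if __name__ == "__main__":
    # Dummy inputs standing in for sensor frames and a recorded audio window.
    print(describe(frames=[None, None, None], audio_clip=None))
```

Strategy (ii) would differ only in that a single caption is produced for the whole set of frames before being combined with the audio label.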