Deep Recurrent Architecture based Scene Description Generator for Visually Impaired

Aviral Chharia, Rahul Upadhyay

Published: 2020, Last Modified: 13 Sept 2023, ICUMT 2020, Readers: Everyone

Abstract: Vision is the most essential sense for human beings, yet more than 2.2 billion people worldwide live with some form of vision impairment. This paper presents an end-to-end, human-centric model for aiding the visually impaired, built on the deep recurrent architecture of state-of-the-art image captioning models. A VGG-16 convolutional neural network (CNN) extracts feature vectors from the image frames of real-time video, and a long short-term memory (LSTM) network generates captions from these feature vectors. The model is tested on the Flickr 8K dataset, one of the most widely used image captioning datasets, which contains over 8000 images. On real-time video, the model generates rich descriptive captions, which are converted to audio for a visually impaired person to listen to. Overall, the model produces promising results and has great potential to improve the lives of visually impaired people by helping them better understand their surroundings.
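The pipeline described in the abstract (frame, feature vector, caption, audio) can be sketched roughly as below. This is only an illustrative outline under stated assumptions, not the authors' implementation: the CNN, the LSTM, and the text-to-speech step are passed in as caller-supplied functions (`extract_features`, `next_word_probs`, `speak`, all hypothetical names), the `startseq`/`endseq` sentinel tokens are assumed, and greedy word-by-word decoding is an assumption, since the abstract does not say how captions are decoded.

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical sentinel tokens marking the start and end of a caption.
START, END = "startseq", "endseq"

# Assumed interface: given image features and the words generated so far,
# return a probability for each candidate next word (the LSTM's role).
NextWordFn = Callable[[Any, List[str]], Dict[str, float]]


def greedy_caption(features: Any, next_word_probs: NextWordFn,
                   max_len: int = 20) -> str:
    """Build a caption by repeatedly taking the most probable next word."""
    words = [START]
    for _ in range(max_len):
        probs = next_word_probs(features, words)
        best = max(probs, key=probs.get)
        if best == END:  # the model signals that the caption is complete
            break
        words.append(best)
    return " ".join(words[1:])  # drop the start token


def describe_frames(frames: Iterable[Any],
                    extract_features: Callable[[Any], Any],
                    next_word_probs: NextWordFn,
                    speak: Callable[[str], None]) -> List[str]:
    """Caption each frame and hand the caption to a text-to-speech callback."""
    captions = []
    for frame in frames:
        features = extract_features(frame)  # the CNN's role (VGG-16 in the paper)
        caption = greedy_caption(features, next_word_probs)
        speak(caption)                      # audio output for the listener
        captions.append(caption)
    return captions
```

In a real system, `extract_features` would wrap a pretrained VGG-16, `next_word_probs` a trained LSTM decoder, and `speak` a text-to-speech engine; keeping them as parameters lets the control flow be exercised with simple stand-ins.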
