Keywords: NLP, TTS, OCR, Vision Transformer, LMMs
Abstract: Comics are a hybrid storytelling medium combining imagery
and text. Despite their popularity, comics are often inaccessible
to visually impaired individuals and non-native readers. We introduce
ComicVerse, an AI-driven system that transforms comic PDFs
into narrated stories and audiobooks using a combination of deep
learning techniques. The system leverages vision-language models
(GPT-4o-mini), large language models for narrative synthesis, and neural
text-to-speech (TTS-1) systems to produce high-quality audio. A key
novelty lies in fusing visual data (images) with OCR text to form a
multimodal prompt for story generation. The system supports style
control and multilingual output, and is deployed via a user-friendly
Streamlit interface. Our method illustrates the integration of modern
deep learning APIs into a practical, creative AI application, enabling
inclusive and dynamic storytelling from static visual media.
Submission Number: 22
Loading