Keywords: accessibility, vision-language models, quantization, bias mitigation, assistive technology, real-time systems, edge AI
TL;DR: A real-time vision-language system for blind users that combines efficient quantization, action-aware prompting, and multi-stage bias mitigation, achieving 89% obstacle recall at 760 ms latency while reducing demographic bias by 72%.
Abstract: This paper presents a real-time vision-language system optimized for assistive accessibility, combining three key innovations: (1) hybrid 4/8-bit quantization for efficient edge deployment, (2) reinforcement learning-based dynamic prompting for actionability, and (3) multi-stage bias mitigation. Our method achieves 89.1% obstacle recall (a 20.9% improvement over SeeingAI) with 760 ms latency on mobile devices, while reducing demographic bias by 72% compared to standard VLMs. Evaluations on VizWiz-Grounding and FairFace demonstrate superior performance across accuracy (CIDEr 84.9), fairness (Disability Error 0.14), and usability (4.5/5 user rating). The system addresses critical gaps in assistive technology through techniques such as whitened feature projection and adaptive thresholding, enabling inclusive AI-powered accessibility without compromising real-time performance.
Submission Number: 22
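
The abstract names whitened feature projection as one of the bias-mitigation stages but does not spell it out. The sketch below shows one plausible reading of that step, assuming it means whitening the VLM embeddings and then projecting out a demographic subspace estimated from group-labeled examples (e.g., FairFace annotations). The function name, arguments, and the group-mean construction of the subspace are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def whitened_feature_projection(features, demographic_labels, eps=1e-5):
    """Hypothetical sketch of whitened feature projection for bias mitigation.

    1. Whiten the feature matrix (zero mean, identity covariance) so that
       dimension-to-dimension correlations do not hide demographic signal.
    2. Estimate demographic directions as group means in the whitened space
       and project that subspace out of every feature vector.

    features           : (N, D) array of VLM image/text embeddings
    demographic_labels : (N,) integer group labels, used only to estimate
                         the subspace to remove
    """
    # Center and whiten: x_w = Sigma^{-1/2} (x - mu)
    mu = features.mean(axis=0)
    centered = features - mu
    cov = np.cov(centered, rowvar=False) + eps * np.eye(features.shape[1])
    eigvals, eigvecs = np.linalg.eigh(cov)
    whitener = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    whitened = centered @ whitener

    # Demographic subspace: one mean direction per group in whitened space
    groups = np.unique(demographic_labels)
    directions = np.stack(
        [whitened[demographic_labels == g].mean(axis=0) for g in groups]
    )
    # Orthonormal basis of that subspace (reduced QR on the D x G matrix)
    basis, _ = np.linalg.qr(directions.T)

    # Remove the demographic component from every whitened feature
    debiased = whitened - (whitened @ basis) @ basis.T
    return debiased
```

In this reading, the debiased features would feed the captioning head in place of the raw embeddings; how the paper combines this with its other bias-mitigation stages (e.g., adaptive thresholding) is not described in the abstract.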