Toward Trustworthy Vision-Language Reporting for Tremor Assessment under Distribution Shift

Published: 21 Apr 2026, Last Modified: 21 Apr 2026, TrustVLM, CC BY 4.0
Keywords: Vision-Language Models, Abstention, Uncertainty, Calibration, Healthcare
Abstract: Vision-language models (VLMs) are increasingly used in high-stakes workflows, yet reliable deployment depends on more than raw multimodal capability. In healthcare settings, trustworthy use additionally requires calibration under distribution shift, selective abstention, and bounded reporting grounded in structured evidence. We present a VLM-assisted framework for tremor assessment from monocular RGB video, in which modular hand-object perception and temporal modeling first extract structured clinical evidence, and a constrained reporting layer then generates clinician-facing or patient-facing outputs under uncertainty-aware abstention. A baseline-aware patient state supports longitudinal comparison against prior function. We evaluate the system on a pilot dataset of Parkinson’s disease, essential tremor, and control participants recorded with multiple consumer devices and viewpoints. Beyond strong clinician-aligned severity estimation, the main result is that constrained VLM reporting with abstention substantially reduces unsupported outputs compared with free-form and forced-answer baselines, while remaining stable under moderate device and viewpoint shift. These findings suggest that trustworthy VLM use in healthcare benefits from structured intermediate representations, calibration, selective prediction, and abstaining assistance rather than unrestricted multimodal generation.
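The abstract's uncertainty-aware abstention can be illustrated with a minimal selective-prediction sketch. This is not the authors' implementation: the function names, the confidence threshold, and the toy severity labels are all illustrative assumptions; the only idea taken from the abstract is that the system abstains on low-confidence cases and is evaluated by coverage (fraction answered) and risk (error rate on answered cases).

```python
# Hedged sketch of selective prediction with a confidence threshold.
# All names and values here are illustrative, not from the paper.

def selective_predict(confidence, prediction, threshold=0.8):
    """Return the prediction if confident enough, otherwise abstain (None)."""
    return prediction if confidence >= threshold else None

def coverage_and_risk(confidences, predictions, labels, threshold=0.8):
    """Coverage = fraction of cases answered; risk = error rate on answered cases."""
    answered = [(p, y) for c, p, y in zip(confidences, predictions, labels)
                if c >= threshold]
    coverage = len(answered) / len(labels)
    risk = (sum(p != y for p, y in answered) / len(answered)) if answered else 0.0
    return coverage, risk

# Toy example: the one wrong prediction has low confidence and is abstained on.
confs = [0.95, 0.90, 0.55, 0.85]
preds = ["mild", "moderate", "severe", "mild"]
truth = ["mild", "moderate", "mild", "mild"]
cov, risk = coverage_and_risk(confs, preds, truth)
# cov == 0.75 (3 of 4 cases answered), risk == 0.0 (all answered cases correct)
```

Under this framing, lowering the threshold raises coverage but admits more unsupported outputs, which is the trade-off the paper's forced-answer baseline sits at one extreme of.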
Submission Number: 7