Multitask Asynchronous Bidirectional Multimodal Agent for Personalized Treatment Companions

Published: 06 Oct 2025 · Last Modified: 06 Oct 2025 · NeurIPS 2025 2nd Workshop FM4LS Poster · CC BY 4.0
Keywords: asynchronous task, bidirectional streaming, multimodal agent, multimodal RAG, personalized treatment
TL;DR: This paper introduces a streaming multimodal agent that integrates vision, speech, and retrieval-augmented reasoning to support real-time monitoring, dialogue, and adaptive guidance for trustworthy personalized treatment.
Abstract: Personalized treatment requires intelligent systems that can continuously monitor patients, adapt to evolving conditions, and communicate naturally with both patients and clinicians. Existing healthcare technologies often rely on unimodal data streams (e.g., wearables or medical imaging) or offline analysis, limiting their responsiveness and interactivity. In this work, we present a Multitask Asynchronous Bidirectional Multimodal Agent powered by a multimodal large language model (MLLM) and integrated with retrieval-augmented generation (RAG) over multimodal sources, including text, images, and video. The agent combines vision (video) and audio (speech) for patient monitoring and natural interaction, supporting real-time personalized treatment. We define three representative asynchronous tasks: (i) vision-based patient monitoring for mobility, posture, and facial cues; (ii) aggregation of health metrics for adaptive treatment planning; and (iii) speech-based dialogue to engage patients and support clinician decision-making. Our architecture integrates Gemini's multimodal reasoning with a WebSocket-based backend for bidirectional streaming interaction, enabling both proactive alerts and conversational explanations. Evaluation on simulated healthcare monitoring datasets demonstrates improved accuracy in patient state recognition, reduced latency in adaptive feedback, and enhanced interpretability compared to unimodal baselines. Evaluation in simulated healthcare communication scenarios shows strong performance: a Usefulness Metric of 0.78, a Relevance Metric of 0.93, a Hallucination Metric of 0.3, a Contain Metric of 0.88, an Equals Metric of 0.88, and a Sentence BLEU score of 0.98. These results highlight the potential of multimodal agents to act as personalized treatment companions, advancing adaptive, human-centered healthcare.
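To make the bidirectional-streaming design concrete, below is a minimal sketch (not from the paper) of a WebSocket backend in which each asynchronously arriving vision or speech frame may trigger a proactive alert or a conversational reply over the same connection. The `analyze_frame` function, its frame fields, and the thresholds are hypothetical stand-ins for the paper's MLLM + multimodal RAG pipeline.

```python
# Minimal sketch, assuming a JSON frame protocol and the `websockets` library.
# `analyze_frame` is a hypothetical placeholder for MLLM + RAG reasoning;
# a real system would forward payloads to the multimodal model (e.g., Gemini)
# together with retrieved clinical context.
import asyncio
import json

import websockets  # pip install websockets


async def analyze_frame(frame: dict) -> dict | None:
    """Hypothetical per-frame reasoning: returns zero or one outgoing message."""
    if frame.get("type") == "vision" and frame.get("fall_score", 0.0) > 0.8:
        return {"kind": "alert", "text": "Possible fall detected."}
    if frame.get("type") == "speech":
        return {"kind": "reply", "text": f"You said: {frame.get('text', '')}"}
    return None  # no proactive output for this frame


async def handle_client(websocket):
    # Bidirectional loop: monitoring alerts and dialogue replies share one
    # connection, so the agent can both push and respond asynchronously.
    async for raw in websocket:
        frame = json.loads(raw)
        response = await analyze_frame(frame)
        if response is not None:
            await websocket.send(json.dumps(response))


async def main():
    async with websockets.serve(handle_client, "localhost", 8765):
        await asyncio.Future()  # run until cancelled


if __name__ == "__main__":
    asyncio.run(main())
```

In this sketch the three asynchronous tasks would simply be three frame types multiplexed over the socket; the single-connection design is what lets proactive alerts interleave with conversational turns without polling.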
Submission Number: 80