Track: Paper
Keywords: Creative AI, Multimodal Generation, Tangible Interfaces, Vision-to-Audio, Generative Music, Human-AI Interaction, Large Language Objects, Industrial Design.
TL;DR: Lumia is a handheld device that uses a vision-language model to guide music generation, creating coherent loops from visual samples to enable real-time composition.
Abstract: Most digital music tools emphasize precision and control but often lack support for tactile, improvisational workflows grounded in environmental interaction. Lumia addresses this by enabling users to "compose through looking," transforming visual scenes into musical phrases with a handheld, camera-based interface and large multimodal models. A vision-language model (GPT-4V) analyzes captured imagery to generate structured prompts, which, combined with user-selected instrumentation, guide a text-to-music pipeline (Stable Audio). This real-time process lets users frame, capture, and layer audio interactively, producing loopable musical segments through embodied interaction. The system supports a co-creative workflow in which human intent and model inference jointly shape the musical outcome. By embedding generative AI within a physical device, Lumia bridges perception and composition, introducing a new modality for creative exploration that merges vision, language, and sound. It repositions generative music not as a task of parameter tuning but as an improvisational practice driven by contextual, sensory engagement.
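For concreteness, a minimal Python sketch of one capture-to-loop cycle is given below. This is not the authors' implementation: the GPT-4V step assumes the OpenAI chat-completions vision interface (the model name is illustrative), and `generate_music` is a hypothetical stand-in for a Stable Audio text-to-music call.

```python
# Illustrative sketch (not the authors' code) of one Lumia capture-to-loop
# cycle: camera image -> VLM-generated prompt -> text-to-music generation.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def image_to_prompt(image_path: str, instrumentation: str) -> str:
    """Ask a vision-language model to describe a scene as a structured music prompt."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative; any GPT-4V-class model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Describe this scene as a short text-to-music prompt "
                          "covering mood, tempo, and texture, scored for "
                          f"{instrumentation}.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=120,
    )
    return response.choices[0].message.content


def generate_music(prompt: str, seconds: int) -> bytes:
    """Hypothetical stand-in for a Stable Audio text-to-music request.

    A real integration would call a Stable Audio endpoint (or run the model
    locally) and return a loopable audio buffer.
    """
    raise NotImplementedError(f"would generate a {seconds}s loop for: {prompt!r}")


def capture_to_loop(image_path: str, instrumentation: str) -> bytes:
    """One frame-capture-layer cycle: image -> structured prompt -> loopable segment."""
    prompt = image_to_prompt(image_path, instrumentation)
    return generate_music(prompt, seconds=8)
```

In the workflow the abstract describes, a cycle like this would run each time the user frames and captures a scene, with the returned segment layered into the live loop.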
Submission Number: 157