ShadowFlow: Learning Ambient Shadow Motion as a Non-Visual State Modality for Embodied Language Interaction
Keywords: Device-Free Localization, Ambient Light Sensing, Shadow Motion Representation, Multi-View Fusion, Embodied State Perception
Abstract: Language-grounded embodied agents require accurate and continuous human state localization in indoor environments, but camera-based tracking is often unacceptable in privacy-sensitive applications.
Existing device-free approaches under unmodulated light lack a structured motion representation that supports sparse sensing and multi-view sequence learning.
To address this gap, we present ShadowFlow, a non-imaging framework that infers continuous 2D trajectories from ambient illumination using sparse photodiode (PD) arrays, without active modulation or visual capture.
ShadowFlow lifts sparse PD readings into a differentiable grayscale shadow field on a virtual wall and derives a compact shadow flow tensor using lightweight optical flow operators.
Because shadow deformation is view-dependent and spatially heterogeneous, ShadowFlow encodes each view with parallel attention encoders and recurrently fuses the views to aggregate complementary spatial cues for trajectory regression.
On 927 minutes of real-world recordings from seven participants in two indoor layouts, ShadowFlow achieves centimeter-level accuracy, with a 2.35 cm mean localization error, and supports real-time inference on embedded hardware.
These results indicate that ambient shadow flow is a practical non-visual motion modality that supports cross-modal grounding for embodied language interaction and robotic perception.
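The flow derivation described above can be sketched in simplified form. The abstract does not specify the exact operators, so the snippet below is only an illustrative assumption: a synthetic grayscale shadow field (a dark blob on a virtual wall) and a single-window least-squares optical-flow estimate in the Lucas-Kanade style. The names `shadow_field` and `global_flow` are hypothetical, not the paper's API.

```python
import numpy as np

def shadow_field(cx, cy, size=32, sigma=3.0):
    # Hypothetical stand-in for the lifted shadow field: a dark Gaussian
    # blob (the shadow) on a bright virtual wall.
    y, x = np.mgrid[0:size, 0:size]
    return 1.0 - np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))

def global_flow(f0, f1):
    # Least-squares solution of the brightness-constancy constraint
    # Ix*u + Iy*v + It = 0, with one window covering the whole field
    # (a minimal "lightweight optical flow operator").
    Iy, Ix = np.gradient(f0)        # spatial gradients
    It = f1 - f0                    # temporal gradient
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# Shadow shifted by +1 px along x between consecutive frames.
f0 = shadow_field(14.0, 16.0)
f1 = shadow_field(15.0, 16.0)
u, v = global_flow(f0, f1)
```

In the full system this per-frame estimate would be dense and stacked over time into the shadow flow tensor; the global two-parameter fit here is only the simplest correct instance of the idea.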
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: vision-language navigation, cross-modal pretraining, cross-modal application, multimodal applications, multimodal grounding, cross-modal information extraction
Contribution Types: Model analysis & interpretability, Data analysis, Theory
Languages Studied: English
Submission Number: 3375