Keywords: Scene Understanding, Ego-Centric Tracing
Abstract: Ego-centric tracing with sparse yet informative cues is a fundamental capability of embodied agents operating in complex and dynamic environments. However, existing approaches typically address cue understanding and cue generation in isolation, which limits their synergy and significantly constrains agents’ ability to perceive and act effectively. To overcome this limitation, we propose a \textbf{Uni}fied \textbf{U}nderstanding–\textbf{G}eneration framework (\textbf{Uni-UG}) that tightly integrates a multi-granularity disentangled representation learning module for understanding with a controllable cue generation module. Specifically, a shared encoder first extracts features from multimodal inputs and interactive feedback, while a temporal attention mechanism dynamically adapts the representation to the evolving environment. The understanding module then disentangles these features into multi-granular sub-representations, capturing rich categorical and fine-grained attribute-level information about potential cues. Conditioned on these outputs and specified control signals, the generation module produces supplementary cue information. A joint loss function simultaneously optimizes understanding accuracy and generation quality, thereby enforcing semantic consistency between the two: the understanding module guides cue generation through the extracted categories, while the generated cues in turn iteratively refine the overall understanding process. Extensive experiments across multiple challenging datasets validate the effectiveness and generalizability of the Uni-UG framework.
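The joint objective described above can be sketched as a weighted sum of an understanding term and a generation term. The abstract does not specify the exact loss components, so the following is a minimal illustrative sketch assuming cross-entropy for the categorical understanding head and mean squared error for the generated cue features; all names and the weight `lam` are hypothetical.

```python
import math

def joint_loss(class_probs, true_class, gen_cue, target_cue, lam=0.5):
    """Hypothetical joint objective for Uni-UG-style training:
    understanding loss (negative log-likelihood of the true cue
    category) plus a generation loss (MSE over cue features),
    weighted by lam. Illustrative only; the paper's actual loss
    terms are not specified in the abstract."""
    # Understanding term: cross-entropy for the predicted cue category.
    l_und = -math.log(class_probs[true_class])
    # Generation term: mean squared error between generated and target cue features.
    l_gen = sum((g - t) ** 2 for g, t in zip(gen_cue, target_cue)) / len(gen_cue)
    return l_und + lam * l_gen
```

Optimizing both terms with shared parameters is what couples the two modules: gradients from the generation term flow back into the shared encoder that also serves the understanding head.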
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19530