Keywords: Scene Understanding, Ego-Centric Tracing
Abstract: Ego-centric tracing with sparse yet informative cues is a fundamental capability of embodied agents operating in complex and dynamic environments. However, existing approaches typically address cue understanding and cue generation in isolation, which limits their synergy and significantly constrains agents’ ability to perceive and act effectively. To overcome this limitation, we propose a \textbf{Uni}fied \textbf{U}nderstanding–\textbf{G}eneration framework (\textbf{Uni-UG}) that tightly integrates a multi-granularity disentangled representation learning module for understanding with a controllable cue generation module. Specifically, a shared encoder first extracts features from multimodal inputs and interactive feedback, while a temporal attention mechanism dynamically adapts the representation to the evolving environment. The understanding module then disentangles these features into multi-granular sub-representations, capturing rich categorical and fine-grained attribute-level information about potential cues. Conditioned on these outputs and specified control signals, the generation module produces supplementary cue information. A joint loss function simultaneously optimizes understanding accuracy and generation quality, thereby enforcing semantic consistency between the two: the understanding module guides cue generation through the extracted categories, while the generated cues in turn iteratively refine the overall understanding process. Extensive experiments across multiple challenging datasets validate the effectiveness and generalizability of the Uni-UG framework.
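The joint objective described above can be sketched as a weighted sum of an understanding term and a generation term. The abstract does not specify the exact loss components, so the following is a minimal illustrative sketch assuming cross-entropy for the categorical understanding head and mean squared error for the generated cue features; all names and the weight `lam` are hypothetical.

```python
import math

def joint_loss(class_probs, true_class, gen_cue, target_cue, lam=0.5):
    """Hypothetical joint objective for Uni-UG-style training:
    understanding loss (negative log-likelihood of the true cue
    category) plus a generation loss (MSE over cue features),
    weighted by lam. Illustrative only; the paper's actual loss
    terms are not specified in the abstract."""
    # Understanding term: cross-entropy for the predicted cue category.
    l_und = -math.log(class_probs[true_class])
    # Generation term: mean squared error between generated and target cue features.
    l_gen = sum((g - t) ** 2 for g, t in zip(gen_cue, target_cue)) / len(gen_cue)
    return l_und + lam * l_gen
```

Optimizing both terms with shared parameters is what couples the two modules: gradients from the generation term flow back into the shared encoder that also serves the understanding head.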
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19530