SAGE: A Unified Framework for Generalizable Object State Recognition with State-Action Graph Embedding

Yuan Zang, Zitian Tang, Junho Cho, Jaewook Yoo, Chen Sun

Published: 18 Sept 2025, Last Modified: 24 Sept 2025. Venue: Advances in Neural Information Processing Systems 38 (NeurIPS 2025). Readers: Everyone. License: CC BY 4.0

Abstract: Recognizing the physical states of objects and their transformations within videos is crucial for structured video understanding and for enabling robust real-world applications, such as robotic manipulation. However, pretrained vision-language models often struggle to capture these nuanced dynamics and their temporal context, and specialized object state recognition frameworks struggle to generalize to unseen actions or objects. We introduce SAGE (State-Action Graph Embeddings), a novel framework that offers a unified model of physical state transitions by decomposing states into fine-grained, language-described visual concepts that are sharable across different objects and actions. SAGE first leverages Large Language Models to construct a State-Action Graph, which is then multimodally refined using Vision-Language Models. Extensive experimental results show that our method significantly outperforms existing baselines and generalizes effectively to unseen objects and actions in open-world settings. Our method improves on the prior state of the art by as much as 14.6% on novel state recognition while using less than 5% of its inference time. Our code and data will be publicly released.
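To make the idea of states decomposed into shared, language-described visual concepts more concrete, the following is a minimal illustrative sketch in Python. It is not the authors' implementation: the class `StateActionGraph`, its methods, the example states and concepts, and the averaging rule in `score_state` are all hypothetical assumptions chosen for illustration. The abstract does not specify how the graph is stored or how concept scores are combined.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StateActionGraph:
    """Toy graph: actions link states, states link to visual concepts.

    Hypothetical layout for illustration only, not the paper's data structure.
    """

    # action name -> (state before the action, state after the action)
    transitions: dict[str, tuple[str, str]] = field(default_factory=dict)
    # state name -> set of language-described visual concepts
    concepts: dict[str, set[str]] = field(default_factory=dict)

    def add_state(self, state: str, concepts: list[str]) -> None:
        self.concepts[state] = set(concepts)

    def add_action(self, action: str, before: str, after: str) -> None:
        self.transitions[action] = (before, after)

    def shared_concepts(self, state_a: str, state_b: str) -> set[str]:
        """Concepts that two states have in common."""
        return self.concepts[state_a] & self.concepts[state_b]

    def score_state(self, state: str, concept_scores: dict[str, float]) -> float:
        """Average per-concept scores for a state (assumed rule; missing -> 0.0)."""
        names = sorted(self.concepts[state])
        return sum(concept_scores.get(name, 0.0) for name in names) / len(names)


graph = StateActionGraph()
graph.add_state("whole apple", ["intact surface", "single piece"])
graph.add_state("sliced apple", ["cut into pieces", "exposed interior"])
graph.add_state("sliced bread", ["cut into pieces", "flat slices"])
graph.add_action("slicing apple", "whole apple", "sliced apple")

# Placeholder per-concept scores, standing in for whatever a
# vision-language model would assign to a video frame.
frame_scores = {"cut into pieces": 0.5, "exposed interior": 0.75, "intact surface": 0.25}

shared = graph.shared_concepts("sliced apple", "sliced bread")
sliced_score = graph.score_state("sliced apple", frame_scores)
whole_score = graph.score_state("whole apple", frame_scores)
```

In this sketch, "sliced apple" and "sliced bread" share the concept "cut into pieces", which is the kind of cross-object sharing the abstract credits for generalization to unseen objects and actions.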