Track: Extended abstracts (2 pages)
Keywords: Vision Language Model, Scene Understanding
Abstract: Understanding complex environments requires capturing the arrangement of objects, their interactions, and contextual information. Early symbolic and data-driven approaches are limited by rigid designs or narrow applicability. Recent vision-language models (VLMs) provide rich priors and flexible reasoning, supporting the generation of structured scene descriptions that handle compositional arrangements, diverse categories, and realistic constraints. However, challenges remain in precise spatial reasoning, consistent object placement, and maintaining coherent geometry. We present a VLM-driven pipeline for scene representation generation, analyze its shortcomings through a case study, and suggest avenues for future enhancements.
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 17