Abstract: Predicting 3D Semantic Scene Graphs (3DSSG) is vital for understanding complex scenes, because it yields structured representations of them. Current methods struggle with the large granularity discrepancies among instances: they often rely on features at a single scale, which hampers their ability to perceive and interact with instances of different sizes. To tackle this challenge, we introduce GranSSG, a novel approach that integrates volumetric granular awareness into 3DSSG prediction. Central to GranSSG is the Volumetric Pooling block, which aggregates features from multiple instance volumes, enhancing the representation of instance patterns across different granularities. Complementing this, the Granularity Transformer block dynamically directs attention to instance features across various network layers, so that instances are perceived precisely regardless of their granularity. Furthermore, the Cross-Granularity Correlation Transformer block mitigates performance degradation in instance-pair relationship prediction by adaptively fusing hybrid features from different granularities, providing a comprehensive representation of instance pairs. Extensive evaluations on the challenging 3DSSG benchmark demonstrate that GranSSG significantly improves prediction performance, setting a new state of the art in 3DSSG prediction.