Quat-DGNet: Enhancing 3D Dense Captioning with Quaternion-Based Spatial Offsets and Dynamic Neighborhood Graphs

Published: 2024, Last Modified: 23 Jan 2026, PRCV (6) 2024, CC BY-SA 4.0
Abstract: 3D dense captioning aims to generate detailed and accurate descriptions for objects in a 3D scene. Because a one-stage (detect-and-describe) model has no detector to supply proposals as local information to the encoder, global and local information are imbalanced in the encoding stage. To address this problem, we propose Quat-DGNet, a novel model that compensates for the insufficient local information. Specifically, we propose two components, Quat-B and DNG, which capture positional offsets and model local relationship graphs, respectively. Quat-B constrains the point cloud coordinates to a quaternion space; the quaternion representation is well suited to parameterizing smooth rotations and spatial transformations in vector space. We design a loss function that describes the offset more accurately and moves the points towards the object. DNG supplements local geometric features by constructing dynamic point cloud relationship graphs, which maintain alignment invariance and capture local geometric features, thus improving the diversity and quality of the generated descriptions. Comprehensive experiments demonstrate that our model outperforms existing efficient models.
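The two ideas named in the abstract, quaternion-parameterized spatial transformations and neighborhood graphs rebuilt from the current point positions, can be illustrated with a minimal sketch. This is not the authors' Quat-B or DNG implementation, which the abstract does not specify. It assumes a plain k-nearest-neighbor graph and a standard unit-quaternion rotation, and every function name below (`quat_from_axis_angle`, `quat_mul`, `rotate_point`, `knn_graph`) is hypothetical.

```python
import math


def quat_from_axis_angle(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` radians about `axis`."""
    ax, ay, az = axis
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    s = math.sin(angle / 2.0) / norm
    return (math.cos(angle / 2.0), ax * s, ay * s, az * s)


def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def rotate_point(q, p):
    """Rotate 3D point p by unit quaternion q via q * (0, p) * conj(q)."""
    w, x, y, z = q
    conj = (w, -x, -y, -z)
    _, rx, ry, rz = quat_mul(quat_mul(q, (0.0, *p)), conj)
    return (rx, ry, rz)


def knn_graph(points, k):
    """For each point, list the indices of its k nearest neighbors (self excluded).

    Assumption: a plain Euclidean k-NN graph stands in for the paper's
    dynamic neighborhood graph; calling this again after the points move
    is what makes the graph "dynamic" in this sketch.
    """
    graph = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph.append([j for _, j in dists[:k]])
    return graph


if __name__ == "__main__":
    q = quat_from_axis_angle((0.0, 0.0, 1.0), math.pi / 2)
    print(rotate_point(q, (1.0, 0.0, 0.0)))  # approximately (0, 1, 0)

    cloud = [(0, 0, 0), (1, 0, 0), (5, 0, 0), (6, 0, 0)]
    print(knn_graph(cloud, k=1))
```

Rebuilding the graph with `knn_graph` after each coordinate update keeps every neighborhood consistent with the current point positions. A learned model would typically build such a graph in feature space rather than on raw coordinates.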