Keywords: 3D Question Answering, Partial Information Decomposition, Multimodal Interpretability, Information-theoretic Decomposition
Abstract: 3D Question Answering remains largely opaque, with existing approaches functioning as black boxes that provide no insight into how different modalities contribute to spatial reasoning. This lack of interpretability limits trust and understanding in critical applications such as robotics and autonomous systems. We present ATOM (\textbf{A}daptive \textbf{T}ask-aware m\textbf{O}dular \textbf{M}odel), the first information-theoretic framework that operationalizes Partial Information Decomposition (PID) to achieve fully interpretable 3D question answering. ATOM explicitly decomposes multimodal interactions into four theoretically grounded information atoms: point cloud uniqueness, image uniqueness, redundancy (shared cross-modal information), and synergy (emergent complementary information). Our framework provides transparent reasoning through a Query-driven View Aggregator for geometrically consistent visual features, a Contextual Grounding Module for description-guided visual grounding, a Question-aware PID module with theoretically grounded regularization, and a Dynamic Atom Modulation mechanism that provides direct, quantifiable interpretability of each atom's contribution. Extensive experiments on the ScanQA and SQA3D datasets demonstrate that ATOM achieves performance comparable to prior work and, to the best of our knowledge, is the first to enable transparency in 3D reasoning. Our analysis reveals that different question types systematically rely on distinct information patterns that align with human spatial cognition.
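The four information atoms named in the abstract (two uniquenesses, redundancy, synergy) come from Partial Information Decomposition. As an illustration only, here is a minimal sketch of the classic Williams–Beer decomposition for two discrete sources and a target; the paper's Question-aware PID module operates on learned multimodal features, which this toy computation does not reproduce. The function name `pid_atoms` and the XOR example distribution are hypothetical choices for this sketch.

```python
import math
from collections import defaultdict

def pid_atoms(joint):
    """Williams-Beer PID of I(Y; X1, X2) into redundancy, unique-X1,
    unique-X2, and synergy atoms. `joint` maps (x1, x2, y) -> probability.
    Illustrative sketch only, not the paper's neural PID module."""
    # Accumulate the marginals needed for the mutual-information terms.
    p_y, p_x1, p_x2 = defaultdict(float), defaultdict(float), defaultdict(float)
    p_x1y, p_x2y, p_x12 = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x1, x2, y), p in joint.items():
        p_y[y] += p
        p_x1[x1] += p
        p_x2[x2] += p
        p_x1y[(x1, y)] += p
        p_x2y[(x2, y)] += p
        p_x12[(x1, x2)] += p

    def mi(p_xy, p_x):
        # I(Y; X) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
        return sum(p * math.log2(p / (p_x[x] * p_y[y]))
                   for (x, y), p in p_xy.items() if p > 0)

    i1, i2 = mi(p_x1y, p_x1), mi(p_x2y, p_x2)
    # Joint mutual information I(Y; X1, X2).
    i12 = sum(p * math.log2(p / (p_x12[(x1, x2)] * p_y[y]))
              for (x1, x2, y), p in joint.items() if p > 0)

    def specific_info(y, p_xy, p_x):
        # I_spec(Y=y; X) = sum_x p(x|y) log2( p(y|x) / p(y) )
        return sum((p / p_y[y]) * math.log2((p / p_x[x]) / p_y[y])
                   for (x, yy), p in p_xy.items() if yy == y and p > 0)

    # Williams-Beer redundancy: expected minimum specific information.
    redundancy = sum(p_y[y] * min(specific_info(y, p_x1y, p_x1),
                                  specific_info(y, p_x2y, p_x2))
                     for y in p_y)
    unique1 = i1 - redundancy
    unique2 = i2 - redundancy
    synergy = i12 - unique1 - unique2 - redundancy
    return {"redundancy": redundancy, "unique_x1": unique1,
            "unique_x2": unique2, "synergy": synergy}

# Toy check: for Y = X1 XOR X2 with uniform bits, neither source alone is
# informative, so the full 1 bit of I(Y; X1, X2) is attributed to synergy.
joint_xor = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
print(pid_atoms(joint_xor))  # synergy = 1 bit; other atoms = 0
```

In ATOM's setting the two sources would be the point-cloud and image streams, so a question answerable from either view alone would show up as redundancy, while one requiring both (like the XOR toy above) would show up as synergy.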
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 10776