Abstract: Highlights
• Cross-modal graphs are dynamically constructed to capture image-text interactions.
• The graph structure is adapted to specific tasks via label-driven reasoning.
• Internal representations are dynamically learned and adjusted for each specific task.
• Achieves cutting-edge performance on cross-modal textbook reasoning tasks.