Tackling Real-World Complexity: Hierarchical Modeling and Dynamic Prompting for Multimodal Long Document Classification
Abstract: With the rapid growth of internet content, multimodal long document data has become increasingly prominent, drawing significant attention from researchers. However, most existing methods focus primarily on scenarios where all modalities are present, often overlooking the more challenging and realistic case of a missing image modality. To address this limitation, we propose a robust multimodal long document classification (MLDC) framework that integrates hierarchical modeling and dynamic prompting to handle complex multimodal long document data. Our approach begins by leveraging hierarchical modeling combined with an Adaptive Correlation Multimodal Transformer (ACMT) to effectively capture relationships between text and images at both the section and sentence levels. We also introduce a Dynamic Prompt Generation (DPG) module at both levels to enhance the model's robustness to missing image data. By evaluating sample uncertainty, the DPG module dynamically adjusts both the number of prompts and the prompts themselves, allowing the model to better adapt to the varying needs of different samples. Finally, a Hierarchical Heterogeneous Graph (HHG) is introduced to strengthen feature interactions across levels, further improving the coherence and accuracy of the model. Extensive experiments on four multimodal long document datasets demonstrate that our model achieves superior performance compared to existing state-of-the-art MLDC methods under various conditions.
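The abstract does not specify how the DPG module maps sample uncertainty to a prompt budget. The sketch below is purely illustrative and not the paper's method: the function names, the entropy-based uncertainty measure, and the linear mapping from uncertainty to prompt count are all assumptions.

```python
# Hypothetical sketch of uncertainty-driven dynamic prompting. The paper's DPG
# module is not specified here; the entropy heuristic and linear budget rule
# are illustrative assumptions, not the authors' design.
import numpy as np

def predictive_entropy(probs: np.ndarray) -> float:
    """Shannon entropy of a class-probability vector (higher = more uncertain)."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def select_prompts(probs: np.ndarray, prompt_pool: np.ndarray,
                   min_prompts: int = 1, max_prompts: int = 8) -> np.ndarray:
    """Scale the number of prompts with sample uncertainty and return that many
    prompt vectors from a pool (here simply the first k, for brevity)."""
    max_entropy = np.log(len(probs))               # entropy of a uniform distribution
    u = predictive_entropy(probs) / max_entropy    # normalized uncertainty in [0, 1]
    k = min_prompts + int(round(u * (max_prompts - min_prompts)))
    return prompt_pool[:k]

pool = np.random.randn(8, 16)                      # a pool of 8 prompt vectors, dim 16
confident = np.array([0.97, 0.01, 0.01, 0.01])     # low uncertainty -> few prompts
uncertain = np.array([0.25, 0.25, 0.25, 0.25])     # maximal uncertainty -> full budget
print(select_prompts(confident, pool).shape[0])    # → 2
print(select_prompts(uncertain, pool).shape[0])    # → 8
```

A learned module would of course score and select pool entries per sample rather than taking a prefix; the point here is only the control flow of conditioning the prompt budget on uncertainty.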