Rethinking Graph-Based Document Classification: Learning Data-Driven Structures Beyond Heuristic Approaches

17 Sept 2025 (modified: 06 Jan 2026), ICLR 2026 Conference Withdrawn Submission, Everyone, CC BY 4.0
Keywords: Document Classification, Graph-based Document Representation, Self-Attention, Representation Learning, Graph Neural Networks
TL;DR: A data-driven framework for document classification that automatically induces graph structures, eliminating the need for manually designed heuristics and task-specific rules and reducing domain dependence.
Abstract: In document classification, graph-based models effectively capture document structure and overcome sequence length limitations, enhancing contextual understanding. However, existing graph document representations often rely on heuristics, domain-specific rules, or expert knowledge. We propose a novel method to learn data-driven graph structures, eliminating the need for manual design and reducing domain dependence. Our approach constructs homogeneous weighted graphs with sentences as nodes, while edges are learned via a self-attention model that identifies dependencies between sentence pairs. A statistical filtering strategy retains only strongly correlated sentences, improving graph quality while reducing the graph size. Experiments on three datasets show that learned graphs consistently outperform heuristic-based baselines and recent small language models, achieving higher accuracy and $F_1$ score. Furthermore, our study demonstrates the effectiveness of the statistical filtering in improving classification robustness, highlighting the potential of automatic graph generation over traditional heuristic approaches and opening new directions for broader applications in NLP.
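As a rough illustration of the pipeline the abstract describes (sentences as nodes, attention-derived edge weights, then statistical filtering), here is a minimal sketch. It is not the authors' implementation. The function names `attention_weights` and `filter_edges`, the untrained scaled dot-product attention over raw embeddings, and the "mean plus k standard deviations" threshold are all assumptions made for illustration; the abstract does not specify the attention model or the filtering statistic.

```python
import numpy as np

def attention_weights(sent_emb):
    """Scaled dot-product attention between sentence embeddings.

    Hypothetical stand-in for a learned self-attention model: it uses the
    raw embeddings as both queries and keys. Assumes at least two sentences.
    """
    d = sent_emb.shape[1]
    scores = sent_emb @ sent_emb.T / np.sqrt(d)
    np.fill_diagonal(scores, -np.inf)                    # exclude self-loops
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1

def filter_edges(weights, k=1.0):
    """Keep sentence pairs whose weight exceeds mean + k * std.

    The threshold rule is an assumption, chosen only to show the idea of
    retaining strongly related sentence pairs.
    """
    n = weights.shape[0]
    off_diag = weights[~np.eye(n, dtype=bool)]
    threshold = off_diag.mean() + k * off_diag.std()
    return [
        (i, j, float(weights[i, j]))
        for i in range(n)
        for j in range(n)
        if i != j and weights[i, j] > threshold
    ]

# Usage with random vectors standing in for six sentence embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 8))
weights = attention_weights(embeddings)
edges = filter_edges(weights, k=1.0)  # list of (source, target, weight)
```

The resulting weighted edge list, together with the sentence nodes, would form the homogeneous weighted graph passed to a graph neural network classifier. Raising `k` yields a sparser graph.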
Supplementary Material: zip
Primary Area: learning on graphs and other geometries & topologies
Submission Number: 8984