## 📄 Rethinking Graph-Based Document Classification: Learning Data-Driven Structures Beyond Heuristic Approaches

**Authors**: [anonymized list of authors]  

---

### 📝 Abstract

In document classification, graph-based models effectively capture document structure and overcome sequence length limitations, enhancing contextual understanding. However, existing graph document representations often rely on heuristics, domain-specific rules, or expert knowledge. We propose a novel method to learn data-driven graph structures, eliminating the need for manual design and reducing domain dependence. Our approach constructs homogeneous weighted graphs with sentences as nodes, while edges are learned via a self-attention model that identifies dependencies between sentence pairs. A statistical filtering strategy retains only strongly correlated sentences, improving graph quality while reducing the graph size. Experiments on three datasets show that learned graphs consistently outperform heuristic-based baselines and recent small language models, achieving higher accuracy and $F_1$ score. Furthermore, our study demonstrates the effectiveness of the statistical filtering in improving classification robustness, highlighting the potential of automatic graph generation over traditional heuristic approaches and opening new directions for broader applications in NLP.
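
The filtering step described above can be illustrated with a minimal sketch. This is not the repository's implementation. It assumes a row-normalized sentence-to-sentence attention matrix is already available and uses a simple "mean plus `k` standard deviations" rule as a stand-in for the statistical filter, since the abstract does not specify the exact criterion. The function name `filter_attention_edges` and the parameter `k` are hypothetical.

```python
from statistics import mean, pstdev


def filter_attention_edges(attn, k=1.0):
    """Turn a sentence-level attention matrix into a weighted edge list.

    ASSUMPTION: an edge (i, j) is kept when its attention weight exceeds
    mean + k * std of all off-diagonal weights. The actual filtering
    strategy used in the paper may differ.
    """
    n = len(attn)
    # Ignore self-attention (the diagonal) when computing the statistics.
    off_diag = [attn[i][j] for i in range(n) for j in range(n) if i != j]
    if not off_diag:
        return []
    threshold = mean(off_diag) + k * pstdev(off_diag)
    # Nodes are sentence indices; the edge weight is the attention score.
    return [
        (i, j, attn[i][j])
        for i in range(n)
        for j in range(n)
        if i != j and attn[i][j] > threshold
    ]


# Toy attention matrix for a three-sentence document (made-up values).
attn = [
    [0.80, 0.15, 0.05],
    [0.10, 0.30, 0.60],
    [0.05, 0.70, 0.25],
]
edges = filter_attention_edges(attn)
print(edges)  # [(1, 2, 0.6), (2, 1, 0.7)]
```

In this toy example only the two strongest sentence pairs survive, which shows how the filter shrinks the graph while keeping strongly related sentences connected.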

---

### 📁 Repository Structure

```bash
├── config/
│   ├── GNN_config_heuristic_graphs/        # Config files with parameters for training GATs on heuristic-based graphs
│   ├── GNN_config_learned_graphs/          # Config files with parameters for training GATs on learned graphs
│   └── MHAClassifier_config/               # Config files with parameters for training our attention-based model
├── data/                                   # Raw and processed datasets
├── GNN_Results_Classifier/                 # Results obtained from our runs
│   ├── Attention/                          # Results from learned graphs
│   └── Heuristic_uni/                      # Results from heuristic-based graphs
├── imgs/                                   # Figures with adjacency matrix examples for each dataset
├── src/                                    # Source code for training and evaluation
│   ├── data/                               # Data loaders and utilities
│   ├── graphs/                             # Graph-based architectures
│   ├── models/                             # Core training models
│   └── pipeline/                           # Connector for text-graph models
├── README.md
├── requirements.txt                        # Python dependencies
├── train_GNN.py                            # GNN training script
└── train_MHAClassifier.py                  # MHA-based classifier training script
```

### 📊 Datasets
Due to file size restrictions, only the preprocessed versions of the Hyperpartisan News Detection (HND) and BBC News datasets are included in the `data/` folder. The arXiv dataset can be downloaded separately from: [URL](https://url/will/be/released/upon/acceptance).

