Abstract: Textual graphs (TGs), in which nodes correspond to text (sentences or documents), are widely prevalent. Representation learning on TGs involves two stages: \((i)\) \textit{unsupervised feature extraction} and \((ii)\) \textit{supervised graph representation learning}. In recent years, extensive efforts have been devoted to the latter stage, where Graph Neural Networks (GNNs) have dominated. However, for most existing graph benchmarks, the former stage still relies on traditional feature engineering techniques. This motivates us to investigate the outcome of enhancing only the text embeddings used by benchmark models. While advanced text embeddings are expected to boost GNN performance, key questions remain underexplored: the extent of this improvement, and in particular how far advanced text features can carry a \textit{rudimentary} GNN architecture. In this work, we therefore study the impact of enhancing only the benchmark text embeddings and evaluate it on two fundamental graph representation learning tasks: \textit{node classification} and \textit{link prediction}. Through extensive experiments, we show that better text embeddings significantly improve the performance of various GNNs, \textit{especially basic GNN baselines}, on multiple graph benchmarks. Remarkably, when additional supporting text generated by large language models (LLMs) is included, \textit{a simple two-layer GraphSAGE} trained on an ensemble of text embeddings achieves an accuracy of 77.48\% on \texttt{OGBN-Arxiv}, comparable to state-of-the-art (SOTA) performance obtained with far more complicated GNN architectures. We will release our code and generated node features soon.
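To make the described setup concrete, below is a minimal sketch (not the authors' released code) of the two-stage pipeline the abstract outlines: precomputed text embeddings serve as node features for a plain two-layer GraphSAGE. It assumes PyTorch Geometric; the hidden size, dropout rate, and embedding dimensions are placeholders, not the paper's actual settings.

```python
# Hypothetical sketch: a two-layer GraphSAGE over precomputed text embeddings.
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv


class TwoLayerSAGE(torch.nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x holds the node features, e.g. an ensemble (concatenation) of
        # text embeddings from one or more encoders.
        h = F.relu(self.conv1(x, edge_index))
        h = F.dropout(h, p=0.5, training=self.training)
        return self.conv2(h, edge_index)


# Usage (illustrative shapes only):
# x = torch.cat([emb_encoder_a, emb_encoder_b], dim=-1)  # ensemble of text embeddings
# model = TwoLayerSAGE(in_dim=x.size(-1), hidden_dim=256, num_classes=40)
# logits = model(x, edge_index)
```

The point of the sketch is that the GNN itself stays deliberately simple; any gains come from swapping in richer node features rather than from architectural complexity.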
Paper Type: long
Research Area: NLP Applications
Contribution Types: NLP engineering experiment
Languages Studied: English