The Final-Stage Bottleneck: A Systematic Dissection of the R-Learner for Network Causal Inference

TMLR Paper 6646 Authors

25 Nov 2025 (modified: 23 Feb 2026) · Under review for TMLR · CC BY 4.0
Abstract: The R-Learner is a powerful, theoretically grounded framework for estimating heterogeneous treatment effects, prized for its robustness to nuisance model errors. However, its application to network data—where causal heterogeneity may be driven by graph structure—presents critical and underexplored challenges to its core assumption of a well-specified final-stage model. In this paper, we conduct a large-scale, multi-seed empirical study to systematically dissect the R-Learner framework on graphs. Our results suggest that for network-dependent effects, a critical driver of performance is the inductive bias of the final-stage CATE estimator, a factor whose importance can dominate that of the nuisance models. Our central finding is a systematic quantification of a "representation bottleneck": we demonstrate empirically and through a constructive theoretical example that graph-blind final-stage estimators, being theoretically misspecified, exhibit significant underperformance (MSE > 4.0, p < 0.001 across all settings). Conversely, we show that an R-Learner with a correctly specified, end-to-end graph-aware architecture (the "Graph R-Learner") achieves significantly lower error. Furthermore, we provide a comprehensive analysis of the framework's properties. We identify a subtle "nuisance bottleneck" and provide a mechanistic explanation for its topology dependence: on hub-dominated graphs, graph-blind nuisance models can partially capture concentrated confounding signals, while on graphs with diffuse structure, a GNN's explicit aggregation becomes critical. This is supported by our analysis of a "Hub-Periphery Tradeoff," which we connect to the GNN over-squashing phenomenon. Our findings are validated across diverse synthetic and semi-synthetic benchmarks, where the R-Learner framework also significantly outperforms a strong, non-DML GNN T-Learner baseline.
We release our code as a comprehensive and reproducible benchmark to facilitate future research on this critical "final-stage bottleneck."
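To make the R-Learner's two-stage "residual-on-residual" structure concrete, the following is a minimal NumPy sketch of the framework on tabular data. It is illustrative only, not the paper's Graph R-Learner: it uses oracle nuisance functions $m(x) = E[Y \mid X]$ and $e(x) = E[T \mid X]$ in place of fitted models, and a constant CATE so the final stage has a closed form. All variable names and the simulated data-generating process are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated confounded data: X drives both treatment propensity and outcome.
x = rng.normal(size=n)                       # confounder
e = 1.0 / (1.0 + np.exp(-x))                 # true propensity e(x) = E[T|X]
t = rng.binomial(1, e)                       # binary treatment
tau_true = 2.0                               # constant treatment effect
y = x + tau_true * t + rng.normal(size=n)    # outcome

# Stage 1 (nuisances): oracle versions for clarity.
# m(x) = E[Y|X] = x + tau * e(x) under this DGP.
m = x + tau_true * e
y_res = y - m            # outcome residual
t_res = t - e            # treatment residual

# Stage 2: regress outcome residuals on treatment residuals.
# For a constant CATE, the R-Learner objective has a closed-form solution.
tau_hat = np.sum(t_res * y_res) / np.sum(t_res ** 2)
print(tau_hat)
```

In the paper's setting, the final-stage regression of `y_res` on `t_res` would instead be parameterized by a graph-blind MLP or a graph-aware GNN over node features, which is exactly where the "representation bottleneck" the abstract describes arises.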
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: To ensure the paper is accessible to a broad machine learning audience, we have implemented the following structural updates:

- DML and R-Learner Primer: We have expanded Section 2.1 to include an intuitive primer on the Double/Debiased Machine Learning (DML) framework. We explain the "residual-on-residual" logic as a method to isolate the clean causal signal by "subtracting out" the confounding captured by the nuisance models.
- Model Renaming: Following the suggestion of Reviewer 8wEX, we have renamed the "Sanity Check (MLP+GNN)" model to the Hybrid R-Learner throughout the manuscript. This reflects its architecture (graph-blind nuisance models paired with a graph-aware final stage) while clarifying its role in isolating the nuisance bottleneck.
- Glossary of Terms: We have explicitly defined all abbreviations at their first occurrence: DGP (Data-Generating Process), HTE (Heterogeneous Treatment Effects), GCN (Graph Convolutional Network), GAT (Graph Attention Network).

Statistical Methodology and Rigor: We have added a Technical Appendix to clarify the robustness of our statistical claims:

- Paired t-tests: The reported $p$-values are derived from paired (relational) t-tests (ttest_rel) comparing model Mean Squared Errors (MSEs) across 30 seeds.
- Seed Independence: Each seed represents a fully independent experimental replication. For every seed, the entire environment—including node features ($X$), topology ($edge\_index$), and treatment/outcome assignments ($T, Y$)—is re-simulated from scratch.
- Hyperparameter Fairness: To ensure no bias toward our proposed model, all estimators were trained using identical hyperparameters and fixed optimization budgets loaded from the same configuration file.
- Multiple Comparisons: Given the extreme significance levels observed ($p = 5.43 \times 10^{-17}$), the findings remain statistically significant even under conservative multiple-comparison adjustments such as the Bonferroni correction.
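The paired-comparison methodology described above can be sketched with `scipy.stats.ttest_rel`. The per-seed MSE arrays below are hypothetical placeholders (the paper's actual values are not reproduced here); the point is only the pairing structure: because both models are evaluated on the identical re-simulated environment for each seed, the seed is the pairing unit.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(42)
n_seeds = 30

# Hypothetical per-seed MSEs for two estimators evaluated on the
# same 30 independently re-simulated environments (one per seed).
mse_graph_blind = 4.0 + 0.3 * rng.standard_normal(n_seeds)
mse_graph_aware = 1.0 + 0.2 * rng.standard_normal(n_seeds)

# Paired (relational) t-test: tests whether the mean per-seed
# difference in MSE is zero.
t_stat, p_value = ttest_rel(mse_graph_blind, mse_graph_aware)
print(t_stat, p_value)
```

A Bonferroni adjustment for multiple comparisons amounts to multiplying `p_value` by the number of tests performed (or, equivalently, dividing the significance threshold), which is why very small raw $p$-values survive the correction.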
We have also added mechanistic and empirical clarifications throughout the manuscript.
Assigned Action Editor: ~Jiwei_Zhao1
Submission Number: 6646