Cross-validation for Geospatial Data: Estimating Generalization Performance in Geostatistical Problems

Published: 04 Oct 2023, Last Modified: 04 Oct 2023. Accepted by TMLR.
Abstract: Geostatistical learning problems are frequently characterized by spatial autocorrelation in the input features and/or the potential for covariate shift at test time. These realities violate the classical assumption of independent, identically distributed data, upon which most cross-validation algorithms rely in order to estimate the generalization performance of a model. In this paper, we present a theoretical criterion for unbiased cross-validation estimators in the geospatial setting. We also introduce a new cross-validation algorithm to evaluate models, inspired by the challenges of geospatial problems. We apply a framework for categorizing problems into different types of geospatial scenarios to help practitioners select an appropriate cross-validation strategy. Our empirical analyses compare cross-validation algorithms on both simulated and several real datasets to develop recommendations for a variety of geospatial settings. This paper aims to draw attention to some challenges that arise in model evaluation for geospatial problems and to provide guidance for users.
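The abstract notes that spatial autocorrelation breaks the i.i.d. assumption behind ordinary K-fold cross-validation. One common remedy in this setting is "blocked" spatial cross-validation, where points are grouped into geographic blocks so that training and test folds are spatially separated. The sketch below illustrates this idea on synthetic data using scikit-learn's `GroupKFold`; it is not the algorithm proposed in the paper, and all dataset and grid choices here are hypothetical.

```python
# Illustrative sketch (not the paper's proposed method): spatial block CV.
# Points are binned into grid blocks; each block is a CV group, so train and
# test folds are geographically separated, reducing the optimistic bias that
# plain K-fold incurs under spatial autocorrelation.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 500
coords = rng.uniform(0, 10, size=(n, 2))   # synthetic point locations
X = rng.normal(size=(n, 3))                # synthetic features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Assign each point to a block on a 5x5 spatial grid (block side length 2).
blocks = (coords[:, 0] // 2).astype(int) * 5 + (coords[:, 1] // 2).astype(int)

errors = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=blocks):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(f"Blocked-CV mean MSE over 5 folds: {np.mean(errors):.3f}")
```

The key design choice is that entire blocks, never individual points, move between folds; this prevents near-duplicate neighboring observations from appearing on both sides of a split.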
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We believe we have addressed all of the requested changes. While not a specifically requested change, there were comments about the length of the paper, which we agree is (still) long. If the reviewers deem it critical to cut length, we can move much of Section 7.1.1 to the supplemental material.

Here is a list of the changes:

In the 1st revision (submitted on July 9th):
1. In Section 4 (page 6), we added more details about the derivation from Eqn. 3 to Eqn. 4.
2. In Section 5.2 (page 8), we created Table 2 summarizing the characteristics of all five cross-validation methods and their intended scenarios.
3. In Section 8 (page 22), we added more discussion about the p-value of the Cramer test in practice.
4. In Section 2 (page 4), we added more background about RuLSIF.
5. In Section 8 (page 22), we tied the recommendations more clearly to the scenarios.
6. In Section 7 (page 9), we added how we quantify the ground-truth test errors for the real datasets.
7. In Section 1 (page 2), we highlighted the bird SDM use case.
8. We restated the acronyms of the various cross-validation methods throughout, e.g., in Sections 5.2, 7.1.2, and 8.

In the 2nd revision (submitted on July 13th):
9. In Section 4 (page 6), we elaborated on both the LHS and RHS derivations, and emphasized that $T_j$ is outside $T$.

In the CRV version (submitted on Sep. 18th):
10. Added a link to the GitHub repository that hosts the code and data for this paper.
11. Moved a few figures and tables to convenient places for easy reference.
12. Fixed typos and grammar mistakes.
13. Added acknowledgements.
14. De-anonymized the authors.
Assigned Action Editor: ~Mauricio_A_Álvarez1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1149