Investigating the impact of missing value handling on Boosted trees and Deep learning for Tabular data: A Claim Reserving case study

TMLR Paper 3193 Authors

15 Aug 2024 (modified: 27 Nov 2024) · Decision pending for TMLR · License: CC BY 4.0
Abstract: While deep learning (DL) performance is exceptional for many applications, there is no consensus on whether DL or gradient boosted decision trees (GBDTs) are superior for tabular data. We compare TabNet (a DL model for tabular data), two simple neural networks inspired by ResNet (a DL model), and CatBoost (a GBDT model) on a large UK insurer dataset for the task of claim reserving. This dataset contains a large proportion of informative missing values. We use this application to shed light on the impact of missing value handling on accuracy. Under certain missing value schemes, a carefully optimised simple neural network performed comparably to CatBoost with default settings. However, with less-than-minimum imputation, CatBoost with default settings substantially outperformed the carefully optimised DL models, achieving the best overall accuracy. We conclude that handling missing values is an important, yet often overlooked, step when comparing DL and GBDT algorithms for tabular data.
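To make the comparison concrete, the minimal sketch below illustrates the two missing-value strategies mentioned in the abstract: less-than-minimum imputation applied before model fitting versus passing NaNs directly to CatBoost with default settings. The column names, offset value, and toy data are hypothetical, for illustration only; this is not the authors' pipeline.

```python
# Minimal sketch (assumptions: hypothetical column names, offset, and toy data).
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor


def less_than_minimum_impute(df: pd.DataFrame, cols, offset: float = 1.0) -> pd.DataFrame:
    """Replace NaNs in each listed column with a value below the observed minimum,
    so that 'missing' occupies its own distinguishable region of the feature space."""
    out = df.copy()
    for col in cols:
        fill_value = out[col].min() - offset  # min() skips NaNs by default
        out[col] = out[col].fillna(fill_value)
    return out


# Hypothetical claims data with informative missingness.
train = pd.DataFrame({
    "claim_age_days":  [10, 250, np.nan, 30, np.nan],
    "reported_amount": [1200.0, np.nan, 430.0, np.nan, 55.0],
    "paid_to_date":    [900.0, 15000.0, 400.0, 2500.0, 50.0],
})
X, y = train.drop(columns="paid_to_date"), train["paid_to_date"]

# Strategy 1: impute below the minimum before fitting any model (NN or GBDT).
X_imputed = less_than_minimum_impute(X, cols=["claim_age_days", "reported_amount"])

# Strategy 2: pass NaNs straight to CatBoost with default settings, which
# handles missing numeric values internally.
model = CatBoostRegressor(iterations=100, verbose=False)
model.fit(X, y)
```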
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
# Rebuttal update
We again thank all of the reviewers for their time and insightful comments. We have provided an updated manuscript with new text in blue and removed text in red with a strike-through font. We have also added line numbers to aid any further discussion of specific wording. All cosmetic formatting (colours, strikethrough, and line numbers) will of course be removed for the camera-ready version should the paper be accepted.

Our main changes are:
- Expansion of the results to include mean imputation and a replica of the RTDL ResNet model [1].
- Expansion of the discussion in light of the new results.
- Fixed rendering of the discussion of training speeds and relocated it to Appendix A.2.
- Inclusion of new appendices:
  i) Appendix A.3: detailed differences between our ResNet and that of Gorishniy et al. [1]
  ii) Appendix A.4: using Optuna HPO instead of grid search
  iii) Appendix A.5: choosing to perform HPO independently of the imputation scheme
  iv) Appendix A.6: using TabZilla to analyse more datasets
We have also reworded for clarity throughout the manuscript.

[1] Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data. Advances in Neural Information Processing Systems, 34:18932–18943, 2021.
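As a rough illustration of the hyperparameter optimisation change noted above (Appendix A.4), the sketch below shows how an Optuna study can stand in for a grid search over CatBoost hyperparameters. The search space, objective, and parameter ranges are assumptions chosen for illustration, not the authors' exact configuration.

```python
# Minimal Optuna HPO sketch (assumed search space and objective, for illustration).
import optuna
from catboost import CatBoostRegressor
from sklearn.model_selection import cross_val_score


def objective(trial, X, y):
    params = {
        "depth": trial.suggest_int("depth", 4, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1.0, 10.0),
        "iterations": 500,
        "verbose": False,
    }
    model = CatBoostRegressor(**params)
    # Negative MSE so that larger is better under direction="maximize".
    return cross_val_score(model, X, y, cv=3, scoring="neg_mean_squared_error").mean()


# Example usage (with a feature matrix X and target y):
# study = optuna.create_study(direction="maximize")
# study.optimize(lambda trial: objective(trial, X, y), n_trials=50)
```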
Assigned Action Editor: ~Dennis_Wei1
Submission Number: 3193