Investigating the impact of missing value handling on Boosted trees and Deep learning for Tabular data: A Claim Reserving case study

Published: 03 Jan 2025, Last Modified: 03 Jan 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: While deep learning (DL) performance is exceptional for many applications, there is no consensus on whether DL or gradient boosted decision trees (GBDTs) are superior for tabular data. We compare TabNet (a DL model for tabular data), two simple neural networks inspired by ResNet (a DL model), and CatBoost (a GBDT model) on a large UK insurer dataset for the task of claim reserving. This dataset is of particular interest for its large amount of informative missing values, which are not missing completely at random, highlighting the impact of missing value handling on accuracy. Under certain missing value schemes, a carefully optimised simple neural network performed comparably to CatBoost with default settings. However, using less-than-minimum imputation, CatBoost with default settings substantially outperformed carefully optimised DL models, achieving the best overall accuracy. We conclude that handling missing values is an important, yet often overlooked, step when comparing DL to GBDT algorithms for tabular data.
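The less-than-minimum imputation mentioned in the abstract replaces each missing value with a number strictly below the column's observed minimum, so that "missing" occupies its own separable region that a tree split can isolate. The paper does not specify an exact implementation; the sketch below is a minimal pandas version in which the function name and the offset of 1.0 are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def less_than_min_impute(df: pd.DataFrame, offset: float = 1.0) -> pd.DataFrame:
    """Fill NaNs in each numeric column with (column minimum - offset).

    Illustrative sketch only: the offset value is an assumption, not
    taken from the paper.
    """
    out = df.copy()
    for col in out.select_dtypes(include=[np.number]).columns:
        col_min = out[col].min(skipna=True)  # ignore NaNs when finding the minimum
        out[col] = out[col].fillna(col_min - offset)
    return out

# Example: a claim feature with one informative missing entry
df = pd.DataFrame({"sum_insured": [100.0, np.nan, 250.0]})
print(less_than_min_impute(df)["sum_insured"].tolist())  # [100.0, 99.0, 250.0]
```

Because the filled value lies outside the observed range, a GBDT can route all originally-missing rows down one branch with a single threshold, which is one plausible reason this scheme pairs well with CatBoost in the study.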
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Preparation of camera-ready version. Includes:
- Removal of revision formatting
- Additional emphasis in the Introduction and Abstract on the not-MCAR value of the dataset
- Additional pointer to the Tabzilla MCAR replication in the Conclusion
- Additional experiments on combining Binarize and LT Min Impute as "missingness masks" in Appendix A.7, referenced in Section 4.2.1
Assigned Action Editor: ~Dennis_Wei1
Submission Number: 3193