Riemann-Lebesgue Forest for Regression

TMLR Paper 4191 Authors

12 Feb 2025 (modified: 04 Jun 2025) · Decision pending for TMLR · CC BY 4.0
Abstract: We propose a novel ensemble method called Riemann-Lebesgue Forest (RLF) for regression. The core idea of RLF is to mimic the way a measurable function can be approximated by partitioning its range into a few intervals. With this idea in mind, we develop a new tree learner named Riemann-Lebesgue Tree (RLT), which has a chance to perform a ``Lebesgue'' type cut, i.e., splitting certain non-terminal nodes on the response Y. In other words, we introduce ``splitting type randomness'' into the training of our ensemble method. Since the information of Y is unavailable at prediction time, weak local models such as small random forests or decision trees are fitted at non-terminal nodes with ``Lebesgue'' type cuts to determine which child node to proceed to. We show that the optimal ``Lebesgue'' type cut yields a larger variance reduction in the response Y than an ordinary CART cut (an analogue of a Riemann partition) when fitting a base tree. This property benefits the ensemble part of RLF, which is verified by extensive experiments. We also establish the asymptotic normality of RLF under different parameter settings. Two one-dimensional examples are provided to illustrate the flexibility of RLF. The competitive performance of RLF with small local random forests against the original random forest (RF) and boosting methods such as XGBoost is demonstrated by extensive experiments on simulated data and real-world datasets. Additional experiments further show that RLF with local decision trees can achieve performance comparable to that of RF with less running time, especially on large datasets.
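To make the abstract's core mechanism concrete, here is a minimal sketch, not the authors' implementation, of a ``Lebesgue'' type cut: the node's samples are partitioned by thresholding the response Y so as to maximize variance reduction in Y, and a weak local model is fitted on the covariates to reproduce that routing at prediction time, when Y is unavailable. The function name `lebesgue_split`, the quantile grid for candidate thresholds, and the choice of a depth-2 `DecisionTreeClassifier` as the local router are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def lebesgue_split(X, y, max_depth=2):
    """Hypothetical sketch of a 'Lebesgue' type cut: split the node by
    thresholding the response y (i.e., partitioning its range), then fit
    a small local model on X to route points when y is unobserved."""
    # Pick the threshold on y that maximizes variance reduction in y,
    # scanning a small grid of y-quantiles as candidate cut points.
    best_t, best_gain = None, -np.inf
    base_sse = np.var(y) * len(y)
    for t in np.quantile(y, np.linspace(0.1, 0.9, 9)):
        left, right = y[y <= t], y[y > t]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = base_sse - (np.var(left) * len(left) + np.var(right) * len(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    if best_t is None:  # degenerate node: y is constant, no useful cut
        return None, None
    # Child labels encode the y-based partition; the weak local model
    # learns to reproduce this routing from the covariates alone.
    labels = (y > best_t).astype(int)
    router = DecisionTreeClassifier(max_depth=max_depth).fit(X, labels)
    return best_t, router
```

At prediction time, an unseen point x would be sent to the child indicated by `router.predict(x)`, mirroring the paper's use of small random forests or decision trees as local models at nodes with ``Lebesgue'' type cuts.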
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
* Removed or weakened the overstated claims. For example, we removed statements such as "...Riemann-Lebesgue Forest (RLF), which has superior performance to the ordinary Random Forest in the regression task", as pointed out by the reviewer on page 1.
* Emphasized the utility of the local models in Algorithm 1.
* Rephrased the statements about the performance of RLF by mentioning the use of the local models throughout the article.

All changes are highlighted in red.
Assigned Action Editor: ~Benjamin_Guedj1
Submission Number: 4191