Abstract: Decision trees are a popular classification model in machine learning due to their interpretability and performance. Decision-tree classifiers are traditionally constructed using greedy heuristic algorithms that do not provide guarantees regarding the quality of the resultant trees. In contrast, a recent line of work employed exact optimization techniques to construct optimal decision-tree classifiers. However, most of these approaches are designed for datasets with binary features. While numeric and categorical features can be transformed into binary features, this transformation can introduce a large number of binary features and may not be efficient in practice. In this work, we present a SAT-based encoding for decision trees that directly supports non-binary data and use it to solve two well-studied variants of the optimal decision tree problem. Furthermore, we extend our approach to support cost-sensitive learning of optimal decision trees and introduce tree pruning constraints to reduce overfitting. We perform extensive experiments based on real-world and synthetic datasets that show that our approach obtains superior performance to state-of-the-art exact techniques on non-binary datasets and has significantly smaller memory consumption. We also show that our extension for cost-sensitive learning and our tree pruning constraints can help improve the prediction quality on unseen test data.
0 Replies
Loading