
\documentclass[10pt]{article} % For LaTeX2e
% \usepackage{tmlr}
% If accepted, instead use the following line for the camera-ready submission:
\usepackage[accepted]{tmlr}
% To de-anonymize and remove mentions to TMLR (for example for posting to preprint servers), instead use the following:
% \usepackage[preprint]{tmlr}

% Optional math commands from https://github.com/goodfeli/dlbook_notation.
\input{math_commands.tex}

\usepackage{hyperref}
\usepackage{url}


\usepackage{booktabs}         % professional-quality tables
\usepackage{multirow}         % 

\usepackage[linesnumbered,ruled,vlined]{algorithm2e}

\usepackage[width=\textwidth, font=small]{caption}     %control caption width
% tiny, scriptsize, footnotesize, small, normalsize, large, Large, LARGE, huge, Huge
\usepackage{adjustbox}        % scale the table to a specific width
\usepackage{longtable}
\usepackage{tabularx}         % tabular with paragraph-like column
\usepackage{xltabular}        % long tables with auto-width columns that can span multiple pages
% \usepackage{xfrac}            % \sfrac command


\title{Optimal Pattern Detection Tree for Symbolic Rule-Based Classification}

% Authors must not appear in the submitted version. They should be hidden
% as long as the tmlr package is used without the [accepted] or [preprint] options.
% Non-anonymous submissions will be rejected without review.

\author{\name Young-Chae Hong \email hongych@amazon.com \\
      \addr Amazon\\
      Seattle, USA
      \AND
      \name Yangho Chen \email yanghoc@amazon.com \\
      \addr Amazon\\
      Seattle, USA}

% The \author macro works with any number of authors. Use \AND 
% to separate the names and addresses of multiple authors.

\newcommand{\fix}{\marginpar{FIX}}
\newcommand{\new}{\marginpar{NEW}}

\def\month{04}  % Insert correct month for camera-ready version
\def\year{2026} % Insert correct year for camera-ready version
\def\openreview{\url{https://openreview.net/forum?id=RJ6eMDcDCv}} % Insert correct link to OpenReview for camera-ready version


\begin{document}

\maketitle

\begin{abstract}

Pattern discovery in data plays a crucial role across diverse domains, including healthcare, risk assessment, and machinery maintenance. In contrast to black-box deep learning models, symbolic rule discovery emerges as a key data mining task, generating human-interpretable rules that offer both transparency and intuitive explainability. This paper introduces the Optimal Pattern Detection Tree (OPDT), a rule-based machine learning model based on novel mixed-integer programming to discover a single optimal pattern in data through binary classification. To incorporate prior knowledge and compliance requirements, we further introduce the Branching Structure Constraints (BSC) framework, which enables decision makers to encode domain knowledge and constraints directly into the model. This optimization-based approach discovers a hidden underlying pattern in datasets, when it exists, by identifying an optimal rule that maximizes coverage while minimizing the false positive rate due to misclassification. Our computational experiments show that OPDT discovers a pattern with optimality guarantees on moderately sized datasets within reasonable runtime.

\end{abstract}

\section{Introduction}

Data mining is the process of discovering and extracting hidden patterns from datasets to gain valuable insights and support decision-making. In particular, symbolic rule discovery is an important data mining task that generates human-interpretable rules in a natural and intuitive manner. However, recent developments in artificial intelligence (AI) and machine learning (ML) have driven attention toward black-box models, particularly deep learning approaches. This trend has led, in turn, to a growing demand for developing algorithms that inherently learn interpretable white-box models, especially in high-stakes domains such as healthcare, personalized medicine, criminal justice, and financial risk assessment, where decisions can significantly impact human lives \citep{rudin2019stop, rudin2022interpretable}. The exponential growth in healthcare data collection, driven by advanced monitoring technologies and digital health systems, has produced biological and medical data at unprecedented speed and scale \citep{luo2016big}. While the tremendous amounts of data collected from electronic health records or monitoring devices can be utilized for more effective and enhanced clinical decision making \citep{rea2012building}, they pose significant challenges in efficiently analyzing and extracting useful information and actionable insights \citep{najarian2013biomedical}.

Over the last decade, various pattern recognition techniques have been developed for medical decision support systems to help clinicians effectively utilize the overwhelming amount of healthcare data. These techniques are applied to biomedical data for automated clinical diagnosis or therapeutic support, integrating data-driven knowledge and patient-specific information to enhance cost-effective healthcare delivery \citep{moja2014effectiveness}. The development of novel pattern recognition methods and algorithms with high performance, in terms of accuracy and time complexity, improves healthcare delivery by allowing clinicians to make a better-informed and timely decision \citep{asgari2019pattern}. To take reliable actions in high-stakes domains, the patterns extracted from data must be understood by human domain experts, unlike the opaque solutions provided by black-box ML approaches \citep{rudin2019stop}. If domain experts do not fully comprehend the reasoning behind these patterns, there is a risk of making misinformed or potentially harmful decisions in life-threatening situations. Therefore, developing human-readable rules is crucial for addressing emerging challenges in critical decision-making processes and ensuring accountability, transparency, and trust in automated systems. 

Rule-based machine learning (RBML) models offer a promising solution by combining the interpretability of traditional rule-based systems with the automated learning capabilities of ML approaches. Unlike conventional black-box ML models, RBML generates explicit decision rules that can be readily understood and validated by domain experts. Furthermore, for high-risk tasks requiring accountability, not only is interpretability crucial, but models must also comply with domain expert-specified constraints to be trustworthy \citep{nanfack2022constraint}. For example, learned patterns may not make sense from a medical or clinical perspective, as algorithms typically consider only information extracted from medical datasets without incorporating domain knowledge \citep{lopez2007increasing}. Therefore, ML models can become more trustworthy and reliable if they incorporate additional domain knowledge in the form of constraints. However, existing RBML methods are designed to extract a ruleset (i.e., a collection of multiple rules) rather than a single optimal pattern in data. In many real-world scenarios, such as an epidemic outbreak, credit fraud, or intrusion detection, positive samples originate from a single underlying cause, and a single high-precision rule is both sufficient and more actionable than a complex ruleset. To the best of our knowledge, no prior work has addressed the extraction of a single optimal pattern.

This paper proposes a novel algorithm to extract a single rule through binary classification that adheres to user-defined structural constraints, ensuring high interpretability while incorporating prior domain knowledge. We develop the Optimal Pattern Detection Tree (OPDT) based on optimal decision trees with \emph{branching structure constraints} that control: 1) the tree's topology through split decisions, and 2) the feature group assignments for each branch node. This optimization-based rule-extraction approach discovers a hidden underlying pattern in datasets, when it exists, by identifying an optimal rule that maximizes coverage while minimizing the false positive rate due to misclassification. The remainder of the paper is structured as follows. In Section 2, we discuss related work on interpretable RBML and optimal decision trees (ODT). In Section 3, we present our proposed approach, OPDT, to extract an optimal pattern from a dataset. In Section 4, we demonstrate the performance of OPDT on real-world datasets from the UCI repository and compare it with other RBML methods.

\section{Related Literature}

\subsection{Rule-Based Machine Learning}

Among advanced AI and ML technologies, the ability to extract valuable insights from complex datasets has become crucial for various fields, including healthcare, criminal justice, and financial risk assessment. However, the lack of interpretability in ML models can potentially lead to adverse or even life-threatening consequences \citep{ahmad2018interpretable}. There is a growing demand for interpretable ML that allows end-users to understand the logic and reasoning behind predictions, especially in domains such as healthcare and criminal justice systems. This demand for improved interpretability has also increased to mitigate the risk of making unjustifiable decisions. In particular, studies in domains like healthcare and criminal justice have revealed that ML systems can systematically exhibit unfair biases \citep{burrell2016machine}. Thus, interpretability is needed to ensure such systems are free from bias \citep{hajian2016algorithmic}. Recent research has focused on developing explainable or interpretable AI models \citep{murdoch2019definitions, vilone2021notions} because model interpretability and explainability are crucial for clinical and healthcare practices \citep{ahmad2018interpretable, kolyshkina2021interpretability}, where misclassification costs can be potentially high. To address this, data mining techniques have emerged to discover interpretable rules by generating symbolic rules that humans can readily understand. By providing understandable patterns or rules from training data, we can involve human experts such as doctors or nurses, allowing them to review AI model results and provide feedback or corrections. Consequently, this approach can improve the reliability of life-critical decision-making in healthcare.

Unlike deep learning-based models that often operate as black boxes, RBML refers to algorithms that induce a set of symbolic rules from training data. RBML automatically identifies regularities or patterns that can be expressed in the form of ``IF-THEN'' statements. Here, the left-hand side (``IF'' part) is called the rule antecedent or condition, which can be expressed with logical operators like disjunction, conjunction, and negation. The right-hand side (``THEN'' part) of a rule is called the rule consequent or conclusion. In particular, rule conditions represented in disjunctive normal form (DNF: OR-of-ANDs) or conjunctive normal form (CNF: AND-of-ORs) are intuitively understandable to humans as they mirror natural reasoning processes. For example, a medical diagnosis rule might state: ``IF (patient age $>$ 65 AND white blood cell count $>$ 11,000) OR (body temperature $>$ 100 AND patient has chest pain) THEN (further diagnostic tests for potential infection)''. A natural choice for interpretability is therefore to represent ML output with these logical operators in the IF-THEN structure, which can be expressed through rule sets, decision trees, and decision lists. RBML's full transparency and intuitive explainability make it particularly valuable in applications where decision-making processes must be clear and explainable. The principles of RBML are most commonly implemented in the form of rule-based classifiers in modern ML applications. Rule-based classifiers can be seen as a generalization of decision tree classifiers since the obtained rules do not need to be represented in the form of a tree, thus being more flexible \citep{palliser2021rrules}. Unlike trees, DNF rules need not be mutually exclusive. These rules are potentially more compact and predictive than trees. Each induced rule uses a conjunctive antecedent: ``IF condition\_01 AND condition\_02 AND ... THEN conclusion''. 
Each individual condition, also called a selector, is formed by an attribute-value pair, where the pair refers to a specific feature of the data and its corresponding value. Rule-based classifiers have already demonstrated successful application across diverse domains, including medical diagnosis \citep{asgari2019pattern}, financial fraud detection \citep{ali2022financial}, intrusion detection \citep{lee1999data}, and machine failure diagnosis in manufacturing systems \citep{jiang2009large}.
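
As an illustration in symbols (the notation is introduced only for this example and is not part of the formulation that follows), the diagnosis rule above is a DNF condition with two conjunctive clauses:
\begin{equation*}
\bigl(\text{age} > 65 \,\wedge\, \text{WBC} > 11{,}000\bigr) \;\vee\; \bigl(\text{temperature} > 100 \,\wedge\, \text{chest pain}\bigr) \;\Longrightarrow\; \text{further diagnostic tests}.
\end{equation*}
Likewise, a single conjunctive rule with hypothetical selectors $s_{1}, \dots, s_{m}$, each mapping a sample $x$ to $\{0, 1\}$, applies to $x$ exactly when $\prod_{j=1}^{m} s_{j}(x) = 1$, that is, when every selector is satisfied.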

Research on RBML has introduced various approaches to extract rules from data. Rules Extraction System (RULES) \citep{pham1995rules} is a simple algorithm for extracting a set of classification rules from a set of training instances given a set of classes. It follows a general-to-specific approach and enforces perfect precision (100\%) in the generated rules unless there are inconsistencies in the data. RRULES \citep{palliser2021rrules} was proposed as an optimization of RULES by focusing on two key points: the mechanism to detect irrelevant rules and the stopping condition. PRISM \citep{cendrowska1988prism} introduced a unique approach of generating rules by selecting examples of a specific class and iteratively adding conditions until the obtained rule has perfect precision. As a non-ordered and non-incremental algorithm, PRISM builds rules from general to specific patterns, with rule order being irrelevant for predictions. Interpretable Decision Sets (IDS) \citep{lakkaraju2016interpretable} presented a method for generating decision sets, which are sets of independent if-then rules. Since each rule can be applied independently, decision sets are simple, concise, and easily interpretable compared to a decision list. Bayesian Rule Sets (BRS) \citep{wang2017bayesian} introduced a novel Bayesian framework for learning rule sets that uses a generative model to incorporate prior knowledge about interpretable rules while maintaining strong predictive performance. Incremental Reduced Error Pruning (IREP) \citep{furnkranz1994incremental} addressed the computational inefficiencies of standard reduced error pruning in rule learning by integrating the pruning process directly into rule learning rather than as a post-processing step. This innovation laid the foundation for RIPPER \citep{cohen1995fast}, which demonstrated that incremental pruning could achieve comparable or better results than standard pruning while being substantially more efficient. 
RuleFit \citep{friedman2008predictive} extracts rules from an ensemble of trees by automatically detecting interaction effects in the form of decision rules and builds a weighted combination of these rules using L1-regularized optimization over the weights \citep{friedman2004gradient}. SkopeRules \citep{nicolas_goix_2020_4316671} is based on RuleFit's approach, utilizing a Random Forest model to fit class labels. The two methods differ only in their rule pruning strategy: RuleFit uses a linear model whereas SkopeRules heuristically deduplicates overlapping rules.

\subsection{Optimal Classification}

In recent years, researchers have increasingly applied discrete optimization techniques to learn optimal decision trees in ML problems. The problem of learning an optimal decision tree is NP-hard \citep{laurent1976constructing}, which has led to the widespread adoption of greedy heuristics such as Classification and Regression Trees (CART) \citep{breiman1984classification}, which constructs univariate classification trees. However, over the last decades, there has been an overall 800 billion-fold speedup in the computational power of optimization solvers \citep{bertsimas2017optimal}. This astonishing increase in optimization solver performance has made it possible to apply modern mixed-integer optimization (MIO) methods to solve optimal decision trees. Overviews of recent developments in optimization techniques for ML are presented by \citet{bottou2018optimization} and \citet{gambella2021optimization}.

Recently, \citet{bertsimas2017optimal} introduced the optimal classification tree (OCT), a mixed-integer programming (MIP) formulation to learn optimal decision trees given a fixed depth. The constraints and variables of the formulation can be decomposed into three sets: those defining the structure of the tree through split decisions at each node, those routing data samples from root to leaf nodes, and those counting misclassifications. The objective function balances minimizing training set misclassifications against tree complexity to preserve interpretability \citep{ales2024new}. The MIP model supports both axis-aligned and oblique splits with exponential complexity in tree depth, and can handle both continuous and categorical features. \citet{gunluk2021optimal} proposed an alternative formulation for optimal decision trees that specializes in categorical features. Their formulation exploits the combinatorial structure of categorical variables, enabling branching on subsets of categorical feature values. However, both models can easily become intractable as the size of the training dataset grows. To address this issue, \citet{verwer2019learning} proposed BinOCT, which represents decision thresholds through binary encoding, drastically reducing the number of decision variables. The number of decision variables is then largely independent of the training set size and depends only logarithmically on the number of unique feature values. To further improve scalability, \citet{firat2020column} introduced a new formulation based on root-to-leaf paths for fixed-depth trees, which is solved using a column generation-based heuristic. Finally, \citet{aghaei2024strong} introduced a flow-based MIP formulation, flowOCT, that represents optimal classification trees as a maximum flow problem. The flowOCT formulation avoids big-M constraints, resulting in stronger linear programming (LP) relaxations compared to previous OCT models. However, it only works with binary features.

In addition to classification decision trees, recent work has explored discrete optimization techniques to learn boolean rules and rule sets. \citet{su2016learning} developed an integer programming (IP) formulation for interpretable two-level boolean rules, expressed in either CNF or DNF, with a predefined maximum number of clauses, enabling controlled model complexity and enhanced transparency in classification tasks for high-stakes domains such as healthcare and law. \citet{lawless2023interpretable} proposed an IP framework using column generation to learn sparse and interpretable boolean rule sets in DNF while incorporating fairness constraints.

\section{Methods}

This section presents a structure-constrained decision tree approach to identify patterns in datasets while adhering to prior domain knowledge and structural constraints defined by domain experts. In rule-based classification, any pattern or rule can be expressed through a nested if-then statement since its logical structure inherently consists of conditions (IF clause) and consequences (THEN clause). Decision trees, in turn, extract meaningful patterns from data and transform them into interpretable rules by linearizing conditions along paths from root to leaf nodes. Therefore, we utilize a decision tree approach that discovers classification patterns through nested if-then-else structures, which cover all possible combinations of conditions, including negation scenarios.

Most business domains have domain-specific prior knowledge about their data. For example, in medical diagnosis, physicians know that elevated blood pressure and high cholesterol levels are key risk factors for cardiovascular disease. Similarly, in diabetes prediction, medical experts rely on crucial diagnostic indicators such as body mass index (BMI), fasting blood glucose levels, and age. Furthermore, domain experts recognize that features can be naturally grouped based on their relevance. In cardiovascular disease assessment, for instance, blood test results (cholesterol levels, triglycerides, and blood glucose) form one group, while vital signs (blood pressure, heart rate, and body temperature) form another. Even when domain knowledge is unavailable, various ML methods can extract prior information about features. For example, SHAP (SHapley Additive exPlanations) is a model-agnostic method that explains the contribution of each feature to model outputs \citep{lundberg2017unified}. Similarly, Boruta is a feature selection algorithm that identifies which features are statistically significant or relevant to the outcome variable \citep{kursa2010feature}. Additionally, feature importance in decision trees quantifies the relative contribution of each input feature to the model's predictions. Therefore, we can reasonably assume two types of prior knowledge: 1) feature groupings based on feature correlation and 2) feature impact on the target variable based on feature importance. Given the prior knowledge, we propose a structure-constrained decision tree to identify optimal patterns in data. This paper develops an optimization-based model that utilizes prior information from either domain experts or ML algorithms.

\subsection{Structure Constrained Decision Tree}
\label{sec:SCDT}
A structure-constrained decision tree (SCDT) refers to a decision tree where the learning process is guided and restricted by specific constraints imposed on its structure. These constraints aim to improve the interpretability, fairness, compliance, or other desired properties of the resulting tree. In pattern discovery, we incorporate prior domain knowledge into the decision tree through structural constraints, following three steps to detect patterns in data. First, given the prior knowledge from domain experts or feature relevance determined by ML algorithms, we define a feature group as a subset of features that captures a distinct semantic or conceptual aspect of the classification problem, where feature groups are not required to be disjoint and may overlap. For instance, in diabetes diagnosis, laboratory measurements (fasting blood glucose, HbA1c levels) and patient characteristics (age, BMI, family history) can be formed into separate groups, reflecting their distinct roles in diagnosis. As shown in Figure \ref{figure:SCDT}, our framework allows feature groups to be overlapping, providing flexibility in capturing features that may belong to multiple semantic or conceptual categories. Additionally, we define $G_{A}$ as a special feature group containing all features, which serves as a default group when prior domain knowledge is insufficient to establish meaningful feature groupings. Second, we define the topology of the binary tree using the feature groups, which restricts feature availability at each branching node and the depth of the tree. For example, in medical diagnosis, blood test results might be restricted to upper-level nodes while patient symptoms are considered at lower levels, following clinical diagnostic procedures. Finally, we run the SCDT where feature selection at each branching node is guided by predefined feature groups. This approach ensures the learning process follows the branching constraints imposed on the decision tree.

\begin{figure}[htbp]
\centering
\includegraphics[width=0.9\columnwidth]{figures/OPDT_01}
\caption{Structure Constrained Decision Tree (SCDT). The tree uses a breadth-first indexing scheme where nodes are numbered sequentially from the root (index 1) through internal nodes to leaf nodes. Feature groups $(G_{A}, G_{1}, G_{2}, \dots)$ can be overlapping, where $G_{A}$ denotes a special feature group containing all features, used when prior domain knowledge for defining feature groups is unavailable.}
\label{figure:SCDT}
\end{figure}

\subsection{Optimal Pattern Detection Tree}

In this section, we first introduce the volume-impurity index, which measures the coverage volume and misclassification of a rule, and then present the basic concepts of OPDT. Subsequently, we present the MIP formulation of OPDT. Finally, we describe branching structure constraints on OPDT and their enhancements to improve computational efficiency.

\paragraph{Metrics}

Prior RBML methods employ heuristic approaches with no optimality guarantees, focusing on precision maximization while treating coverage as a byproduct \citep{cendrowska1988prism}, balancing precision with rule complexity such as the number of rules and conditions per rule \citep{lakkaraju2016interpretable}, adopting a two-phase strategy that prioritizes coverage in the growing phase before shifting to precision improvement through pruning on separate validation sets \citep{furnkranz1994incremental, cohen1995fast}, or balancing accuracy and interpretability through Bayesian likelihood and prior terms \citep{wang2017bayesian}. 

Unlike these implicit or sequential evaluation approaches employed in heuristic rule learning, we define a novel metric designed for optimization-based frameworks that explicitly balances coverage volume against misclassification through a weight parameter $w$. This allows domain experts to specify their misclassification tolerance directly within the objective function before model training. Specifically, we propose the volume-impurity (VI) index, which balances classification volume and classification impurity and is defined as:

\begin{equation*}
\text{volume-impurity (VI) index} = \text{``volume gain''} - w \times \text{``misclassification loss''}
\end{equation*}

where $w \geq 1$ is the weight on misclassified samples, which controls precision.

In rule discovery, volume gain refers to the number of samples covered by the rule, while misclassification loss represents the number of incorrectly classified samples among the covered samples. The goal of OPDT is to identify a rule (i.e., a root-to-leaf path in a decision tree) that maximizes the VI index in the leaf node.
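
To illustrate how $w$ trades coverage against precision, consider two hypothetical candidate rules; the counts are chosen purely for exposition and do not come from any dataset used in this paper. Suppose rule $A$ covers 100 samples and misclassifies 20 of them, while rule $B$ covers 70 samples and misclassifies 5. Then
\begin{align*}
w = 1: \quad & \mathrm{VI}(A) = 100 - 1 \times 20 = 80, & \mathrm{VI}(B) &= 70 - 1 \times 5 = 65, \\
w = 3: \quad & \mathrm{VI}(A) = 100 - 3 \times 20 = 40, & \mathrm{VI}(B) &= 70 - 3 \times 5 = 55.
\end{align*}
With $w = 1$ the broader rule $A$ attains the larger VI index, whereas with $w = 3$ the more precise rule $B$ is preferred.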

\paragraph{Approach}

The primary focus in rule discovery is identifying patterns and regularities within a dataset. In this paper, we aim to identify an optimal rule that maximizes coverage volume while controlling misclassification through a given weight. We utilize a decision tree approach to discover this optimal rule, where each rule corresponds to a root-to-leaf path that maximizes the VI index through nested if-then-else structures handling negation scenarios while incorporating prior knowledge via structural constraints. As shown in Figure \ref{figure:OPDT}, given feature groups (e.g., $G_1, G_2, \dots, G_D$) from prior domain knowledge or heuristic algorithms, we initially assume a fixed length of rule conditions, allowing us to assign feature groups to specific depths in the chain of rule conditions (e.g., $\{G_1 - G_2 - \dots - G_D\}$) under the SCDT framework. While SCDT allows different feature groups at each branching node as shown in Figure \ref{figure:SCDT}, OPDT enforces the same feature group at each level of the decision tree to systematically explore all negation combinations (i.e., $\geq$ or $<$ for each feature within a feature group) across the chain of rule conditions. This determines the topology of the nested if-then-else tree. Using discrete optimization, we formulate our rule learning framework as a decision tree where leaf nodes are measured by the VI index and find rule conditions that maximize this value, yielding the optimal pattern among all possible combinations of rule conditions. We call this approach the Optimal Pattern Detection Tree.

\begin{figure}[htbp]
\centering
\includegraphics[width=0.75\columnwidth]{figures/OPDT_02}
\caption{Optimal Pattern Detection Tree (OPDT). All nodes at the same depth are restricted to use identical feature groups for the chain of rule conditions $\{G_1 - G_2 - \dots - G_D\}$.}
\label{figure:OPDT}
\end{figure}

\paragraph{Mathematical Formulation}

We use discrete optimization to solve the OPDT problem. Since OPDT aims to discover rule conditions rather than making predictions based on majority class of training samples in leaf nodes, the approach discards most nodes except for the node with maximal VI value. It extracts splitting conditions along the path from root to the leaf node with maximal VI value to form the IF part of the decision rule. The learned rule is represented as a chain of conditions and a label prediction. When a sample meets all conditions in the IF part, we say the rule fires, and we define such a sample as covered by the rule. The rule is then evaluated by two metrics: coverage (measured by fired volume) and precision (measured by misclassification loss). We formulate OPDT as a mixed-integer program as follows:

\small
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{xltabular}{1.0\textwidth}{@{} >{\hsize=0.11\hsize\centering\arraybackslash}X | X @{}}
\caption{Sets for OPDT} \label{table:OPDT_sets} \\
  \toprule
  Symbol & Definition\\
  \midrule
  \endfirsthead

  \caption[]{Table \thetable{} continued from previous page} \\
  \toprule
  Symbol & Definition\\
  \midrule
  \endhead

  \midrule
  \multicolumn{2}{r}{Continued on next page} \\
  \endfoot

  \bottomrule
  \endlastfoot
  $P$  & set of features in a dataset with one-hot encoded categorical variables\\ %
  $N$  & set of samples in a dataset\\
  $x_{i}$  & $i$th sample in a dataset, $i \in N$\\
  $y_{i}$  & $i$th sample's label in a dataset, $i \in N$\\
  $K$  & set of class labels\\
  $k$  & class label, $k \in K$\\
  $Y_{ik}$  & one-hot encoded matrix on $y$, $\mathbf{1}$\{$y_{i} = k$\}, $\forall k \in K, i \in N$\\
  $D$  & maximum depth of the optimal decision tree\\
  $T$  & maximum number of nodes in a tree of depth $D$, $T=2^{(D+1)} - 1$\\
  $T_{B}$  & set of branch nodes with split $a^{T}x < b$, $T_{B}=\{1, \dots, \lfloor T/2 \rfloor \}$\\
  $T_{L}$  & set of leaf nodes with class prediction, $T_{L}=\{\lfloor T/2 \rfloor + 1, \dots, T \}$\\
  $T_{obj}$ & set of target leaf nodes where the VI index is evaluated in the objective function,  $T_{obj} \subseteq T_{L}$\\ %
  $t$  & index of each node, $t \in \{ 1, \dots, T \}$\\
  $w$  & weight on misclassified samples in a node, $w \geq 1$\\
  $p(t)$  & parent node of node $t$\\
  $A(t)$  & set of ancestors of node $t$, $A(t) = A_{L}(t) \cup A_{R}(t)$\\
  $A_{L}(t)$  & set of left-branch ancestors of node $t$ whose left branch has been followed on the path from the root node to node $t$\\
  $A_{R}(t)$  & set of right-branch ancestors of node $t$ whose right branch has been followed on the path from the root node to node $t$\\
  $E_{L}(t)$  & set of all leaf nodes from a left child of node $t$\\ %
  $E_{R}(t)$  & set containing two rightmost leaf nodes, one from each child (left and right) of node $t$\\ %
  $B(\text{split})[t]$  & binary value indicating whether branching node $t$ in $T_{B}$ contains a split, $|B(\text{split})| = |T_{B}|$\\
  $B(\text{group})[t]$  & feature group assigned to branching node $t$ in $T_{B}$, $|B(\text{group})| = |T_{B}|$\\
  $L(\text{target})[t]$  & binary value indicating whether leaf node $t$ in $T_{L}$ is a target leaf node, $|L(\text{target})| = |T_{L}|$\\
\end{xltabular}
\normalsize
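
As a small worked instance of this notation, take $D = 2$. The child-indexing convention used here (left child $2t$, right child $2t+1$, hence $p(t) = \lfloor t/2 \rfloor$) is an assumption that is consistent with the breadth-first numbering in Figure \ref{figure:SCDT}; it is not stated explicitly in the table. Under this assumption,
\begin{align*}
T &= 2^{(2+1)} - 1 = 7, & T_{B} &= \{1, 2, 3\}, & T_{L} &= \{4, 5, 6, 7\}, \\
A_{L}(5) &= \{1\}, & A_{R}(5) &= \{2\}, & A(5) &= \{1, 2\}, \\
E_{L}(1) &= \{4, 5\}, & E_{R}(1) &= \{5, 7\}. &&
\end{align*}
Leaf 5 is reached by following the left branch at node 1 and the right branch at node 2. Reading the definitions literally, $E_{L}(1)$ collects the leaves below the left child of the root (node 2), and $E_{R}(1)$ takes the rightmost leaf below each child of the root (leaf 5 below node 2 and leaf 7 below node 3).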

\small
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{xltabular}{1.0\textwidth}{@{} >{\hsize=0.11\hsize\centering\arraybackslash}X | X @{}}
\caption{Variables for OPDT} \label{table:OPDT_variables} \\
  \toprule
  Symbol & Definition\\
  \midrule
  \endfirsthead

  \caption[]{Table \thetable{} continued from previous page} \\
  \toprule
  Symbol & Definition\\
  \midrule
  \endhead

  \midrule
  \multicolumn{2}{r}{Continued on next page} \\
  \endfoot

  \bottomrule
  \endlastfoot
  $a_{pt}$  & slope element of split applied at node $t$, $\forall p \in P, \forall t \in T_{B}$\\
  $\vec{a}_{t}$  & slope vector of split applied at node $t$, where $\vec{a}_{t} = (a_{1t}, a_{2t}, \dots, a_{|P|t})$, $\forall t \in T_{B}$ \\
  $b_{t}$  & intercept of split applied at node $t$, $\forall t \in T_{B}$ \\
  $d_{t}$  & $\mathbf{1}$\{node $t$ applies a split\}, $\forall t \in T_{B}$ \\
  $z_{it}$  & $\mathbf{1}$\{$x_{i}$ is in node $t$\}, $\forall i \in N, \forall t \in T_{L}$ \\
  $l_{t}$  & $\mathbf{1}$\{leaf $t$ contains any sample\}, $\forall t \in T_{L}$ \\
  $N_{t}$  & total number of samples in node $t$, $\forall t \in T_{L}$ \\
  $N_{kt}$  & number of samples of label $k$ in node $t$, $\forall k \in K, \forall t \in T_{L}$\\
  $c_{kt}$  & $\mathbf{1}$\{the most common label in node $t$ is $k$\}, $c_{kt} = \mathbf{1}$\{$\argmax_{k' \in K} \{N_{k't} \} = k$\}, $\forall k \in K, \forall t \in T_{L}$ \\
  $L_{t}$  & number of misclassified samples in leaf node $t$, $L_{t} = N_{t} - M_{t}$, where $M_{t} = \max_{k \in K} N_{kt}$ is the number of samples carrying the most common label in node $t$, $\forall t \in T_{obj}$\\
  $I_{t}$  & volume-impurity (VI) value of a leaf node, calculated as the total number of samples minus the weighted number of misclassified samples, $I_{t} = N_{t} - w L_{t}$, $\forall t \in T_{obj}$\\
  $I_{max}$  & maximum value of $I_{t}$ over all target leaf nodes, $I_{max} = \max_{t \in T_{obj}} I_{t}$\\
  $q_{t}$  & binary variable indicating whether leaf node $t$ has the maximum VI index, $\forall t \in T_{obj}$\\
\end{xltabular}
\normalsize

\begin{equation}
\max I_{max} \label{obj:OPDT}
\end{equation}
\allowdisplaybreaks
\begin{align}
\text{subject to: } \nonumber \\
\sum_{p=1}^{P} a_{pt} &= d_{t}, \quad \forall t \in T_{B} \label{con:split_on_off}\\
d_{t} & \leq d_{p(t)}, \quad \forall t \in T_{B} \setminus \{1\} \label{con:tree}\\
l_{s} & \leq d_{t}, \quad \forall t \in T_{B}, \forall s \in E_{L}(t) \label{con:branch_off_left}\\
l_{s} & \geq d_{t}, \qquad \forall t \in T_{B}, \forall s \in E_{R}(t) \label{con:branch_on_rightmost}\\
z_{it} & \leq l_{t}, \quad \forall i \in N, \forall t \in T_{L} \label{con:leafnode_on_off}\\
N_{t} &= \sum_{i=1}^{n} z_{it}, \quad \forall t \in T_{L} \label{con:num_samples}\\
\sum_{t\in T_{L}} z_{it} & = 1, \quad \forall i \in N \label{con:only_one}\\
\begin{split}
\vec{a}_{s}^{T} (x_{i} + \vec{\epsilon}) &\leq b_{s} + (1 + \epsilon_{max}) (1 - z_{it}), \\
& \qquad \qquad \qquad \forall i \in N, \forall t \in T_{L}, \forall s \in A_{L}(t) \label{con:left_branch}
\end{split}\\
\begin{split}
\vec{a}_{s}^{T} x_{i} &\geq b_{s} - (1 - z_{it}), \\
& \qquad \qquad \qquad \forall i \in N, \forall t \in T_{L}, \forall s \in A_{R}(t) \label{con:right_branch}
\end{split}\\
\sum_{k=1}^{K} c_{kt} &= l_{t}, \quad \forall t \in T_{L} \label{con:single_prediction}\\
N_{kt} &= \sum_{i=1}^{n} Y_{ik}z_{it}, \quad \forall k \in K, t \in T_{L} \label{con:num_samples_k}\\
\begin{split}
L_{t} &\geq N_{t} - N_{kt} - M (1 - c_{kt}), \\
& \qquad \qquad \qquad \qquad \qquad \quad \forall k \in K, \forall t \in T_{obj} \label{con:misclassification_loss_geq}
\end{split}\\
\begin{split}
L_{t} &\leq N_{t} - N_{kt} + M c_{kt}, \\
& \qquad \qquad \qquad \qquad \qquad \quad \forall k \in K, \forall t \in T_{obj} \label{con:misclassification_loss_leq}
\end{split}\\
I_{max} & \geq N_{t} - w L_{t} , \qquad \forall t \in T_{obj} \label{con:vi_geq}\\
I_{max} & \leq N_{t} - w L_{t} + M (1 - q_{t}), \qquad \forall t \in T_{obj} \label{con:vi_leq}\\
\sum_{t\in T_{obj}} q_{t} & = 1 \label{con:vi_only_one}\\
d_{t} \in & \{0, 1\}, \quad \forall t \in T_{B} \label{var:d_t}\\
a_{pt} \in & \{0, 1\}, \quad \forall p \in P, \forall t \in T_{B} \label{var:a_pt}\\
0 \leq b_{t} & \leq d_{t}, \quad \forall t \in T_{B} \label{var:b_t}\\
z_{it} \in & \{0, 1\}, \quad \forall i \in N, \forall t \in T_{L} \label{var:z_it}\\
l_{t} \in & \{0, 1\}, \quad \forall t \in T_{L} \label{var:l_t}\\
c_{kt} \in & \{0, 1\}, \quad \forall k \in K, \forall t \in T_{L} \label{var:c_kt}\\
N_{t} \in & \mathbb{N}, \quad \forall t \in T_{L} \label{var:N_t}\\
N_{kt} \in & \mathbb{N}, \quad \forall k \in K, \forall t \in T_{L} \label{var:N_kt}\\
L_{t} \in & \mathbb{N}, \quad \forall t \in T_{obj} \label{var:L_t}\\
q_{t} \in & \{0, 1\}, \quad \forall t \in T_{obj} \label{var:q_t}\\
I_{\max} \in & \mathbb{R} \label{var:I_max}
\end{align}

Our OPDT formulation adopts three fundamental types of constraints from \citet{bertsimas2017optimal}'s OCT: tree structure constraints (split decisions at branching nodes), sample routing constraints (directing samples from root to leaf), and misclassification constraints (counting misclassified samples at each leaf node). On top of these, we introduce constraints that force samples to rightmost leaf nodes when parent nodes do not split, correcting sample routing behavior for accurate VI evaluation at branching-terminated parent nodes, and constraints that linearize the logical condition for maximal VI selection among leaf nodes. The objective (\ref{obj:OPDT}) maximizes volume on one of the leaf nodes while minimizing misclassification loss through the VI index, which combines volume and misclassification loss with weight $w$. Constraint (\ref{con:split_on_off}) ensures that exactly one slope variable $a_{pt}$ is active at a branching node if the node applies a split, and none otherwise. Constraint (\ref{con:tree}) allows a branch node to split only if its parent node splits. Constraint (\ref{con:branch_off_left}) ensures that all leaf nodes from a left child of a branching node have zero samples if the node does not split, forcing samples to be assigned to the rightmost leaf nodes. Constraint (\ref{con:branch_on_rightmost}) ensures that the two rightmost leaf nodes from both left and right children of a branching node must be active if the branching node splits. Constraint (\ref{con:leafnode_on_off}) guarantees that samples can be assigned to a leaf node only if the node is active. Constraint (\ref{con:num_samples}) counts the total number of samples at leaves. Constraint (\ref{con:only_one}) forces each sample to be assigned to exactly one leaf. 
Constraints (\ref{con:left_branch}) and (\ref{con:right_branch}) formulate the branching rules at each node, where $\vec{\epsilon}$ denotes a vector of sufficiently small positive values $\epsilon_p$ for each feature $p \in P$ and $\epsilon_{\max} = \max_p \epsilon_p$. These $\epsilon_p$ and $\epsilon_{\max}$ arise from a technical necessity in MIP formulations: strict inequality constraints of the form $a_t^\top x_i < b_t$ are not directly supported by MIP solvers and must be converted to non-strict form. Practically, $\epsilon_p$ can be any sufficiently small positive number that does not cause numerical instabilities in the MIP solver. The largest valid value of $\epsilon_p$ is the smallest non-zero distance between adjacent values of feature $p$ in the data, and we use this largest valid value for numerical stability. The quantity $1 + \epsilon_{\max}$ acts as the big-M constant in Constraint (\ref{con:left_branch}) since the maximum possible value of $a_t^\top(x_i + \vec{\epsilon}) - b_t$ is $1 + \epsilon_{\max}$. Constraints (\ref{con:misclassification_loss_geq}) and (\ref{con:misclassification_loss_leq}) represent misclassification loss in each target leaf node, where $M$ denotes a big-M constant. Here, we take $M = |N|$ as a sufficiently large value since the misclassification loss at any leaf node is bounded by $|N|$. Constraints (\ref{con:vi_geq}), (\ref{con:vi_leq}), and (\ref{con:vi_only_one}) represent the maximum value of the VI index over all target leaf nodes, where $M = w |N|$ is a sufficiently large value for Constraint (\ref{con:vi_leq}). Decision variables (\ref{var:d_t}), (\ref{var:a_pt}), and (\ref{var:b_t}) specify the branching structure, where $d_t$ indicates whether node $t$ has a split, $a_{pt}$ selects features for the split, and $b_t$ defines the threshold for split $a^{T}x < b$ at each branch node. Variables (\ref{var:z_it}), (\ref{var:l_t}), and (\ref{var:c_kt}) determine sample assignments to leaf nodes, leaf node activation, and class predictions, respectively. 
Count variables (\ref{var:N_t}), (\ref{var:N_kt}), and (\ref{var:L_t}) track the number of samples in each leaf node, the number of samples per class in each leaf node, and the number of misclassified samples in target leaf nodes, respectively. Binary variable (\ref{var:q_t}) indicates whether leaf node $t$ has the maximum VI index. Finally, $I_{\max}$ (\ref{var:I_max}) represents the maximum VI index value across all target leaf nodes. For a fixed tree depth $D$, the formulation scales linearly in both $|N|$ and $|P|$, but exponentially in $D$, an inherent characteristic shared with \citet{bertsimas2017optimal}'s OCT. Since the new constraints and variables introduced in OPDT add only $O(2^D)$ additional variables and constraints, which are dominated by the OCT routing constraints, they do not alter the asymptotic complexity. Therefore, OPDT maintains the same $O(2^D \cdot (|P| + |N|))$ model size complexity as OCT.
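
As a purely illustrative example of the VI linearization (the numbers below are hypothetical and are not taken from the experiments), consider $w = 10$ and two target leaf nodes $T_{obj} = \{t_1, t_2\}$ with $N_{t_1} = 60$, $L_{t_1} = 2$ and $N_{t_2} = 100$, $L_{t_2} = 9$. Then
\begin{align*}
I_{t_1} &= N_{t_1} - w L_{t_1} = 60 - 10 \cdot 2 = 40, \\
I_{t_2} &= N_{t_2} - w L_{t_2} = 100 - 10 \cdot 9 = 10 .
\end{align*}
Constraint (\ref{con:vi_geq}) requires $I_{\max} \geq 40$ and $I_{\max} \geq 10$. Choosing $q_{t_1} = 1$ and $q_{t_2} = 0$ in Constraints (\ref{con:vi_leq}) and (\ref{con:vi_only_one}) gives $I_{\max} \leq 40$ and $I_{\max} \leq 10 + M$, so $I_{\max} = 40$ whenever $M \geq 30$. The opposite choice $q_{t_2} = 1$ would require $I_{\max} \leq 10$, which contradicts $I_{\max} \geq 40$ and is therefore infeasible. In this example the smaller but purer leaf $t_1$ is selected over the larger leaf $t_2$.

The choice of $\epsilon_p$ can be illustrated in the same hypothetical spirit. Suppose a feature $p$, assumed here to be scaled to $[0, 1]$, takes the distinct values $0.10$, $0.40$, $0.45$, and $0.90$. The gaps between adjacent values are $0.30$, $0.05$, and $0.45$, so $\epsilon_p = 0.05$. If node $s$ splits on feature $p$ alone, routing the sample with value $0.40$ to the left requires $0.40 + 0.05 \leq b_s$ by Constraint (\ref{con:left_branch}), and routing the sample with value $0.45$ to the right requires $0.45 \geq b_s$ by Constraint (\ref{con:right_branch}), so that $b_s = 0.45$ separates the two adjacent values.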

\paragraph{Branching Structure Constraints}

We incorporate Branching Structure Constraints (BSC) into OPDT to integrate prior knowledge from domain experts. These constraints aim to improve the interpretability and compliance of the resulting rule while enhancing computational efficiency. We design a decision tree that complies with BSC to control: 1) the tree topology through splitting decisions and 2) feature allocation at each branching node through predefined feature groups. Specifically, we define two arrays for branching nodes: $B(\text{split})$, which indicates whether each branching node $t$ needs to be split or not, and $B(\text{group})$, which specifies the feature group $G_j$ for branching node $t$, where $G_j$ is a subset of features used for branching. These arrays enable us to construct the desired tree structure and feature allocations over branching nodes to solve the OPDT problem. For leaf nodes, $L(\text{target})$ is a binary array that identifies which leaf nodes are target nodes in the objective function, determined by a domain expert to be aligned with $B(\text{split})$. We assume that feature subgroups for $B(\text{group})$ can be determined by domain expert knowledge. However, this feature grouping is optional as mentioned in Section \ref{sec:SCDT}. When predefined feature groups are unavailable, OPDT has the flexibility to use a default feature group $G_{A}$ containing all features. By assigning $G_{A}$ to all branching nodes, OPDT can enumerate all feature combinations without any subgrouping constraints (e.g., not even requiring separation into numerical and categorical features). Finally, BSC adds constraints on $d_t$, $a_{pt}$, $b_t$, and $T_{obj}$ from Algorithm \ref{alg:BSC} to the main OPDT MIP formulation.

In this paper, we use only the basic numerical-categorical group separation without considering domain-specific knowledge for feature grouping. Specifically, we define $G_{N}$ as the set of numerical feature groups (i.e., feature groups containing only numerical features) and $G_{C}$ as the set of categorical feature groups (i.e., feature groups containing only one-hot encoded categorical features). $P_{N}(g)$ denotes the set of features in numerical feature group $g \in G_{N}$ and $P_{C}(g)$ denotes the set of features in categorical feature group $g \in G_{C}$, where $P_{N}(g), P_{C}(g) \subseteq P$. This basic grouping strategy shows that, even without sophisticated domain knowledge, OPDT can achieve significant computational gains through feature subgrouping, as reported in Section \ref{sec:computational_results}.

\begin{algorithm}[htbp]
  \caption{Branching Structure Constraints (BSC)}
  \label{alg:BSC}
  \textbf{Input:} $B(\text{split}), B(\text{group}), L(\text{target})$\\
   \tcp{$B(\text{split})$: binary array to indicate branching splits}
   \tcp{$B(\text{group})$: subset of features allocated to each branching node}
   \tcp{$L(\text{target})$: binary array to indicate target leaf nodes}
   \tcp{$P_{N}(g)$: set of features in numerical feature group $g \in G_{N}$}
   \tcp{$P_{C}(g)$: set of features in categorical feature group $g \in G_{C}$}

  \ForEach{$t \in T_{B}$} {
    $d_{t} = B(\text{split})[t]$\\
    $g \gets B(\text{group})[t]$\\
    \uIf{$g = G_{A}$}{
      $a_{pt} \geq 0, \qquad \forall p \in P$
    }\uElseIf{$g \in G_{N}$}{
      $a_{pt} = 0, \qquad \forall p \in P \setminus P_{N}(g)$
    }\uElseIf{$g \in G_{C}$}{
      $a_{pt} = 0, \qquad \forall p \in P \setminus P_{C}(g)$; $b_{t} = 0.5$ 
    }
  }
    $T_{obj} = \{t \in T_{L} \mid L(\text{target})[t] = 1 \}$\\
\end{algorithm}
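
As a small illustration of Algorithm \ref{alg:BSC}, consider a hypothetical configuration with $D = 2$, so that $T = 7$, $T_{B} = \{1, 2, 3\}$, and $T_{L} = \{4, 5, 6, 7\}$. The index sets below assume the heap-style node numbering of OCT, in which node $t$ has children $2t$ and $2t + 1$. This numbering is an assumption made for the example. Under it,
\begin{align*}
A_{L}(4) &= \{1, 2\}, & A_{R}(4) &= \emptyset, & E_{L}(1) &= \{4, 5\}, & E_{R}(1) &= \{5, 7\}, \\
A_{L}(7) &= \emptyset, & A_{R}(7) &= \{1, 3\}, & E_{L}(3) &= \{6\}, & E_{R}(2) &= \{4, 5\}.
\end{align*}
Suppose $B(\text{split}) = (1, 1, 0)$, $B(\text{group}) = (g_{C}, g_{N}, G_{A})$ with some $g_{C} \in G_{C}$ and $g_{N} \in G_{N}$, and $L(\text{target}) = (1, 0, 0, 0)$. Algorithm \ref{alg:BSC} then fixes $d_{1} = d_{2} = 1$ and $d_{3} = 0$, sets $a_{p1} = 0$ for all $p \in P \setminus P_{C}(g_{C})$ together with $b_{1} = 0.5$, sets $a_{p2} = 0$ for all $p \in P \setminus P_{N}(g_{N})$, and yields $T_{obj} = \{4\}$. Since $d_{3} = 0$, Constraint (\ref{con:branch_off_left}) gives $l_{6} \leq 0$, so samples that take the right branch at the root are collected in leaf $7$, while Constraint (\ref{con:branch_on_rightmost}) makes leaves $4$, $5$, and $7$ active. The target leaf $4$ is reached through the left branch at both nodes $1$ and $2$, so in this example it corresponds to a rule of the form $x_{p} < 0.5$ and $x_{p'} < b_{2}$ for one selected feature $p \in P_{C}(g_{C})$ and one selected feature $p' \in P_{N}(g_{N})$.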

\section{Computational Results}
\label{sec:computational_results}
In this section, we evaluate the performance of OPDT on datasets from the UCI Machine Learning Repository \citep{misc_uci}. Since the purpose of this paper is to identify a single pattern in data that maximizes coverage while minimizing the false positive rate due to misclassification, we evaluate only the best candidate rule from each ruleset extracting model that generates multiple rules, such as BRS \citep{wang2017bayesian}, IDS \citep{lakkaraju2016interpretable}, IREP \citep{furnkranz1994incremental}, PRISM \citep{cendrowska1988prism}, and Ripper \citep{cohen1995fast}. All experiments were conducted on an Apple M2 system with an 8-core CPU and 8GB of RAM. In all experiments, we set the weight $w$ for the VI index to 10 as the default value.

\subsection{Datasets}
For our experiments, we use 15 publicly available datasets from the UCI Machine Learning Repository \citep{misc_uci}, as detailed in Table \ref{table:UCI_dataset}. Because healthcare and finance are domains that require high interpretability and reliability, we focus on datasets from these two domains \citep{misc_statlog_australian_credit_approval_143, misc_statlog_german_credit_data_144, blood_transfusion_service_center_176, misc_breast_cancer_wisconsin_diagnostic_17, chronic_kidney_disease_336, early_stage_diabetes_risk_prediction_529, echocardiogram_38, fertility_244, misc_heart_disease_45, heart_failure_clinical_records_519, hepatitis_46, ilpd_indian_liver_patient_dataset_225, parkinsons_174, pima_indians_diabetes, thoracic_surgery_data_277}. We apply an 80/20 train-test split for our experiments.

\footnotesize
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{xltabular}{0.8\textwidth}{@{} >{\hsize=0.35\hsize\centering\arraybackslash}X | >{\hsize=0.2\hsize\centering\arraybackslash}X >{\hsize=0.17\hsize\centering\arraybackslash}X >{\hsize=0.15\hsize\centering\arraybackslash}X >{\hsize=0.25\hsize\centering\arraybackslash}X @{}}
\caption{UCI Dataset}
\label{table:UCI_dataset} \\
  \toprule
  \textbf{Dataset} & n\_instances (n\_missing) & n\_features (num, cat) & weights (neg, pos) & description \\
  \endfirsthead

  \multicolumn{5}{c}{Table \thetable{} continued from previous page} \\
  \toprule
  \textbf{Dataset} & n\_instances (n\_missing) & n\_features (num, cat) & weights (neg, pos) & description \\
  \midrule
  \endhead

  \midrule
  \multicolumn{5}{r}{Continued on next page} \\
  \endfoot

  \bottomrule
  \endlastfoot
  \midrule
  Australian           & 653 (37)   & 15 (6, 9)   & (357, 296) & credit approval \\
  German               & 1000 (0)   & 20 (7, 13)  & (300, 700) & credit approval \\
  Blood Transfusion    & 748 (0)    & 4 (4, 0)    & (570, 178) & healthcare \\
  Breast Cancer        & 569 (0)    & 30 (30, 0)  & (212, 357) & healthcare \\
  Chronic Kidney       & 215 (185)  & 24 (11, 13) & (128, 87)  & healthcare \\
  Early Stage Diabetes & 520 (0)    & 16 (1, 15)  & (320, 200) & healthcare \\
  Echocardiogram       & 62 (69)    & 7 (4, 3)    & (44, 18)   & healthcare \\
  Fertility            & 100 (0)    & 9 (2, 7)    & (88, 12)   & healthcare \\
  Heart Disease        & 297 (6)    & 13 (6, 7)   & (160, 137) & healthcare \\
  Heart Failure        & 299 (0)    & 12 (7, 5)   & (203, 96)  & healthcare \\
  Hepatitis            & 80 (75)    & 19 (6, 13)  & (67, 13)   & healthcare \\
  Indian Liver Patient & 579 (4)    & 10 (9, 1)   & (414, 165) & healthcare \\
  Parkinsons           & 195 (0)    & 23 (22, 1)  & (147, 48)  & healthcare \\
  Pima Indians Diabetes & 768 (0)   & 8 (8, 0)    & (500, 268) & healthcare \\
  Thoracic Surgery     & 470 (0)    & 16 (3, 13)  & (400, 70)  & healthcare \\
\end{xltabular}
\normalsize

\subsection{Performance Enhancements}

Generally, solving an ODT is known to be an NP-hard problem \citep{laurent1976constructing}. To address this computational challenge, we propose three complementary techniques to enhance performance: 1) Branching Structure Constraints (BSC), which leverage prior knowledge about feature relationships to reduce the search space, 2) warmstart initialization, which provides a quality initial solution to accelerate convergence, and 3) branching priority orders, which guide the search process by prioritizing promising decision variables. These techniques work synergistically: BSC narrows the feasible region, warmstart provides a good starting point within this reduced space, and priority ordering helps efficiently navigate toward optimal solutions. Our empirical results demonstrate that this combination significantly reduces computational time while maintaining or improving solution quality across diverse datasets.

First, BSC enhances both interpretability and computational efficiency by imposing meaningful constraints on the branching structure of decision trees. Table \ref{table:OPDT_by_BSC}, obtained with a 10-minute solver time limit, demonstrates significant computational savings when prior knowledge about rule structure is incorporated. The impact is particularly dramatic for certain rule structures. For instance, employing \texttt{\{categorical\}--\{numerical\}} constraints reduces runtime from 1232 seconds to 23 seconds for Heart Disease and from 924 seconds to 21 seconds for Thoracic Surgery. Furthermore, decomposing features into numerical and categorical subgroups proves highly effective. In the case of Chronic Kidney Disease, this systematic decomposition strategy reduces the total runtime from 978 seconds (using all features) to approximately 45 seconds (summed over the four feature decompositions). Beyond runtime improvements, the systematic feature decomposition assists in feature pruning by identifying essential structural patterns for optimality. For example, in Thoracic Surgery, while optimality of the solution with $VI=42$ remains uncertain because the time limit was hit on one of the feature decompositions, any potential better solution must exist under the \texttt{\{numerical\}--\{numerical\}} rule structure, as all other feature combinations involving categorical features were solved to optimality within the time limit. This effectively eliminates categorical features from the search space for an optimal solution. Such insights significantly reduce the search space by allowing us to focus computational resources on the most promising feature combinations. In real-world applications, prior domain knowledge could enable even finer feature groupings beyond the basic numerical-categorical group separation. Therefore, BSC not only enhances computational efficiency but also provides valuable guidance toward optimal solutions through strategic feature decomposition.
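
To make the Chronic Kidney comparison explicit, the runtimes reported in Table \ref{table:OPDT_by_BSC} for the four decompositions (num--num, num--cat, cat--num, and cat--cat) sum to
\begin{align*}
23.35 + 16.66 + 4.64 + 0.57 &= 45.22 \text{ seconds}, &
\frac{977.94}{45.22} &\approx 21.6,
\end{align*}
that is, roughly a 21.6-fold reduction relative to the all--all configuration. The best VI over the four decompositions, $\max\{93, 99, 99, 88\} = 99$, matches the VI of 99 obtained under all--all, so for this dataset the decomposition does not lose solution quality in the reported run.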

\footnotesize
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{xltabular}{1.0\textwidth}{@{} >{\hsize=1.0\hsize\centering\arraybackslash}X | >{\hsize=0.5\hsize\centering\arraybackslash}X >{\hsize=0.3\hsize\centering\arraybackslash}X | >{\hsize=0.5\hsize\centering\arraybackslash}X >{\hsize=0.3\hsize\centering\arraybackslash}X | >{\hsize=0.5\hsize\centering\arraybackslash}X >{\hsize=0.3\hsize\centering\arraybackslash}X | >{\hsize=0.5\hsize\centering\arraybackslash}X >{\hsize=0.3\hsize\centering\arraybackslash}X | >{\hsize=0.5\hsize\centering\arraybackslash}X >{\hsize=0.3\hsize\centering\arraybackslash}X  @{}}
\caption{OPDT Runtime (in seconds) and VI by Branching Structure Constraints (BSC)} \label{table:OPDT_by_BSC} \\
  \toprule
  & \multicolumn{10}{c}{Structure Constraints (solver time limit = 600 sec)} \\
  \cmidrule(lr){2-11}
  & \multicolumn{2}{c}{all--all} & \multicolumn{2}{c}{num--num} & \multicolumn{2}{c}{num--cat} & \multicolumn{2}{c}{cat--num} & \multicolumn{2}{c}{cat--cat} \\
  \cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7} \cmidrule(lr){8-9} \cmidrule(lr){10-11} 
  Dataset & Runtime & VI & Runtime & VI & Runtime & VI & Runtime & VI & Runtime & VI \\
  \endfirsthead

  \multicolumn{11}{c}{Table \thetable{} continued from previous page} \\
  \toprule
  & \multicolumn{10}{c}{Structure Constraints (solver time limit = 600 sec)} \\
  \cmidrule(lr){2-11}
  & \multicolumn{2}{c}{all--all} & \multicolumn{2}{c}{num--num} & \multicolumn{2}{c}{num--cat} & \multicolumn{2}{c}{cat--num} & \multicolumn{2}{c}{cat--cat} \\
  \cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7} \cmidrule(lr){8-9} \cmidrule(lr){10-11} 
  Dataset & Runtime & VI & Runtime & VI & Runtime & VI & Runtime & VI & Runtime & VI \\
  \midrule
  \endhead

  \midrule
  \multicolumn{11}{r}{Continued on next page} \\
  \endfoot

  \bottomrule
  \endlastfoot
  \midrule
Australian              & 600.04 & 143 & 982.97 & 49 & 975.71 & 143 & 45.66 & 143 & 7.29 & 124 \\
German                  & 1475.81 & 31 & 631.47 & 35 & 1065.43 & 45 & 1006.65 & 45 & 968.56 & 44 \\
Chronic Kidney          & 977.94 & 99 & 23.35 & 93 & 16.66 & 99 & 4.64 & 99 & 0.57 & 88 \\
Early Stage Diabetes    & 23.07 & 151 & 3.53 & 25 & 6.88 & 130 & 1.32 & 130 & 8.27 & 151 \\
Echocardiogram          & 12.08 & 13 & 2.03 & 13 & 0.85 & 13 & 0.19 & 13 & 0.05 & 6 \\
Fertility               & 5.33 & 31 & 0.36 & 31 & 0.74 & 30 & 0.23 & 30 & 0.11 & 24 \\
Heart Disease           & 1232.83 & 31 & 1061.62 & 26 & 947.29 & 31 & 22.67 & 31 & 1.93 & 17 \\
Heart Failure           & 1017.01 & 61 & 1410.82 & 61 & 4.01 & 21 & 79.37 & 21 & - & - \\
Hepatitis               & 10.70 & 40 & 1.82 & 40 & 0.50 & 38 & 0.15 & 38 & 0.11 & 35 \\
Indian Liver Patient    & 1421.23 & 75 & 600.02 & 75 & 0.41 & 45 & 6.18 & 45 & - & - \\
Parkinsons              & 600.02 & 86 & 954.41 & 86 & 4.08 & 64 & 0.96 & 64 & - & - \\
Thoracic Surgery        & 924.10 & 39 & 964.28 & 34 & 77.15 & 42 & 20.54 & 42 & 2.34 & 42 \\
\end{xltabular}
\normalsize

For warmstart initialization, we develop a novel heuristic algorithm adapted from CART \citep{breiman1984classification} for structure constraints, called Branching Structure Constrained Classification and Regression Tree (BSCCART), as no existing heuristics for structure-constrained decision trees are available. BSCCART modifies CART's branching criteria by restricting feature selection at each node to comply with the branching structure constraints. By applying the same structure constraints used in OPDT, the BSCCART solution can serve as a warmstart for OPDT. The performance comparison between BSCCART and OPDT is presented in Tables \ref{table:OPDT_train_details} and \ref{table:OPDT_test_details}. For example, Figure \ref{figure:BSCART_OPDT} illustrates that the pattern discovered by BSCCART is improved by OPDT on the German dataset. While BSCCART produces a best VI of 8, OPDT improves this to 32 in 12 seconds (highlighted in red), progressing toward the optimal solution. This example demonstrates how OPDT can improve upon a heuristic solution obtained from ML algorithms or prior knowledge. For branching priority orders, we assign higher priorities to topology variables (e.g., $a_{pt}$ and $d_{t}$), guiding the MIP solver to decide the tree structure variables before others.

\begin{figure}[htbp]
\centering
\includegraphics[width=0.9\columnwidth]{figures/OPDT_03}
\caption{VI improvement by OPDT over BSCCART on the German dataset given $w = 10$. BSCCART, a heuristic method, yields a best VI value of 8. OPDT enhances it to 32 (highlighted in red), achieved in 12 seconds.}
\label{figure:BSCART_OPDT}
\end{figure}

Table \ref{table:OPDT_by_Warmstrat_BPO} presents the computational runtime and VI values across different configurations of warmstart and branching priority orders, where \texttt{on/on} denotes the use of both strategies. Generally, the \texttt{on/on} configuration demonstrates superior or equivalent performance compared to other configurations, showing trends toward higher VI values or faster runtimes. This is particularly evident in datasets solved to optimality within the time limit, such as Early Stage Diabetes, Echocardiogram, Fertility, and Hepatitis, which show identical VI values across all configurations. For these datasets, the \texttt{on/on} configuration is the fastest or close to the fastest, indicating that warmstart and branching priority orders together contribute to faster computation times compared to the \texttt{off/off} configuration. Furthermore, the \texttt{on/on} configuration exhibits more stable performance with moderate standard deviations, while the \texttt{off/off} configuration often shows larger variability. However, the best configuration appears to be dataset-dependent, with some exceptions: Blood Transfusion achieves slightly better VI with \texttt{on/off}, Indian Liver Patient performs best with \texttt{on/off}, and Breast Cancer shows its best results with \texttt{off/on}. Overall, despite some dataset-specific variations, the combined use of warmstart and branching priority orders (\texttt{on/on}) enhances both solution quality and computational efficiency across diverse datasets.

\footnotesize
\begin{xltabular}{1.0\textwidth}{@{} >{\hsize=0.8\hsize\centering\arraybackslash}X | >{\hsize=0.4\hsize\centering\arraybackslash}X >{\hsize=0.4\hsize\centering\arraybackslash}X | >{\hsize=0.4\hsize\centering\arraybackslash}X >{\hsize=0.4\hsize\centering\arraybackslash}X | >{\hsize=0.4\hsize\centering\arraybackslash}X >{\hsize=0.4\hsize\centering\arraybackslash}X | >{\hsize=0.4\hsize\centering\arraybackslash}X >{\hsize=0.4\hsize\centering\arraybackslash}X @{}}
\caption{OPDT Runtime (in seconds) and VI by Warmstart and Branching Priority Orders. \textbf{Bold} represents the highest VI value} \label{table:OPDT_by_Warmstrat_BPO} \\
  \toprule
  & \multicolumn{8}{c}{Warmstart from BSCCART / Branching Priority Orders (solver time limit = 120 sec)} \\
  \cmidrule(lr){2-9}
  & \multicolumn{2}{c}{on / on} & \multicolumn{2}{c}{on / off} & \multicolumn{2}{c}{off / on} & \multicolumn{2}{c}{off / off}\\
  \cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7} \cmidrule(lr){8-9}
  Dataset & Runtime & VI & Runtime & VI & Runtime & VI & Runtime & VI \\
  \endfirsthead

  \multicolumn{9}{c}{Table \thetable{} continued from previous page} \\
  \toprule
  & \multicolumn{8}{c}{Warmstart from BSCCART / Branching Priority Orders (solver time limit = 120 sec)} \\
  \cmidrule(lr){2-9}
  & \multicolumn{2}{c}{on / on} & \multicolumn{2}{c}{on / off} & \multicolumn{2}{c}{off / on} & \multicolumn{2}{c}{off / off}\\
  \cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7} \cmidrule(lr){8-9}
  Dataset & Runtime & VI & Runtime & VI & Runtime & VI & Runtime & VI \\
  \midrule
  \endhead

  \midrule
  \multicolumn{9}{r}{Continued on next page} \\
  \endfoot

  \bottomrule
  \endlastfoot
  \midrule
Australian & 18.715 $(\pm16.804)$ & \textbf{134.300} $(\pm12.996)$ & 26.616 $(\pm33.923)$ & \textbf{134.300} $(\pm12.996)$ & 23.496 $(\pm21.194)$ & 132.400 $(\pm12.963)$ & 54.946 $(\pm49.832)$ & \textbf{134.300} $(\pm12.781)$ \\
German & 29.296 $(\pm39.142)$ & \textbf{53.800} $(\pm17.568)$ & 11.331 $(\pm20.068)$ & 51.300 $(\pm17.468)$ & 42.333 $(\pm40.723)$ & 33.800 $(\pm12.109)$ & 108.277 $(\pm297.212)$ & 24.600 $(\pm19.856)$ \\
Blood Transfusion & 7.574 $(\pm9.662)$ & 51.000 $(\pm10.509)$ & 18.458 $(\pm26.890)$ & \textbf{51.200} $(\pm10.443)$ & 9.565 $(\pm12.220)$ & 50.100 $(\pm12.387)$ & 6.575 $(\pm6.100)$ & 50.800 $(\pm10.696)$ \\
Breast Cancer & 37.938 $(\pm30.127)$ & 230.100 $(\pm5.782)$ & 29.135 $(\pm19.873)$ & 230.000 $(\pm5.774)$ & 42.666 $(\pm32.837)$ & \textbf{231.000} $(\pm7.630)$ & 68.280 $(\pm26.746)$ & 223.300 $(\pm16.958)$ \\
Chronic Kidney & 1.230 $(\pm1.101)$ & 98.700 $(\pm3.234)$ & 3.675 $(\pm3.503)$ & \textbf{99.200} $(\pm3.327)$ & 3.201 $(\pm2.730)$ & \textbf{99.200} $(\pm3.327)$ & 4.746 $(\pm7.048)$ & 98.500 $(\pm3.408)$ \\
Early Stage Diabetes & 1.302 $(\pm1.032)$ & \textbf{153.400} $(\pm3.565)$ & 1.811 $(\pm1.275)$ & \textbf{153.400} $(\pm3.565)$ & 3.230 $(\pm1.231)$ & \textbf{153.400} $(\pm3.565)$ & 4.536 $(\pm1.904)$ & \textbf{153.400} $(\pm3.565)$ \\
Echocardiogram & 0.162 $(\pm0.180)$ & \textbf{18.667} $(\pm4.743)$ & 0.233 $(\pm0.263)$ & \textbf{18.667} $(\pm4.743)$ & 0.664 $(\pm0.552)$ & \textbf{18.667} $(\pm4.743)$ & 0.654 $(\pm0.563)$ & \textbf{18.667} $(\pm4.743)$ \\
Fertility & 0.073 $(\pm0.089)$ & \textbf{34.300} $(\pm1.947)$ & 0.101 $(\pm0.141)$ & \textbf{34.300} $(\pm1.947)$ & 0.252 $(\pm0.118)$ & \textbf{34.300} $(\pm1.947)$ & 0.342 $(\pm0.164)$ & \textbf{34.300} $(\pm1.947)$ \\
Heart Disease & 14.438 $(\pm19.564)$ & \textbf{28.200} $(\pm4.709)$ & 18.029 $(\pm29.863)$ & 26.700 $(\pm4.423)$ & 9.782 $(\pm12.683)$ & 26.100 $(\pm3.178)$ & 15.690 $(\pm16.870)$ & 27.700 $(\pm4.832)$ \\
Heart Failure & 10.971 $(\pm17.315)$ & \textbf{62.100} $(\pm5.859)$ & 11.723 $(\pm19.361)$ & 61.800 $(\pm5.996)$ & 11.219 $(\pm11.942)$ & 60.200 $(\pm4.467)$ & 16.533 $(\pm14.937)$ & 61.500 $(\pm6.329)$ \\
Hepatitis & 0.667 $(\pm1.357)$ & \textbf{42.200} $(\pm3.393)$ & 0.869 $(\pm1.834)$ & \textbf{42.200} $(\pm3.393)$ & 0.653 $(\pm0.551)$ & \textbf{42.200} $(\pm3.393)$ & 0.703 $(\pm0.428)$ & \textbf{42.200} $(\pm3.393)$ \\
Indian Liver Patient & 14.701 $(\pm15.843)$ & 82.300 $(\pm8.220)$ & 21.831 $(\pm21.734)$ & \textbf{84.000} $(\pm7.454)$ & 27.622 $(\pm31.831)$ & 83.800 $(\pm7.361)$ & 23.099 $(\pm28.159)$ & 76.200 $(\pm15.817)$ \\
Parkinsons & 25.639 $(\pm34.000)$ & \textbf{82.700} $(\pm3.268)$ & 15.436 $(\pm23.398)$ & 82.500 $(\pm3.308)$ & 30.355 $(\pm32.283)$ & \textbf{82.700} $(\pm3.268)$ & 15.611 $(\pm15.215)$ & \textbf{82.700} $(\pm3.268)$ \\
Pima Indians Diabetes & 22.204 $(\pm33.173)$ & \textbf{91.200} $(\pm5.329)$ & 26.598 $(\pm20.140)$ & 90.900 $(\pm6.173)$ & 21.091 $(\pm23.323)$ & 91.100 $(\pm5.322)$ & 54.652 $(\pm30.207)$ & 89.500 $(\pm6.704)$ \\
Thoracic Surgery & 29.033 $(\pm28.946)$ & \textbf{46.200} $(\pm5.095)$ & 21.901 $(\pm14.735)$ & 45.600 $(\pm4.812)$ & 26.106 $(\pm27.370)$ & 45.444 $(\pm5.223)$ & 32.679 $(\pm20.418)$ & \textbf{46.200} $(\pm5.095)$ \\
\end{xltabular}
\normalsize

\subsection{Computational Experiments}

In this paper, we preprocess features differently according to each algorithm's requirements. For RSCRULES, IREP, PRISM, IDS, and RIPPER, we discretize numerical features into ten equal bins. For BRS, we convert features into a binary format. For OPDT and BSCCART, while continuous numerical features can be used without transformation, categorical features must be one-hot encoded. Regarding rule complexity, we limit the length of rule conditions to two for RIPPER and BRS among the ruleset extracting models and for all structure-constrained models, as shown in Tables \ref{table:OPDT_train_details} and \ref{table:OPDT_test_details}. Other algorithms, having no such rule length parameter, can generate rules with more than two conditions. To provide another structure-constrained method alongside BSCCART, we develop a simple rule structure-constrained (RSC) algorithm based on Rules Extraction System (RULES) \citep{pham1995rules}, which we call RSCRULES. Finally, we impose a 10-minute time limit on OPDT runtime.

For experiments, we create 10 random data splits through shuffling. For each split, we compute performance metrics and report their averages across the splits. Additionally, since RIPPER and BRS are non-deterministic algorithms, we run them with multiple random seeds on each split: 50 different seeds for BRS and 10 different seeds for RIPPER. The computational details are provided in Table \ref{table:OPDT_train_test_details} in the Appendix. We compare the induced rules based on their precision, coverage, and volume-impurity. Precision is calculated as the ratio between the number of samples correctly classified by a rule and the total number of samples covered by the rule. Coverage is computed as the ratio between the number of samples covered by the rule and the total number of samples in the dataset. VI is computed as the volume (the number of samples covered by the rule) minus $w$ times the number of misclassified samples, with weight $w = 10$. We use test data to evaluate how well the induced rule captures a common pattern in unseen data.
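
For illustration, with hypothetical numbers that are not taken from the experiments, suppose a rule covers 50 of 200 test samples and classifies 47 of the covered samples correctly. With $w = 10$, the three metrics are
\begin{align*}
\text{Precision} &= \frac{47}{50} = 0.94, &
\text{Coverage} &= \frac{50}{200} = 0.25, &
\text{VI} &= 50 - 10 \cdot (50 - 47) = 20 .
\end{align*}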

\footnotesize
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{xltabular}{1.0\textwidth}{@{} >{\hsize=1.2\hsize\centering\arraybackslash}X >{\hsize=0.6\hsize\centering\arraybackslash}X | >{\hsize=0.8\hsize\centering\arraybackslash}X >{\hsize=0.8\hsize\centering\arraybackslash}X >{\hsize=0.85\hsize\centering\arraybackslash}X | >{\hsize=0.55\hsize\centering\arraybackslash}X  >{\hsize=0.55\hsize\centering\arraybackslash}X >{\hsize=0.55\hsize\centering\arraybackslash}X >{\hsize=0.55\hsize\centering\arraybackslash}X  >{\hsize=0.55\hsize\centering\arraybackslash}X  @{}}
\caption{OPDT Performance Compared to Benchmark Methods on Training Datasets (80\%). Time is the total runtime in seconds. \textbf{Bold} indicates ranking in top 4 methods, and \underline{underlined} represents the best performance across all methods.} \label{table:OPDT_train_details} \\
  \toprule
  & & \multicolumn{8}{c}{Training Data (80\%)} \\
  \cmidrule(lr){3-10}
  & & \multicolumn{3}{c}{Structure-constrained Models} & \multicolumn{5}{c}{Ruleset Extracting Models} \\
  \cmidrule(lr){3-5} \cmidrule(lr){6-10} 
  \multicolumn{2}{c}{Dataset} & OPDT & BSCCART & RSCRULES & BRS & IDS & IREP & PRISM & Ripper \\
  \endfirsthead

  \multicolumn{10}{c}{Table \thetable{} continued from previous page} \\
  \toprule
  & & \multicolumn{8}{c}{Training Data (80\%)} \\
  \cmidrule(lr){3-10}
  & & \multicolumn{3}{c}{Structure-constrained Models} & \multicolumn{5}{c}{Ruleset Extracting Models} \\
  \cmidrule(lr){3-5} \cmidrule(lr){6-10} 
  \multicolumn{2}{c}{Dataset} & OPDT & BSCCART & RSCRULES & BRS & IDS & IREP & PRISM & Ripper \\
  \midrule
  \endhead

  \midrule
  \multicolumn{10}{r}{Continued on next page} \\
  \endfoot

  \bottomrule
  \endlastfoot
  \midrule
  \multirow{2}{*}{\parbox{\linewidth}{\centering Australian}}                 & \text{VI}     & \textbf{\underline{135.60}}  & \textbf{130.50} & 103.40  & \textbf{116.72}  & 36.30 & 103.40 & 44.20 & \textbf{114.38}       \\
                                                                              & \text{Time}    & 602.00 & 0.25  & 0.04 & 0.36  & 0.22 & 0.08 & 3.03 & 2.01       \\
  \multirow{2}{*}{\parbox{\linewidth}{\centering German}}                     & \text{VI}     & \textbf{\underline{60.00}}  & \textbf{49.80} & \textbf{57.80}  & 16.27  & 19.20 & -32.20 & 20.10 & \textbf{47.62}       \\
                                                                              & \text{Time}    & 749.82 & 0.20  & 0.10 & 0.53  & 0.25 & 0.09 & 7.20 & 0.17       \\
  \multirow{2}{*}{\parbox{\linewidth}{\centering Blood Transfusion}}          & \text{VI}     & \textbf{\underline{51.20}}  & \textbf{45.50} & \textbf{24.00}  & -583.40  & 12.20 & 10.72 & 15.10 & \textbf{18.17}       \\
                                                                              & \text{Time}    & 90.70 & 0.13  & 0.02 & 0.23  & 0.23 & 0.04 & 0.53 & 0.07       \\
  \multirow{2}{*}{\parbox{\linewidth}{\centering Breast Cancer}}              & \text{VI}     & \textbf{\underline{230.00}}  & \textbf{183.50} & \textbf{124.60}  & 114.00  & 111.70 & 50.51 & \textbf{114.30} & 50.50       \\
                                                                              & \text{Time}    & 790.97 & 7.25  & 0.06 & 0.59  & 0.22 & 0.39 & 2.95 & 0.49       \\
  \multirow{2}{*}{\parbox{\linewidth}{\centering Chronic Kidney}}             & \text{VI}     & \textbf{\underline{99.20}}  & \textbf{98.10} & 52.90  & 46.31  & 49.60 & \textbf{79.10} & 52.90 & \textbf{84.00}       \\
                                                                              & \text{Time}    & 6.60 & 0.24  & 0.02 & 0.33  & 0.19 & 0.13 & 0.29 & 0.15       \\
  \multirow{2}{*}{\parbox{\linewidth}{\centering Early Stage Diabetes}}       & \text{VI}     & \textbf{\underline{153.40}}  & \textbf{\underline{153.40}} & 121.20  & \textbf{151.83}  & 119.30 & 118.16 & 90.60 & \textbf{151.75}       \\
                                                                              & \text{Time}    & 3.71 & 0.03  & 0.02 & 0.54  & 0.21 & 0.04 & 0.63 & 0.08       \\
  \multirow{2}{*}{\parbox{\linewidth}{\centering Echocardiogram}}             & \text{VI}     & \textbf{\underline{18.10}}  & \textbf{14.50} & \textbf{9.90}  & -8.13  & \textbf{7.50} & 4.31 & 7.10 & 5.74       \\
                                                                              & \text{Time}    & 0.80 & 0.07  & 0.01 & 0.24  & 0.18 & 0.05 & 0.18 & 0.07       \\
  \multirow{2}{*}{\parbox{\linewidth}{\centering Fertility}}                  & \text{VI}     & \textbf{\underline{34.30}}  & \textbf{\underline{34.30}} & 17.40  & \textbf{18.71}  & \textbf{20.30} & 16.19 & 15.00 & 17.18       \\
                                                                              & \text{Time}    & 0.25 & 0.02  & 0.01 & 0.27  & 0.19 & 0.03 & 0.13 & 0.04       \\
  \multirow{2}{*}{\parbox{\linewidth}{\centering Heart Disease}}              & \text{VI}     & \textbf{\underline{29.20}}  & \textbf{16.80} & \textbf{20.50}  & 15.01  & \textbf{15.10} & -5.89 & 13.30 & 13.59       \\
                                                                              & \text{Time}    & 524.81 & 0.14  & 0.03 & 0.36  & 0.20 & 0.06 & 1.01 & 0.10       \\
  \multirow{2}{*}{\parbox{\linewidth}{\centering Heart Failure}}              & \text{VI}     & \textbf{\underline{62.10}}  & \textbf{29.60} & 16.30  & -41.97  & 19.40 & 17.58 & \textbf{20.70} & \textbf{21.41}       \\
                                                                              & \text{Time}    & 729.32 & 0.36  & 0.03 & 0.30  & 0.19 & 0.07 & 0.82 & 0.11       \\
  \multirow{2}{*}{\parbox{\linewidth}{\centering Hepatitis}}                  & \text{VI}     & \textbf{\underline{42.20}}  & \textbf{40.10} & 17.50  & \textbf{29.31}  & 25.00 & 26.33 & 21.80 & \textbf{28.00}       \\
                                                                              & \text{Time}    & 0.82 & 0.10  & 0.01 & 0.27  & 0.19 & 0.06 & 0.15 & 0.07       \\
  \multirow{2}{*}{\parbox{\linewidth}{\centering Indian Liver Patient}}       & \text{VI}     & \textbf{\underline{82.50}}  & \textbf{54.50} & 19.60  & -790.93  & 17.40 & \textbf{42.62} & 18.10 & \textbf{46.09}       \\
                                                                              & \text{Time}    & 999.70 & 0.61  & 0.04 & 0.27  & 0.21 & 0.09 & 2.98 & 0.14       \\
  \multirow{2}{*}{\parbox{\linewidth}{\centering Parkinsons}}                 & \text{VI}     & \textbf{\underline{82.00}}  & \textbf{70.20} & \textbf{28.50}  & \textbf{52.25}  & 25.40 & 17.56 & 26.70 & 18.16       \\
                                                                              & \text{Time}    & 863.11 & 1.77  & 0.03 & 0.37  & 0.19 & 0.35 & 1.56 & 0.41       \\
  \multirow{2}{*}{\parbox{\linewidth}{\centering Pima Indians Diabetes}}      & \text{VI}     & \textbf{\underline{89.10}}  & \textbf{36.20} & \textbf{37.40}  & -920.73  & 21.50 & 22.69 & 22.80 & \textbf{26.88}       \\
                                                                              & \text{Time}    & 1047.67 & 0.76  & 0.07 & 0.35  & 0.22 & 0.08 & 3.35 & 0.13       \\
  \multirow{2}{*}{\parbox{\linewidth}{\centering Thoracic Surgery}}           & \text{VI}     & \textbf{\underline{46.20}}  & \textbf{20.20} & \textbf{31.70}  & -21.07  & 14.40 & -2.14 & 16.90 & \textbf{26.36}       \\
                                                                              & \text{Time}    & 133.21 & 0.10  & 0.02 & 2.13  & 0.21 & 0.05 & 1.70 & 2.16       \\
\end{xltabular}
\normalsize

First, Table \ref{table:OPDT_train_details} shows that OPDT attains the highest training VI on all 15 datasets, outperforming even the ruleset extracting models that have more flexibility in rule length. Second, among the structure-constrained heuristic methods we developed, BSCCART outperforms RSCRULES on 11 of the 15 datasets. Moreover, BSCCART achieves the second-best performance in finding high-quality rules among all methods while complying with structure constraints given by domain experts. Among the ruleset extracting models, RIPPER performs best, ranking in the top 4 on 10 of the 15 training datasets.

Regarding runtime, Table \ref{table:OPDT_train_test_details} in the Appendix reports the total runtime, which includes warmstart generation, model initialization, and up to 10 minutes of solver time; the total can therefore exceed 600 seconds. In other words, a runtime below 600 seconds indicates that the problem was solved to optimality. Several datasets were solved within the time limit, showing that OPDT is applicable for finding an optimal rule in moderate-sized datasets. Furthermore, as shown in Table \ref{table:OPDT_solution_time}, even for datasets that do not reach optimality within the 10-minute limit, the best solutions are typically found within roughly the first 2 minutes. For the remaining datasets, OPDT requires more than 10 minutes to prove optimality, which could raise concerns about its scalability to real-world applications. However, in this paper, we use only the basic numerical-categorical group separation without considering domain-specific knowledge for feature grouping. As shown in Table \ref{table:OPDT_by_BSC}, if feature grouping is actively utilized, computational efficiency can be enhanced significantly.

We further evaluate the induced rules from each model on the test datasets. Table \ref{table:OPDT_test_details} shows that OPDT achieves the best VI score on 6 datasets, twice as many as BSCCART or RIPPER, each of which achieves the best VI on 3 datasets. Furthermore, OPDT ranks within the top 4 on all but 5 datasets. BSCCART and RIPPER maintain their strong performance on the test datasets, consistent with their performance on the training datasets, showing the second-best results after OPDT.

However, it is worth noting that OPDT's strong training performance does not always transfer to the test set. For example, even though OPDT achieves a higher VI than BSCCART on training data for the German and Heart Disease datasets, it does not achieve a better VI on test data. Specifically, in the German dataset, OPDT achieves precision $0.936$ and coverage $0.225$ on training data, but precision $0.883$ and coverage $0.231$ on test data, resulting in a test VI of $-7.9$. In contrast, BSCCART achieves precision $0.920$ and coverage $0.315$ on training data, and precision $0.896$ and coverage $0.326$ on test data, resulting in a test VI of $-1.8$. A similar pattern is observed in the Heart Disease dataset: OPDT achieves precision $0.979$ and coverage $0.165$ on training data, but precision $0.800$ and coverage $0.180$ on test data, yielding a test VI of $-9.2$, whereas BSCCART achieves precision $0.926$ and coverage $0.269$ on training data and precision $0.944$ and coverage $0.250$ on test data, yielding a test VI of $5.0$. In both cases, OPDT identifies a narrower region (lower coverage than BSCCART on training data) to achieve higher precision, but this precision advantage is not stable on the test set. This suggests that the weight parameter $w$ may force OPDT toward a narrow region that fails to represent the whole dataset, implying that $w = 10$ may not be an appropriate value for representing the dominant pattern in these datasets.

As another case, in the Thoracic Surgery dataset, BSCCART, IDS, and PRISM identify rules with very narrow coverage ($0.064$, $0.038$, and $0.045$, respectively) but near-perfect precision ($0.991$, $1.000$, and $1.000$) on training data, whereas OPDT identifies a substantially broader rule with coverage $0.264$ and precision $0.949$, resulting in a higher training VI of $46.2$ compared to $20.2$, $14.4$, and $16.9$ for BSCCART, IDS, and PRISM, respectively. However, on test data, OPDT's broader rule yields a test VI of $-9.8$, whereas BSCCART, IDS, and PRISM achieve better test VI values of $-3.8$, $-0.1$, and $4.5$, respectively. Even though OPDT discovers a more representative pattern, with broader coverage ($0.247$) and reasonable precision ($0.855$), than BSCCART (precision $0.521$, coverage $0.066$), IDS (precision $0.933$, coverage $0.031$), and PRISM (precision $1.000$, coverage $0.048$), the weight parameter $w$ penalizes OPDT's broader rule. This further supports the observation that the choice of $w$ is critical to finding a representative pattern in data. Additionally, unlike OPDT, IDS and PRISM are not constrained to rules with at most two conditions, which potentially gives them an advantage in searching for patterns with more than two conditions.
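
One way to read the negative test VI values discussed above is through a break-even precision. The derivation below is a sketch that assumes VI takes the count form $N_{\mathrm{fired}} - w\,(N_{\mathrm{fired}} - N_{\mathrm{TP}})$ for a single rule on a single split; this form is an assumption for illustration, consistent with the VI column of Table \ref{table:OPDT_sensitivity_w}. Here $p$ and $c > 0$ denote the precision and coverage of that rule and $N$ the number of evaluated samples, so that $N_{\mathrm{fired}} = N c$ and $N_{\mathrm{TP}} = p\,N c$; these symbols are introduced only for this example.
\[
\mathrm{VI} = N c\,\bigl(1 - w\,(1 - p)\bigr), \qquad
\mathrm{VI} > 0 \iff p > 1 - \frac{1}{w}.
\]
With $w = 10$, the break-even precision is $0.9$: under this assumption, a rule receives a negative VI on any split where its precision falls below $0.9$, and the size of the penalty scales with its coverage. This is in line with OPDT's mean test precision of $0.883$ on German and $0.800$ on Heart Disease being accompanied by negative mean test VI, although the identity holds per split and need not hold exactly for metrics averaged over splits.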

Finally, Table \ref{table:OPDT_sensitivity_w} presents the sensitivity analysis of the weight parameter $w$ on the UCI Hepatitis dataset. As $w$ decreases from 10 to 2, OPDT exhibits a monotone precision--coverage trade-off, yielding higher coverage at the cost of reduced precision, confirming that $w$ functions as an adjustable precision requirement. Runtime remains stable across all $w$ values, and test performance closely tracks training performance, indicating strong out-of-sample performance.
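
The values of $w$ at which the selected rule changes in Table \ref{table:OPDT_sensitivity_w} can be checked with a short calculation. This is a sketch under the assumption that the training VI equals $N_{\mathrm{fired}} - w\,(N_{\mathrm{fired}} - N_{\mathrm{TP}})$, an assumed form that matches every row of the table. The labels $A$ to $D$ are introduced only for this example and refer to the four distinct training outcomes (true positives out of fired samples) in the table: $38/38$, $45/46$, $48/50$, and $53/59$.
\[
\mathrm{VI}_A = 38, \qquad
\mathrm{VI}_B = 46 - w, \qquad
\mathrm{VI}_C = 50 - 2w, \qquad
\mathrm{VI}_D = 59 - 6w.
\]
Under this assumption, $B$ is at least as good as $A$ when $46 - w \geq 38$, that is, $w \leq 8$; $C$ is at least as good as $B$ when $50 - 2w \geq 46 - w$, that is, $w \leq 4$; and $D$ is at least as good as $C$ when $59 - 6w \geq 50 - 2w$, that is, $w \leq 2.25$. These thresholds agree with the changes observed between $w = 9$ and $w = 8$, between $w = 5$ and $w = 4$, and between $w = 3$ and $w = 2$. The comparison covers only these four outcomes, so it is a consistency check rather than a proof that no other rule scores higher.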

\footnotesize
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{xltabular}{0.75\textwidth}{@{} >{\hsize=0.4\hsize\centering\arraybackslash}X | >{\hsize=0.35\hsize\centering\arraybackslash}X >{\hsize=0.35\hsize\centering\arraybackslash}X @{}}
\caption{OPDT Runtime and Time to Solution (in seconds) without BSC. Runtime represents the total computational time including optimality proof, while Time to Solution represents when the final improvement to the best solution was made.}
  \label{table:OPDT_solution_time} \\
  \toprule
  \multicolumn{1}{c}{Dataset} & Runtime & Time to Solution \\
  \endfirsthead

  \multicolumn{3}{c}{Table \thetable{} continued from previous page} \\
  \toprule
  \multicolumn{1}{c}{Dataset} & Runtime & Time to Solution \\
  \midrule
  \endhead

  \midrule
  \multicolumn{3}{r}{Continued on next page} \\
  \endfoot

  \bottomrule
  \endlastfoot
  \midrule
    Australian                & \hphantom{0}139.363 $(\pm45.693)$       & 11.625 $(\pm14.371)$      \\
    German                    & \hphantom{00}930.113 $(\pm382.954)$     & 104.823 $(\pm141.195)$    \\
    Blood Transfusion         & \hphantom{00}776.287 $(\pm534.019)$     & 2.771 $(\pm2.539)$        \\
    Breast Cancer             & \hphantom{0}1096.570 $(\pm234.938)$     & 125.454 $(\pm300.844)$    \\
    Chronic Kidney            & \hphantom{00}916.377 $(\pm292.314)$     & 119.097 $(\pm169.979)$    \\
    Early Stage Diabetes      & \hphantom{00}112.018 $(\pm311.394)$     & 2.764 $(\pm2.390)$        \\
    Echocardiogram            & \hphantom{00}109.457 $(\pm321.932)$     & 0.176 $(\pm0.487)$        \\
    Fertility                 & \hphantom{00}4.687 $(\pm1.996)$         & 0.213 $(\pm0.448)$        \\
    Heart Disease             & \hphantom{00}904.345 $(\pm380.713)$     & 41.906 $(\pm85.544)$      \\
    Heart Failure             & \hphantom{0}187.302 $(\pm42.433)$       & 17.734 $(\pm27.696)$      \\
    Hepatitis                 & \hphantom{00}8.601 $(\pm3.545)$         & 0.560 $(\pm1.071)$        \\
    Indian Liver Patient      & \hphantom{0}1161.727 $(\pm196.850)$     & 10.659 $(\pm12.828)$      \\
    Parkinsons                & \hphantom{00}899.464 $(\pm338.915)$     & 14.228 $(\pm29.471)$      \\
    Pima Indians Diabetes     & 600.030 $(\pm0.007)$                    & 140.321 $(\pm182.179)$    \\
    Thoracic Surgery          & \hphantom{0}585.058 $(\pm47.319)$       & 28.280 $(\pm32.392)$      \\
\end{xltabular}
\normalsize

\footnotesize
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{xltabular}{1.0\textwidth}{@{} >{\hsize=1.8\hsize\centering\arraybackslash}X | >{\hsize=0.8\hsize\centering\arraybackslash}X >{\hsize=0.8\hsize\centering\arraybackslash}X >{\hsize=0.85\hsize\centering\arraybackslash}X | >{\hsize=0.55\hsize\centering\arraybackslash}X  >{\hsize=0.55\hsize\centering\arraybackslash}X >{\hsize=0.55\hsize\centering\arraybackslash}X >{\hsize=0.55\hsize\centering\arraybackslash}X  >{\hsize=0.55\hsize\centering\arraybackslash}X  @{}}
\caption{OPDT Performance Compared to Benchmark Methods on Testing Datasets (20\%). \textbf{Bold} indicates ranking in top 4 methods, and \underline{underlined} represents the best performance across all methods.} \label{table:OPDT_test_details} \\
  \toprule
  & \multicolumn{8}{c}{Test Data (20\%)} \\
  \cmidrule(lr){2-9}
  & \multicolumn{3}{c}{Structure-constrained Models} & \multicolumn{5}{c}{Ruleset Extracting Models} \\
  \cmidrule(lr){2-4} \cmidrule(lr){5-9} 
  \multicolumn{1}{c}{Dataset} & OPDT & BSCCART & RSCRULES & BRS & IDS & IREP & PRISM & Ripper \\
  \endfirsthead

  \multicolumn{9}{c}{Table \thetable{} continued from previous page} \\
  \toprule
  & \multicolumn{8}{c}{Test Data (20\%)} \\
  \cmidrule(lr){2-9}
  & \multicolumn{3}{c}{Structure-constrained Models} & \multicolumn{5}{c}{Ruleset Extracting Models} \\
  \cmidrule(lr){2-4} \cmidrule(lr){5-9} 
  \multicolumn{1}{c}{Dataset} & OPDT & BSCCART & RSCRULES & BRS & IDS & IREP & PRISM & Ripper \\
  \midrule
  \endhead

  \midrule
  \multicolumn{9}{r}{Continued on next page} \\
  \endfoot

  \bottomrule
  \endlastfoot
  \midrule

Australian              & \textbf{\underline{22.80}}  & \textbf{21.70} & \textbf{20.60}  & \textbf{20.93}  & 5.00 & -41.30 & 8.20 & -8.26       \\
German                  & -7.90  & -1.80 & -4.20  & \textbf{-0.71}  & \textbf{1.80} & -37.32 & \textbf{0.80} & \textbf{\underline{10.52}}       \\
Blood Transfusion       & \textbf{-6.50}  & \textbf{-6.60} & \textbf{\underline{-6.30}}  & -299.24  & -13.30 & -28.31 & \textbf{-10.10} & -31.68       \\
Breast Cancer           & \textbf{\underline{48.20}}  & \textbf{38.60} & \textbf{25.60}  & 21.22  & 21.30 & 10.04 & \textbf{25.30} & 10.62       \\
Chronic Kidney          & \textbf{16.80}  & \textbf{17.60} & 13.10  & 9.33  & 10.10 & \textbf{19.83} & 13.10 & \textbf{\underline{21.80}}       \\
Early Stage Diabetes    & \textbf{38.10}  & \textbf{\underline{39.60}} & 31.80  & \textbf{38.31}  & 27.20 & 26.44 & 24.00 & \textbf{38.74}       \\
Echocardiogram          & \textbf{-2.20}  & \textbf{-1.60} & -2.80  & -8.98  & \textbf{-2.20} & \textbf{\underline{-0.10}} & -2.30 & \textbf{-1.33}       \\
Fertility               & \textbf{\underline{6.60}}  & \textbf{\underline{6.60}} & 0.10  & 1.03  & 3.10 & \textbf{3.58} & \textbf{3.80} & -1.94       \\
Heart Disease           & -9.20  & \textbf{\underline{5.00}} & \textbf{-0.90}  & \textbf{0.64}  & \textbf{2.10} & -19.49 & 0.20 & -4.52       \\
Heart Failure           & \textbf{\underline{5.10}}  & \textbf{-1.20} & -5.20  & -19.25  & \textbf{3.80} & -8.55 & \textbf{1.00} & -6.73       \\
Hepatitis               & 0.50  & -2.20 & -5.40  & \textbf{3.43}  & 1.10 & \textbf{4.55} & \textbf{2.70} & \textbf{\underline{5.66}}       \\
Indian Liver Patient    & \textbf{\underline{12.70}}  & 2.30 & \textbf{4.20}  & -198.45  & 3.50 & \textbf{7.05} & \textbf{4.40} & -7.92       \\
Parkinsons              & 2.80  & -5.80 & \textbf{8.60}  & \textbf{\underline{9.06}}  & 0.00 & \textbf{4.06} & 3.20 & \textbf{4.60}       \\
Pima Indians Diabetes   & \textbf{\underline{4.90}}  & -8.40 & \textbf{\underline{4.90}}  & -237.90  & \textbf{3.70} & -9.60 & \textbf{1.60} & -8.96       \\
Thoracic Surgery        & -9.80  & \textbf{-3.80} & \textbf{-4.90}  & -27.67  & \textbf{-0.10} & -12.72 & \textbf{\underline{4.50}} & -8.17       \\
\end{xltabular}
\normalsize

\footnotesize
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{xltabular}{1.0\textwidth}{@{} >{\hsize=0.6\hsize\centering\arraybackslash}X | >{\hsize=0.9\hsize\centering\arraybackslash}X >{\hsize=0.9\hsize\centering\arraybackslash}X | >{\hsize=0.9\hsize\centering\arraybackslash}X >{\hsize=0.9\hsize\centering\arraybackslash}X | >{\hsize=0.6\hsize\centering\arraybackslash}X >{\hsize=0.8\hsize\centering\arraybackslash}X @{}}
\caption{Sensitivity analysis of the weight $w$ on the UCI Hepatitis dataset (Training: 64, Test: 16). For precision, the counts in parentheses denote the number of true positives out of the total fired samples (e.g., $0.978\ (45/46)$ denotes 45 true positives out of 46 fired samples). For coverage, the counts in parentheses denote the total fired samples out of the total samples in the dataset (e.g., $0.719\ (46/64)$ denotes 46 fired samples out of 64 total samples). VI denotes the VI value in training data and Runtime denotes the total computational time in seconds.}
\label{table:OPDT_sensitivity_w} \\
  \toprule
  & \multicolumn{2}{c|}{Training Data (80\%)} & \multicolumn{2}{c|}{Test Data (20\%)} & & \\
  \cmidrule(lr){2-3} \cmidrule(lr){4-5}
  \multicolumn{1}{c|}{$w$} & Precision & Coverage & Precision & Coverage & VI & Runtime \\
  \endfirsthead

  \multicolumn{7}{c}{Table \thetable{} continued from previous page} \\
  \toprule
  & \multicolumn{2}{c|}{Training Data (80\%)} & \multicolumn{2}{c|}{Test Data (20\%)} & & \\
  \cmidrule(lr){2-3} \cmidrule(lr){4-5}
  \multicolumn{1}{c|}{$w$} & Precision & Coverage & Precision & Coverage & VI & Runtime \\
  \midrule
  \endhead

  \midrule
  \multicolumn{7}{r}{Continued on next page} \\
  \endfoot

  \bottomrule
  \endlastfoot
  \midrule
  10 & 1.000\ (38/38) & 0.594\ (38/64) & 1.000\ (9/9)   & 0.562\ (9/16)  & 38 & 13.83 \\
   9 & 1.000\ (38/38) & 0.594\ (38/64) & 1.000\ (9/9)   & 0.562\ (9/16)  & 38 & 11.45 \\
   8 & 0.978\ (45/46) & 0.719\ (46/64) & 0.923\ (12/13) & 0.812\ (13/16) & 38 &  9.62 \\
   7 & 0.978\ (45/46) & 0.719\ (46/64) & 0.923\ (12/13) & 0.812\ (13/16) & 39 &  8.92 \\
   6 & 0.978\ (45/46) & 0.719\ (46/64) & 0.923\ (12/13) & 0.812\ (13/16) & 40 & 10.25 \\
   5 & 0.978\ (45/46) & 0.719\ (46/64) & 0.923\ (12/13) & 0.812\ (13/16) & 41 &  8.34 \\
   4 & 0.960\ (48/50) & 0.781\ (50/64) & 0.923\ (12/13) & 0.812\ (13/16) & 42 &  9.45 \\
   3 & 0.960\ (48/50) & 0.781\ (50/64) & 0.929\ (13/14) & 0.875\ (14/16) & 44 &  7.06 \\
   2 & 0.898\ (53/59) & 0.922\ (59/64) & 0.875\ (14/16) & 1.000\ (16/16) & 47 &  7.74 \\
\end{xltabular}
\normalsize

\section{Conclusion and Future Work}
In this paper, we introduce an RBML model that offers high interpretability and compliance for rule discovery. Our proposed model, OPDT, is based on an optimal decision tree, incorporating customizable rule structures based on a decision maker's needs. OPDT effectively identifies and describes significant patterns within a reasonable time while adhering to desired rule structures and precision requirements. To enhance computational efficiency, we develop three key techniques: Branching Structure Constraints (BSC), warmstart initialization with BSCCART, and branching priority orders. Our experimental results across 15 UCI datasets demonstrate that OPDT consistently finds high-quality rules that comply with given structure constraints. Moreover, the systematic feature decomposition approach not only improves computational efficiency but also provides valuable guidance toward optimal solutions through strategic feature grouping. This work contributes to rule-based machine learning by providing a flexible framework for discovering interpretable rules that balance solution quality, structure compliance, and computational time.

Despite the contributions presented in this paper, several important challenges remain for future work. First, even though OPDT can find an optimal pattern given a specific rule structure and weight, it requires a decision maker to specify these parameters appropriately to identify a representative pattern that generalizes beyond the training data. Second, while OPDT is designed to extract a single optimal pattern, a natural extension is to discover a set of rules that represents the data optimally, which remains an important direction for future work. Finally, as is inherent to optimization-based methods, OPDT still has limited scalability on large-scale data. Given the speedup achieved through BSC with different feature groupings, further research is needed to develop systematic approaches for utilizing the BSC framework with prior knowledge tailored to specific datasets.

\bibliography{main}
\bibliographystyle{tmlr}

\clearpage

\section*{Appendices} %
\appendix

\section{OPDT and Benchmark Methods Performance}
\footnotesize
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{0.95}
\begin{xltabular}{1.0\textwidth}{@{} >{\hsize=0.8\hsize\centering\arraybackslash}X | >{\hsize=0.4\hsize\centering\arraybackslash}X | >{\hsize=0.5\hsize\centering\arraybackslash}X >{\hsize=0.5\hsize\centering\arraybackslash}X >{\hsize=0.5\hsize\centering\arraybackslash}X | >{\hsize=0.5\hsize\centering\arraybackslash}X >{\hsize=0.5\hsize\centering\arraybackslash}X >{\hsize=0.5\hsize\centering\arraybackslash}X  @{}}
\caption{OPDT and Benchmark Methods Performance on Training Data (80\%) and Test Data (20\%): mean ($\pm$ standard deviation) across 10 random data splits. Runtime is in seconds.} \label{table:OPDT_train_test_details} \\
  \toprule
  & \multicolumn{4}{c}{Training Data (80\%)} & \multicolumn{3}{c}{Test Data (20\%)} \\
  \cmidrule(lr){2-5} \cmidrule(lr){6-8} 
  Dataset & Runtime & Precision & Coverage & VI & Precision & Coverage & VI \\
  \endfirsthead

  \multicolumn{8}{c}{Table \thetable{} continued from previous page} \\
  \toprule
  & \multicolumn{4}{c}{Training Data (80\%)} & \multicolumn{3}{c}{Test Data (20\%)} \\
  \cmidrule(lr){2-5} \cmidrule(lr){6-8}  
  Dataset & Runtime & Precision & Coverage & VI & Precision & Coverage & VI \\
  \midrule
  \endhead

  \midrule
  \multicolumn{8}{r}{Continued on next page} \\
  \endfoot

  \bottomrule
  \endlastfoot
  \midrule
  \multicolumn{1}{l|}{Australian} & & & & & & & \\
  \multicolumn{1}{c|}{- OPDT}       & 602.00 $(\pm482.907)$ &     0.963 $(\pm0.007)$      &     0.411 $(\pm0.018)$      &     135.600 $(\pm12.492)$   &     0.940 $(\pm0.028)$      &     0.426 $(\pm0.039)$      &     22.800 $(\pm15.091)$    \\
  \multicolumn{1}{c|}{- BSCCART}    & 0.25 $(\pm0.175)$     &     0.957 $(\pm0.007)$      &     0.442 $(\pm0.015)$      &     130.500 $(\pm14.254)$   &     0.937 $(\pm0.023)$      &     0.456 $(\pm0.040)$      &     21.700 $(\pm13.598)$    \\
  \multicolumn{1}{c|}{- RSCRULES}   & 0.04 $(\pm0.011)$     &     0.943 $(\pm0.006)$      &     0.462 $(\pm0.006)$      &     103.400 $(\pm14.010)$   &     0.933 $(\pm0.023)$      &     0.478 $(\pm0.026)$      &     20.600 $(\pm14.010)$    \\
  \multicolumn{1}{c|}{- BRS}        & 0.36 $(\pm0.009)$     &     0.950 $(\pm0.006)$      &     0.453 $(\pm0.020)$      &     116.720 $(\pm13.667)$   &     0.934 $(\pm0.022)$      &     0.470 $(\pm0.029)$      &     20.930 $(\pm13.623)$    \\
  \multicolumn{1}{c|}{- IDS}        & 0.22 $(\pm0.005)$     &     1.000 $(\pm0.000)$      &     0.070 $(\pm0.028)$      &     36.300 $(\pm14.863)$    &     0.955 $(\pm0.065)$      &     0.069 $(\pm0.032)$      &     5.000 $(\pm5.598)$      \\
  \multicolumn{1}{c|}{- IREP}       & 0.08 $(\pm0.001)$     &     0.943 $(\pm0.006)$      &     0.462 $(\pm0.006)$      &     103.400 $(\pm14.010)$   &     0.811 $(\pm0.043)$      &     0.315 $(\pm0.089)$      &     -41.303 $(\pm27.431)$   \\
  \multicolumn{1}{c|}{- PRISM}      & 3.03 $(\pm0.133)$     &     1.000 $(\pm0.000)$      &     0.085 $(\pm0.007)$      &     44.200 $(\pm3.584)$     &     0.963 $(\pm0.078)$      &     0.085 $(\pm0.025)$      &     8.200 $(\pm8.135)$      \\
  \multicolumn{1}{c|}{- Ripper}     & 2.01 $(\pm5.924)$     &     0.952 $(\pm0.006)$      &     0.428 $(\pm0.014)$      &     114.384 $(\pm13.649)$   &     0.826 $(\pm0.063)$      &     0.106 $(\pm0.051)$      &     -8.260 $(\pm3.715)$     \\
  \multicolumn{1}{l|}{German} & & & & & & & \\
  \multicolumn{1}{c|}{- OPDT}       & 749.82 $(\pm370.714)$ &     0.936 $(\pm0.011)$      &     0.225 $(\pm0.085)$      &     60.000 $(\pm12.481)$    &     0.883 $(\pm0.020)$      &     0.231 $(\pm0.098)$      &     -7.900 $(\pm10.609)$    \\
  \multicolumn{1}{c|}{- BSCCART}    & 0.20 $(\pm0.126)$     &     0.920 $(\pm0.008)$      &     0.315 $(\pm0.023)$      &     49.800 $(\pm18.920)$    &     0.896 $(\pm0.027)$      &     0.326 $(\pm0.043)$      &     -1.800 $(\pm16.498)$    \\
  \multicolumn{1}{c|}{- RSCRULES}   & 0.10 $(\pm0.006)$     &     0.938 $(\pm0.012)$      &     0.205 $(\pm0.087)$      &     57.800 $(\pm13.815)$    &     0.895 $(\pm0.032)$      &     0.199 $(\pm0.094)$      &     -4.200 $(\pm10.696)$    \\
  \multicolumn{1}{c|}{- BRS}        & 0.53 $(\pm0.064)$     &     0.908 $(\pm0.010)$      &     0.349 $(\pm0.017)$      &     16.266 $(\pm33.298)$    &     0.901 $(\pm0.029)$      &     0.361 $(\pm0.030)$      &     -0.712 $(\pm20.104)$    \\
  \multicolumn{1}{c|}{- IDS}        & 0.25 $(\pm0.009)$     &     1.000 $(\pm0.000)$      &     0.024 $(\pm0.004)$      &     19.200 $(\pm3.360)$     &     0.950 $(\pm0.112)$      &     0.019 $(\pm0.007)$      &     1.800 $(\pm4.077)$      \\
  \multicolumn{1}{c|}{- IREP}       & 0.09 $(\pm0.002)$     &     0.867 $(\pm0.014)$      &     0.209 $(\pm0.043)$      &     -32.202 $(\pm13.316)$   &     0.801 $(\pm0.023)$      &     0.266 $(\pm0.051)$      &     -37.322 $(\pm13.754)$   \\
  \multicolumn{1}{c|}{- PRISM}      & 7.20 $(\pm0.112)$     &     1.000 $(\pm0.000)$      &     0.025 $(\pm0.003)$      &     20.100 $(\pm2.079)$     &     0.949 $(\pm0.086)$      &     0.024 $(\pm0.013)$      &     0.800 $(\pm4.803)$      \\
  \multicolumn{1}{c|}{- Ripper}     & 0.17 $(\pm0.001)$     &     0.925 $(\pm0.010)$      &     0.256 $(\pm0.045)$      &     47.618 $(\pm17.287)$    &     0.917 $(\pm0.017)$      &     0.207 $(\pm0.095)$      &     10.524 $(\pm8.096)$     \\
  \multicolumn{1}{l|}{Blood Transfusion} & & & & & & & \\
  \multicolumn{1}{c|}{- OPDT}       & 90.70 $(\pm11.522)$   &     0.948 $(\pm0.014)$      &     0.220 $(\pm0.055)$      &     51.200 $(\pm10.443)$    &     0.892 $(\pm0.017)$      &     0.220 $(\pm0.072)$      &     -6.500 $(\pm14.222)$    \\
  \multicolumn{1}{c|}{- BSCCART}    & 0.13 $(\pm0.004)$     &     0.955 $(\pm0.023)$      &     0.195 $(\pm0.085)$      &     45.500 $(\pm10.886)$    &     0.889 $(\pm0.025)$      &     0.195 $(\pm0.095)$      &     -6.600 $(\pm15.079)$    \\
  \multicolumn{1}{c|}{- RSCRULES}   & 0.02 $(\pm0.006)$     &     0.963 $(\pm0.021)$      &     0.088 $(\pm0.044)$      &     24.000 $(\pm5.416)$     &     0.841 $(\pm0.125)$      &     0.087 $(\pm0.053)$      &     -6.300 $(\pm11.046)$    \\
  \multicolumn{1}{c|}{- BRS}        & 0.23 $(\pm0.042)$     &     0.775 $(\pm0.043)$      &     0.922 $(\pm0.142)$      &     -583.396 $(\pm185.327)$ &     0.771 $(\pm0.044)$      &     0.921 $(\pm0.151)$      &     -299.236 $(\pm94.622)$  \\
  \multicolumn{1}{c|}{- IDS}        & 0.23 $(\pm0.006)$     &     1.000 $(\pm0.000)$      &     0.024 $(\pm0.007)$      &     12.200 $(\pm3.676)$     &     0.552 $(\pm0.337)$      &     0.023 $(\pm0.012)$      &     -13.300 $(\pm11.605)$   \\
  \multicolumn{1}{c|}{- IREP}       & 0.04 $(\pm0.001)$     &     0.931 $(\pm0.025)$      &     0.085 $(\pm0.012)$      &     10.720 $(\pm11.740)$    &     0.785 $(\pm0.040)$      &     0.124 $(\pm0.018)$      &     -28.306 $(\pm10.268)$   \\
  \multicolumn{1}{c|}{- PRISM}      & 0.53 $(\pm0.032)$     &     0.993 $(\pm0.015)$      &     0.034 $(\pm0.013)$      &     15.100 $(\pm3.900)$     &     0.707 $(\pm0.263)$      &     0.036 $(\pm0.019)$      &     -10.100 $(\pm4.977)$    \\
  \multicolumn{1}{c|}{- Ripper}     & 0.07 $(\pm0.002)$     &     0.959 $(\pm0.022)$      &     0.071 $(\pm0.023)$      &     18.166 $(\pm8.388)$     &     0.687 $(\pm0.039)$      &     0.112 $(\pm0.054)$      &     -31.678 $(\pm14.771)$   \\
  \multicolumn{1}{l|}{Breast Cancer} & & & & & & & \\
  \multicolumn{1}{c|}{- OPDT}       & 790.97 $(\pm255.965)$ &     0.992 $(\pm0.004)$      &     0.549 $(\pm0.013)$      &     230.000 $(\pm8.055)$    &     0.979 $(\pm0.018)$      &     0.537 $(\pm0.021)$      &     48.200 $(\pm10.185)$    \\
  \multicolumn{1}{c|}{- BSCCART}    & 7.25 $(\pm0.105)$     &     0.966 $(\pm0.014)$      &     0.617 $(\pm0.028)$      &     183.500 $(\pm29.289)$   &     0.958 $(\pm0.026)$      &     0.602 $(\pm0.041)$      &     38.600 $(\pm16.460)$    \\
  \multicolumn{1}{c|}{- RSCRULES}   & 0.06 $(\pm0.003)$     &     0.971 $(\pm0.007)$      &     0.386 $(\pm0.029)$      &     124.600 $(\pm20.359)$   &     0.962 $(\pm0.024)$      &     0.365 $(\pm0.043)$      &     25.600 $(\pm9.698)$     \\
  \multicolumn{1}{c|}{- BRS}        & 0.59 $(\pm0.092)$     &     1.000 $(\pm0.000)$      &     0.251 $(\pm0.025)$      &     113.998 $(\pm11.555)$   &     0.982 $(\pm0.022)$      &     0.235 $(\pm0.045)$      &     21.216 $(\pm5.170)$     \\
  \multicolumn{1}{c|}{- IDS}        & 0.22 $(\pm0.027)$     &     1.000 $(\pm0.000)$      &     0.245 $(\pm0.037)$      &     111.700 $(\pm16.747)$   &     0.982 $(\pm0.031)$      &     0.231 $(\pm0.048)$      &     21.300 $(\pm7.973)$     \\
  \multicolumn{1}{c|}{- IREP}       & 0.39 $(\pm0.001)$     &     1.000 $(\pm0.000)$      &     0.111 $(\pm0.007)$      &     50.514 $(\pm3.355)$     &     0.975 $(\pm0.030)$      &     0.110 $(\pm0.014)$      &     10.041 $(\pm4.096)$     \\
  \multicolumn{1}{c|}{- PRISM}      & 2.95 $(\pm0.169)$     &     1.000 $(\pm0.000)$      &     0.251 $(\pm0.024)$      &     114.300 $(\pm10.740)$   &     0.995 $(\pm0.017)$      &     0.239 $(\pm0.045)$      &     25.300 $(\pm4.762)$     \\
  \multicolumn{1}{c|}{- Ripper}     & 0.49 $(\pm0.003)$     &     1.000 $(\pm0.000)$      &     0.111 $(\pm0.007)$      &     50.502 $(\pm3.309)$     &     0.993 $(\pm0.008)$      &     0.100 $(\pm0.023)$      &     10.618 $(\pm3.265)$     \\
  \multicolumn{1}{l|}{Chronic Kidney} & & & & & & & \\
  \multicolumn{1}{c|}{- OPDT}       & 6.60 $(\pm10.083)$    &     1.000 $(\pm0.000)$      &     0.577 $(\pm0.019)$      &     99.200 $(\pm3.327)$     &     0.966 $(\pm0.021)$      &     0.600 $(\pm0.041)$      &     16.800 $(\pm4.237)$     \\
  \multicolumn{1}{c|}{- BSCCART}    & 0.24 $(\pm0.023)$     &     0.996 $(\pm0.005)$      &     0.594 $(\pm0.006)$      &     98.100 $(\pm4.433)$     &     0.967 $(\pm0.030)$      &     0.619 $(\pm0.025)$      &     17.600 $(\pm7.947)$     \\
  \multicolumn{1}{c|}{- RSCRULES}   & 0.02 $(\pm0.000)$     &     1.000 $(\pm0.000)$      &     0.308 $(\pm0.009)$      &     52.900 $(\pm1.524)$     &     1.000 $(\pm0.000)$      &     0.305 $(\pm0.035)$      &     13.100 $(\pm1.524)$     \\
  \multicolumn{1}{c|}{- BRS}        & 0.33 $(\pm0.066)$     &     0.999 $(\pm0.002)$      &     0.276 $(\pm0.014)$      &     46.305 $(\pm0.964)$     &     0.982 $(\pm0.008)$      &     0.276 $(\pm0.009)$      &     9.332 $(\pm0.879)$      \\
  \multicolumn{1}{c|}{- IDS}        & 0.19 $(\pm0.003)$     &     1.000 $(\pm0.000)$      &     0.288 $(\pm0.017)$      &     49.600 $(\pm2.914)$     &     0.983 $(\pm0.053)$      &     0.281 $(\pm0.040)$      &     10.100 $(\pm6.590)$     \\
  \multicolumn{1}{c|}{- IREP}       & 0.13 $(\pm0.001)$     &     0.984 $(\pm0.003)$      &     0.551 $(\pm0.012)$      &     79.096 $(\pm3.668)$     &     0.996 $(\pm0.004)$      &     0.488 $(\pm0.046)$      &     19.834 $(\pm1.621)$     \\
  \multicolumn{1}{c|}{- PRISM}      & 0.29 $(\pm0.135)$     &     1.000 $(\pm0.000)$      &     0.308 $(\pm0.009)$      &     52.900 $(\pm1.524)$     &     1.000 $(\pm0.000)$      &     0.305 $(\pm0.035)$      &     13.100 $(\pm1.524)$     \\
  \multicolumn{1}{c|}{- Ripper}     & 0.15 $(\pm0.001)$     &     0.987 $(\pm0.002)$      &     0.563 $(\pm0.006)$      &     84.004 $(\pm2.406)$     &     0.998 $(\pm0.002)$      &     0.520 $(\pm0.044)$      &     21.796 $(\pm1.456)$     \\
  \multicolumn{1}{l|}{Early Stage Diabetes} & & & & & & & \\
  \multicolumn{1}{c|}{- OPDT}       & 3.71 $(\pm3.314)$     &     0.999 $(\pm0.004)$      &     0.374 $(\pm0.012)$      &     153.400 $(\pm3.565)$    &     0.996 $(\pm0.012)$      &     0.386 $(\pm0.045)$      &     38.100 $(\pm3.725)$     \\
  \multicolumn{1}{c|}{- BSCCART}    & 0.03 $(\pm0.000)$     &     1.000 $(\pm0.000)$      &     0.369 $(\pm0.009)$      &     153.400 $(\pm3.565)$    &     1.000 $(\pm0.000)$      &     0.381 $(\pm0.034)$      &     39.600 $(\pm3.565)$     \\
  \multicolumn{1}{c|}{- RSCRULES}   & 0.02 $(\pm0.009)$     &     0.965 $(\pm0.006)$      &     0.448 $(\pm0.008)$      &     121.200 $(\pm10.163)$   &     0.968 $(\pm0.023)$      &     0.450 $(\pm0.032)$      &     31.800 $(\pm10.163)$    \\
  \multicolumn{1}{c|}{- BRS}        & 0.54 $(\pm0.086)$     &     1.000 $(\pm0.000)$      &     0.365 $(\pm0.009)$      &     151.830 $(\pm3.751)$    &     1.000 $(\pm0.001)$      &     0.369 $(\pm0.033)$      &     38.314 $(\pm3.537)$     \\
  \multicolumn{1}{c|}{- IDS}        & 0.21 $(\pm0.003)$     &     1.000 $(\pm0.000)$      &     0.287 $(\pm0.067)$      &     119.300 $(\pm27.745)$   &     0.995 $(\pm0.015)$      &     0.271 $(\pm0.082)$      &     27.200 $(\pm9.953)$     \\
  \multicolumn{1}{c|}{- IREP}       & 0.04 $(\pm0.001)$     &     0.966 $(\pm0.006)$      &     0.437 $(\pm0.019)$      &     118.158 $(\pm12.140)$   &     0.970 $(\pm0.017)$      &     0.381 $(\pm0.063)$      &     26.444 $(\pm7.322)$     \\
  \multicolumn{1}{c|}{- PRISM}      & 0.63 $(\pm0.059)$     &     1.000 $(\pm0.000)$      &     0.218 $(\pm0.056)$      &     90.600 $(\pm23.201)$    &     1.000 $(\pm0.000)$      &     0.231 $(\pm0.062)$      &     24.000 $(\pm6.446)$     \\
  \multicolumn{1}{c|}{- Ripper}     & 0.08 $(\pm0.002)$     &     0.999 $(\pm0.001)$      &     0.371 $(\pm0.010)$      &     151.746 $(\pm3.258)$    &     0.999 $(\pm0.001)$      &     0.375 $(\pm0.033)$      &     38.740 $(\pm3.551)$     \\
  \multicolumn{1}{l|}{Echocardiogram} & & & & & & & \\
  \multicolumn{1}{c|}{- OPDT}       & 0.80 $(\pm0.618)$     &     0.988 $(\pm0.020)$      &     0.430 $(\pm0.094)$      &     18.100 $(\pm4.818)$     &     0.762 $(\pm0.317)$      &     0.369 $(\pm0.191)$      &     -2.200 $(\pm6.477)$     \\
  \multicolumn{1}{c|}{- BSCCART}    & 0.07 $(\pm0.013)$     &     0.961 $(\pm0.036)$      &     0.500 $(\pm0.040)$      &     14.500 $(\pm7.990)$     &     0.886 $(\pm0.136)$      &     0.492 $(\pm0.126)$      &     -1.600 $(\pm9.477)$     \\
  \multicolumn{1}{c|}{- RSCRULES}   & 0.01 $(\pm0.001)$     &     0.986 $(\pm0.022)$      &     0.263 $(\pm0.163)$      &     9.900 $(\pm4.886)$      &     0.598 $(\pm0.439)$      &     0.246 $(\pm0.248)$      &     -2.800 $(\pm5.138)$     \\
  \multicolumn{1}{c|}{- BRS}        & 0.24 $(\pm0.035)$     &     0.937 $(\pm0.041)$      &     0.429 $(\pm0.118)$      &     -8.132 $(\pm15.676)$    &     0.782 $(\pm0.110)$      &     0.428 $(\pm0.174)$      &     -8.982 $(\pm3.871)$     \\
  \multicolumn{1}{c|}{- IDS}        & 0.18 $(\pm0.003)$     &     1.000 $(\pm0.000)$      &     0.153 $(\pm0.036)$      &     7.500 $(\pm1.780)$      &     0.831 $(\pm0.171)$      &     0.215 $(\pm0.149)$      &     -2.200 $(\pm3.994)$     \\
  \multicolumn{1}{c|}{- IREP}       & 0.05 $(\pm0.001)$     &     0.949 $(\pm0.032)$      &     0.267 $(\pm0.093)$      &     4.307 $(\pm5.100)$      &     0.935 $(\pm0.087)$      &     0.229 $(\pm0.095)$      &     -0.102 $(\pm3.682)$     \\
  \multicolumn{1}{c|}{- PRISM}      & 0.18 $(\pm0.016)$     &     1.000 $(\pm0.000)$      &     0.145 $(\pm0.095)$      &     7.100 $(\pm4.654)$      &     0.500 $(\pm0.456)$      &     0.131 $(\pm0.195)$      &     -2.300 $(\pm4.644)$     \\
  \multicolumn{1}{c|}{- Ripper}     & 0.07 $(\pm0.001)$     &     0.966 $(\pm0.027)$      &     0.223 $(\pm0.102)$      &     5.740 $(\pm4.041)$      &     0.894 $(\pm0.065)$      &     0.166 $(\pm0.089)$      &     -1.325 $(\pm2.444)$     \\
  \multicolumn{1}{l|}{Fertility} & & & & & & & \\
  \multicolumn{1}{c|}{- OPDT}       & 0.25 $(\pm0.132)$     &     1.000 $(\pm0.000)$      &     0.429 $(\pm0.024)$      &     34.300 $(\pm1.947)$     &     0.971 $(\pm0.049)$      &     0.480 $(\pm0.098)$      &     6.600 $(\pm4.477)$      \\
  \multicolumn{1}{c|}{- BSCCART}    & 0.02 $(\pm0.003)$     &     1.000 $(\pm0.000)$      &     0.429 $(\pm0.024)$      &     34.300 $(\pm1.947)$     &     0.971 $(\pm0.049)$      &     0.480 $(\pm0.098)$      &     6.600 $(\pm4.477)$      \\
  \multicolumn{1}{c|}{- RSCRULES}   & 0.01 $(\pm0.001)$     &     0.969 $(\pm0.024)$      &     0.342 $(\pm0.095)$      &     17.400 $(\pm3.950)$     &     0.912 $(\pm0.072)$      &     0.405 $(\pm0.119)$      &     0.100 $(\pm5.021)$      \\
  \multicolumn{1}{c|}{- BRS}        & 0.27 $(\pm0.025)$     &     0.997 $(\pm0.006)$      &     0.259 $(\pm0.047)$      &     18.713 $(\pm4.000)$     &     0.910 $(\pm0.075)$      &     0.293 $(\pm0.074)$      &     1.025 $(\pm3.001)$      \\
  \multicolumn{1}{c|}{- IDS}        & 0.19 $(\pm0.003)$     &     1.000 $(\pm0.000)$      &     0.254 $(\pm0.034)$      &     20.300 $(\pm2.710)$     &     0.965 $(\pm0.082)$      &     0.255 $(\pm0.096)$      &     3.100 $(\pm3.604)$      \\
  \multicolumn{1}{c|}{- IREP}       & 0.03 $(\pm0.001)$     &     0.958 $(\pm0.019)$      &     0.375 $(\pm0.053)$      &     16.192 $(\pm4.380)$     &     0.954 $(\pm0.024)$      &     0.371 $(\pm0.046)$      &     3.577 $(\pm1.860)$      \\
  \multicolumn{1}{c|}{- PRISM}      & 0.13 $(\pm0.029)$     &     1.000 $(\pm0.000)$      &     0.188 $(\pm0.037)$      &     15.000 $(\pm2.981)$     &     0.976 $(\pm0.052)$      &     0.290 $(\pm0.115)$      &     3.800 $(\pm3.327)$      \\
  \multicolumn{1}{c|}{- Ripper}     & 0.04 $(\pm0.001)$     &     0.969 $(\pm0.009)$      &     0.333 $(\pm0.044)$      &     17.182 $(\pm3.462)$     &     0.876 $(\pm0.052)$      &     0.206 $(\pm0.100)$      &     -1.939 $(\pm2.228)$     \\
  \multicolumn{1}{l|}{Heart Disease} & & & & & & & \\
  \multicolumn{1}{c|}{- OPDT}       & 524.81 $(\pm534.168)$ &     0.979 $(\pm0.021)$      &     0.165 $(\pm0.047)$      &     29.200 $(\pm3.155)$     &     0.800 $(\pm0.063)$      &     0.180 $(\pm0.079)$      &     -9.200 $(\pm5.160)$     \\
  \multicolumn{1}{c|}{- BSCCART}    & 0.14 $(\pm0.019)$     &     0.926 $(\pm0.011)$      &     0.269 $(\pm0.021)$      &     16.800 $(\pm6.339)$     &     0.944 $(\pm0.074)$      &     0.250 $(\pm0.058)$      &     5.000 $(\pm13.199)$     \\
  \multicolumn{1}{c|}{- RSCRULES}   & 0.03 $(\pm0.005)$     &     0.969 $(\pm0.020)$      &     0.137 $(\pm0.052)$      &     20.500 $(\pm3.171)$     &     0.929 $(\pm0.097)$      &     0.118 $(\pm0.064)$      &     -0.900 $(\pm7.937)$     \\
  \multicolumn{1}{c|}{- BRS}        & 0.36 $(\pm0.049)$     &     0.949 $(\pm0.037)$      &     0.210 $(\pm0.079)$      &     15.010 $(\pm17.335)$    &     0.928 $(\pm0.089)$      &     0.200 $(\pm0.081)$      &     0.636 $(\pm11.493)$     \\
  \multicolumn{1}{c|}{- IDS}        & 0.20 $(\pm0.002)$     &     1.000 $(\pm0.000)$      &     0.064 $(\pm0.015)$      &     15.100 $(\pm3.510)$     &     0.955 $(\pm0.096)$      &     0.068 $(\pm0.020)$      &     2.100 $(\pm4.175)$      \\
  \multicolumn{1}{c|}{- IREP}       & 0.06 $(\pm0.001)$     &     0.903 $(\pm0.020)$      &     0.224 $(\pm0.039)$      &     -5.888 $(\pm11.122)$    &     0.827 $(\pm0.043)$      &     0.342 $(\pm0.025)$      &     -19.485 $(\pm10.521)$   \\
  \multicolumn{1}{c|}{- PRISM}      & 1.01 $(\pm0.040)$     &     1.000 $(\pm0.000)$      &     0.056 $(\pm0.008)$      &     13.300 $(\pm1.889)$     &     0.922 $(\pm0.130)$      &     0.053 $(\pm0.022)$      &     0.200 $(\pm4.442)$      \\
  \multicolumn{1}{c|}{- Ripper}     & 0.10 $(\pm0.001)$     &     0.944 $(\pm0.010)$      &     0.163 $(\pm0.046)$      &     13.592 $(\pm5.094)$     &     0.852 $(\pm0.042)$      &     0.208 $(\pm0.037)$      &     -4.524 $(\pm5.392)$     \\
  \multicolumn{1}{l|}{Heart Failure} & & & & & & & \\
  \multicolumn{1}{c|}{- OPDT}       & 729.32 $(\pm581.965)$ &     0.982 $(\pm0.017)$      &     0.331 $(\pm0.088)$      &     62.100 $(\pm5.859)$     &     0.935 $(\pm0.045)$      &     0.335 $(\pm0.124)$      &     5.100 $(\pm9.036)$      \\
  \multicolumn{1}{c|}{- BSCCART}    & 0.36 $(\pm0.008)$     &     0.927 $(\pm0.012)$      &     0.513 $(\pm0.178)$      &     29.600 $(\pm11.937)$    &     0.884 $(\pm0.053)$      &     0.530 $(\pm0.178)$      &     -1.200 $(\pm12.136)$    \\
  \multicolumn{1}{c|}{- RSCRULES}   & 0.03 $(\pm0.004)$     &     0.975 $(\pm0.023)$      &     0.097 $(\pm0.028)$      &     16.300 $(\pm4.296)$     &     0.817 $(\pm0.114)$      &     0.097 $(\pm0.064)$      &     -5.200 $(\pm4.211)$     \\
  \multicolumn{1}{c|}{- BRS}        & 0.30 $(\pm0.062)$     &     0.953 $(\pm0.096)$      &     0.238 $(\pm0.318)$      &     -41.974 $(\pm126.072)$  &     0.782 $(\pm0.106)$      &     0.239 $(\pm0.333)$      &     -19.254 $(\pm30.173)$   \\
  \multicolumn{1}{c|}{- IDS}        & 0.19 $(\pm0.003)$     &     1.000 $(\pm0.000)$      &     0.081 $(\pm0.008)$      &     19.400 $(\pm2.011)$     &     0.975 $(\pm0.079)$      &     0.080 $(\pm0.029)$      &     3.800 $(\pm3.853)$      \\
  \multicolumn{1}{c|}{- IREP}       & 0.07 $(\pm0.002)$     &     0.975 $(\pm0.018)$      &     0.103 $(\pm0.012)$      &     17.578 $(\pm5.046)$     &     0.823 $(\pm0.064)$      &     0.123 $(\pm0.037)$      &     -8.545 $(\pm8.333)$     \\
  \multicolumn{1}{c|}{- PRISM}      & 0.82 $(\pm0.078)$     &     1.000 $(\pm0.000)$      &     0.087 $(\pm0.012)$      &     20.700 $(\pm2.908)$     &     0.955 $(\pm0.096)$      &     0.067 $(\pm0.039)$      &     1.000 $(\pm4.944)$      \\
  \multicolumn{1}{c|}{- Ripper}     & 0.11 $(\pm0.002)$     &     0.994 $(\pm0.012)$      &     0.096 $(\pm0.006)$      &     21.412 $(\pm3.611)$     &     0.792 $(\pm0.063)$      &     0.088 $(\pm0.025)$      &     -6.730 $(\pm3.177)$     \\
  \multicolumn{1}{l|}{Hepatitis} & & & & & & & \\
  \multicolumn{1}{c|}{- OPDT}       & 0.82 $(\pm0.727)$     &     0.998 $(\pm0.006)$      &     0.675 $(\pm0.059)$      &     42.200 $(\pm3.393)$     &     0.907 $(\pm0.082)$      &     0.656 $(\pm0.107)$      &     0.500 $(\pm7.792)$      \\
  \multicolumn{1}{c|}{- BSCCART}    & 0.10 $(\pm0.028)$     &     0.986 $(\pm0.013)$      &     0.736 $(\pm0.053)$      &     40.100 $(\pm3.755)$     &     0.885 $(\pm0.069)$      &     0.737 $(\pm0.121)$      &     -2.200 $(\pm8.804)$     \\
  \multicolumn{1}{c|}{- RSCRULES}   & 0.01 $(\pm0.002)$     &     0.947 $(\pm0.024)$      &     0.664 $(\pm0.198)$      &     17.500 $(\pm4.223)$     &     0.873 $(\pm0.091)$      &     0.662 $(\pm0.238)$      &     -5.400 $(\pm8.449)$     \\
  \multicolumn{1}{c|}{- BRS}        & 0.27 $(\pm0.022)$     &     1.000 $(\pm0.000)$      &     0.458 $(\pm0.187)$      &     29.314 $(\pm11.940)$    &     0.962 $(\pm0.049)$      &     0.446 $(\pm0.186)$      &     3.429 $(\pm3.645)$      \\
  \multicolumn{1}{c|}{- IDS}        & 0.19 $(\pm0.027)$     &     1.000 $(\pm0.000)$      &     0.391 $(\pm0.085)$      &     25.000 $(\pm5.416)$     &     0.915 $(\pm0.133)$      &     0.381 $(\pm0.133)$      &     1.100 $(\pm7.310)$      \\
  \multicolumn{1}{c|}{- IREP}       & 0.06 $(\pm0.001)$     &     0.967 $(\pm0.014)$      &     0.646 $(\pm0.054)$      &     26.330 $(\pm4.966)$     &     0.970 $(\pm0.020)$      &     0.475 $(\pm0.143)$      &     4.554 $(\pm2.280)$      \\
  \multicolumn{1}{c|}{- PRISM}      & 0.15 $(\pm0.017)$     &     1.000 $(\pm0.000)$      &     0.341 $(\pm0.133)$      &     21.800 $(\pm8.496)$     &     0.881 $(\pm0.312)$      &     0.294 $(\pm0.213)$      &     2.700 $(\pm1.889)$      \\
  \multicolumn{1}{c|}{- Ripper}     & 0.07 $(\pm0.001)$     &     0.973 $(\pm0.012)$      &     0.615 $(\pm0.045)$      &     28.004 $(\pm4.501)$     &     0.988 $(\pm0.024)$      &     0.430 $(\pm0.174)$      &     5.664 $(\pm2.723)$      \\
  \multicolumn{1}{l|}{Indian Liver Patient} & & & & & & & \\
  \multicolumn{1}{c|}{- OPDT}       & 999.70 $(\pm224.994)$ &     0.981 $(\pm0.008)$      &     0.221 $(\pm0.024)$      &     82.500 $(\pm10.069)$    &     0.948 $(\pm0.043)$      &     0.239 $(\pm0.028)$      &     12.700 $(\pm10.843)$    \\
  \multicolumn{1}{c|}{- BSCCART}    & 0.61 $(\pm0.015)$     &     0.942 $(\pm0.022)$      &     0.297 $(\pm0.040)$      &     54.500 $(\pm24.968)$    &     0.907 $(\pm0.036)$      &     0.313 $(\pm0.056)$      &     2.300 $(\pm13.679)$     \\
  \multicolumn{1}{c|}{- RSCRULES}   & 0.04 $(\pm0.015)$     &     0.996 $(\pm0.013)$      &     0.044 $(\pm0.006)$      &     19.600 $(\pm2.716)$     &     0.986 $(\pm0.045)$      &     0.045 $(\pm0.013)$      &     4.200 $(\pm2.860)$      \\
  \multicolumn{1}{c|}{- BRS}        & 0.27 $(\pm0.043)$     &     0.731 $(\pm0.031)$      &     0.902 $(\pm0.088)$      &     -790.932 $(\pm96.273)$  &     0.730 $(\pm0.032)$      &     0.896 $(\pm0.087)$      &     -198.454 $(\pm24.312)$  \\
  \multicolumn{1}{c|}{- IDS}        & 0.21 $(\pm0.003)$     &     1.000 $(\pm0.000)$      &     0.038 $(\pm0.005)$      &     17.400 $(\pm2.366)$     &     0.975 $(\pm0.079)$      &     0.039 $(\pm0.017)$      &     3.500 $(\pm3.866)$      \\
  \multicolumn{1}{c|}{- IREP}       & 0.09 $(\pm0.002)$     &     0.995 $(\pm0.005)$      &     0.104 $(\pm0.008)$      &     42.624 $(\pm7.690)$     &     0.959 $(\pm0.021)$      &     0.113 $(\pm0.021)$      &     7.045 $(\pm5.418)$      \\
  \multicolumn{1}{c|}{- PRISM}      & 2.98 $(\pm0.054)$     &     1.000 $(\pm0.000)$      &     0.039 $(\pm0.006)$      &     18.100 $(\pm2.923)$     &     1.000 $(\pm0.000)$      &     0.038 $(\pm0.022)$      &     4.400 $(\pm2.591)$      \\
  \multicolumn{1}{c|}{- Ripper}     & 0.14 $(\pm0.001)$     &     0.999 $(\pm0.001)$      &     0.101 $(\pm0.001)$      &     46.088 $(\pm0.820)$     &     0.775 $(\pm0.060)$      &     0.064 $(\pm0.014)$      &     -7.918 $(\pm3.601)$     \\
  \multicolumn{1}{l|}{Parkinsons} & & & & & & & \\
  \multicolumn{1}{c|}{- OPDT}       & 863.11 $(\pm340.952)$ &     0.999 $(\pm0.003)$      &     0.532 $(\pm0.032)$      &     82.000 $(\pm4.447)$     &     0.919 $(\pm0.052)$      &     0.508 $(\pm0.077)$      &     2.800 $(\pm9.682)$      \\
  \multicolumn{1}{c|}{- BSCCART}    & 1.77 $(\pm0.116)$     &     0.977 $(\pm0.018)$      &     0.597 $(\pm0.077)$      &     70.200 $(\pm10.633)$    &     0.884 $(\pm0.070)$      &     0.621 $(\pm0.109)$      &     -5.800 $(\pm18.207)$    \\
  \multicolumn{1}{c|}{- RSCRULES}   & 0.03 $(\pm0.007)$     &     0.976 $(\pm0.009)$      &     0.240 $(\pm0.014)$      &     28.500 $(\pm3.689)$     &     0.990 $(\pm0.032)$      &     0.246 $(\pm0.058)$      &     8.600 $(\pm3.777)$      \\
  \multicolumn{1}{c|}{- BRS}        & 0.37 $(\pm0.061)$     &     1.000 $(\pm0.000)$      &     0.335 $(\pm0.058)$      &     52.247 $(\pm9.034)$     &     0.978 $(\pm0.027)$      &     0.315 $(\pm0.073)$      &     9.064 $(\pm4.337)$      \\
  \multicolumn{1}{c|}{- IDS}        & 0.19 $(\pm0.004)$     &     1.000 $(\pm0.000)$      &     0.163 $(\pm0.033)$      &     25.400 $(\pm5.103)$     &     0.876 $(\pm0.166)$      &     0.154 $(\pm0.091)$      &     0.000 $(\pm4.397)$      \\
  \multicolumn{1}{c|}{- IREP}       & 0.35 $(\pm0.002)$     &     0.998 $(\pm0.003)$      &     0.116 $(\pm0.009)$      &     17.560 $(\pm1.554)$     &     0.983 $(\pm0.012)$      &     0.120 $(\pm0.018)$      &     4.056 $(\pm0.831)$      \\
  \multicolumn{1}{c|}{- PRISM}      & 1.56 $(\pm0.140)$     &     1.000 $(\pm0.000)$      &     0.171 $(\pm0.030)$      &     26.700 $(\pm4.668)$     &     0.954 $(\pm0.062)$      &     0.185 $(\pm0.067)$      &     3.200 $(\pm3.910)$      \\
  \multicolumn{1}{c|}{- Ripper}     & 0.41 $(\pm0.003)$     &     1.000 $(\pm0.000)$      &     0.117 $(\pm0.010)$      &     18.164 $(\pm1.515)$     &     1.000 $(\pm0.001)$      &     0.116 $(\pm0.018)$      &     4.599 $(\pm0.687)$      \\
  \multicolumn{1}{l|}{Pima Indians Diabetes} & & & & & & & \\
  \multicolumn{1}{c|}{- OPDT}       & 1047.67 $(\pm76.896)$ &     0.979 $(\pm0.012)$      &     0.187 $(\pm0.032)$      &     89.100 $(\pm6.590)$     &     0.925 $(\pm0.049)$      &     0.201 $(\pm0.047)$      &     4.900 $(\pm12.758)$     \\
  \multicolumn{1}{c|}{- BSCCART}    & 0.76 $(\pm0.010)$     &     0.922 $(\pm0.039)$      &     0.297 $(\pm0.105)$      &     36.200 $(\pm46.492)$    &     0.889 $(\pm0.038)$      &     0.309 $(\pm0.117)$      &     -8.400 $(\pm21.287)$    \\
  \multicolumn{1}{c|}{- RSCRULES}   & 0.07 $(\pm0.003)$     &     0.962 $(\pm0.016)$      &     0.105 $(\pm0.041)$      &     37.400 $(\pm8.897)$     &     0.953 $(\pm0.047)$      &     0.097 $(\pm0.055)$      &     4.900 $(\pm7.340)$      \\
  \multicolumn{1}{c|}{- BRS}        & 0.35 $(\pm0.025)$     &     0.727 $(\pm0.006)$      &     0.859 $(\pm0.022)$      &     -920.728 $(\pm36.483)$  &     0.722 $(\pm0.022)$      &     0.863 $(\pm0.037)$      &     -237.896 $(\pm36.960)$  \\
  \multicolumn{1}{c|}{- IDS}        & 0.22 $(\pm0.008)$     &     1.000 $(\pm0.000)$      &     0.035 $(\pm0.010)$      &     21.500 $(\pm6.042)$     &     0.963 $(\pm0.078)$      &     0.037 $(\pm0.019)$      &     3.700 $(\pm5.250)$      \\
  \multicolumn{1}{c|}{- IREP}       & 0.08 $(\pm0.001)$     &     0.938 $(\pm0.013)$      &     0.100 $(\pm0.005)$      &     22.686 $(\pm6.723)$     &     0.837 $(\pm0.026)$      &     0.093 $(\pm0.016)$      &     -9.601 $(\pm3.598)$     \\
  \multicolumn{1}{c|}{- PRISM}      & 3.35 $(\pm0.137)$     &     1.000 $(\pm0.000)$      &     0.037 $(\pm0.015)$      &     22.800 $(\pm8.942)$     &     0.924 $(\pm0.107)$      &     0.049 $(\pm0.021)$      &     1.600 $(\pm8.859)$      \\
  \multicolumn{1}{c|}{- Ripper}     & 0.13 $(\pm0.003)$     &     0.980 $(\pm0.015)$      &     0.064 $(\pm0.019)$      &     26.876 $(\pm5.161)$     &     0.802 $(\pm0.057)$      &     0.056 $(\pm0.020)$      &     -8.958 $(\pm4.647)$     \\
  \multicolumn{1}{l|}{Thoracic Surgery} & & & & & & & \\
  \multicolumn{1}{c|}{- OPDT}       & 133.21 $(\pm327.588)$ &     0.949 $(\pm0.012)$      &     0.264 $(\pm0.058)$      &     46.200 $(\pm5.095)$     &     0.855 $(\pm0.079)$      &     0.247 $(\pm0.037)$      &     -9.800 $(\pm16.430)$    \\
  \multicolumn{1}{c|}{- BSCCART}    & 0.10 $(\pm0.047)$     &     0.991 $(\pm0.015)$      &     0.064 $(\pm0.052)$      &     20.200 $(\pm15.483)$    &     0.521 $(\pm0.449)$      &     0.066 $(\pm0.054)$      &     -3.800 $(\pm4.392)$     \\
  \multicolumn{1}{c|}{- RSCRULES}   & 0.02 $(\pm0.006)$     &     0.937 $(\pm0.010)$      &     0.244 $(\pm0.084)$      &     31.700 $(\pm7.009)$     &     0.867 $(\pm0.059)$      &     0.214 $(\pm0.091)$      &     -4.900 $(\pm13.304)$    \\
  \multicolumn{1}{c|}{- BRS}        & 2.13 $(\pm5.802)$     &     0.898 $(\pm0.015)$      &     0.706 $(\pm0.127)$      &     -21.068 $(\pm45.358)$   &     0.863 $(\pm0.023)$      &     0.717 $(\pm0.143)$      &     -27.672 $(\pm12.292)$   \\
  \multicolumn{1}{c|}{- IDS}        & 0.21 $(\pm0.003)$     &     1.000 $(\pm0.000)$      &     0.038 $(\pm0.011)$      &     14.400 $(\pm4.169)$     &     0.933 $(\pm0.200)$      &     0.031 $(\pm0.022)$      &     -0.100 $(\pm8.950)$     \\
  \multicolumn{1}{c|}{- IREP}       & 0.05 $(\pm0.002)$     &     0.902 $(\pm0.014)$      &     0.322 $(\pm0.049)$      &     -2.136 $(\pm18.640)$    &     0.862 $(\pm0.016)$      &     0.445 $(\pm0.072)$      &     -12.724 $(\pm4.690)$    \\
  \multicolumn{1}{c|}{- PRISM}      & 1.70 $(\pm0.090)$     &     1.000 $(\pm0.000)$      &     0.045 $(\pm0.005)$      &     16.900 $(\pm1.912)$     &     1.000 $(\pm0.000)$      &     0.048 $(\pm0.020)$      &     4.500 $(\pm1.841)$      \\
  \multicolumn{1}{c|}{- Ripper}     & 2.16 $(\pm6.556)$     &     0.914 $(\pm0.023)$      &     0.250 $(\pm0.044)$      &     26.356 $(\pm10.267)$    &     0.835 $(\pm0.069)$      &     0.130 $(\pm0.050)$      &     -8.167 $(\pm5.930)$     \\
\end{xltabular}
\normalsize

\section{Feature Grouping Using ML-Based Feature Importance}

To demonstrate concretely that prior information extracted from an ML model can guide feature grouping, we use the German dataset, the largest among those in Table \ref{table:UCI_dataset}. Suppose that we need to find a pattern complying with the \texttt{\{categorical\}--\{numerical\}} structure. First, we extract feature importance from a LightGBM (LGBM) model \citep{ke2017lightgbm} trained with 1,000 boosting iterations. Figure \ref{figure:OPDT_LGBM} presents the feature importance scores under two complementary metrics: information gain and split count. Based on this prior information, we select the top six features under each metric, highlighted in red in Figure \ref{figure:OPDT_LGBM}, to define the two feature groups (categorical and numerical). Table \ref{table:OPDT_LGBM_improvement} shows the VI improvement over time under a runtime limit of 600 seconds. OPDT with LGBM reaches the optimal VI of $41.0$ in 5 seconds, whereas OPDT Only needs 23 seconds to reach the same VI. Feature grouping also reduces the total runtime to prove optimality from $446.4$ seconds to $90.8$ seconds, a roughly fivefold reduction. This experiment illustrates a heuristic approach in which ML-derived feature importance rankings define the feature groups for the BSC framework.
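
As a quick check on the reported reduction, the two speedups can be computed directly from the values in Table \ref{table:OPDT_LGBM_improvement}. The symbols $T_{\mathrm{prove}}$ (time to prove optimality) and $T_{\mathrm{find}}$ (time to first reach the optimal VI) are introduced here only for illustration and are not part of the OPDT formulation:
\[
\frac{T_{\mathrm{prove}}^{\mathrm{Only}}}{T_{\mathrm{prove}}^{\mathrm{LGBM}}} = \frac{446.4}{90.8} \approx 4.9,
\qquad
\frac{T_{\mathrm{find}}^{\mathrm{Only}}}{T_{\mathrm{find}}^{\mathrm{LGBM}}} = \frac{23}{5} = 4.6 .
\]
Because the step times are reported in whole seconds and the comparison covers a single dataset, both ratios should be read as indicative rather than as a general speedup guarantee.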

\begin{figure}[htbp]
\centering
\includegraphics[width=0.8\columnwidth]{figures/OPDT_04}
\caption{Feature grouping using LGBM feature importance (gain and split) on the German dataset. Features are ranked by information gain (top) and split count (bottom). The numerical and categorical features within the top six (highlighted in red) are selected for the \texttt{\{cat\}--\{num\}} rule.}
\label{figure:OPDT_LGBM}
\end{figure}

\footnotesize
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.0}
\begin{xltabular}{0.8\textwidth}{@{} >{\hsize=0.3\hsize\centering\arraybackslash}X | >{\hsize=0.5\hsize\centering\arraybackslash}X >{\hsize=0.5\hsize\centering\arraybackslash}X | >{\hsize=0.5\hsize\centering\arraybackslash}X >{\hsize=0.5\hsize\centering\arraybackslash}X @{}}
\caption{VI improvement over time on the German dataset for the \texttt{\{cat\}--\{num\}} rule. OPDT with LGBM denotes OPDT using feature groups selected by LightGBM feature importance, while OPDT Only denotes OPDT without feature grouping. Each numbered row shows a step at which VI improves and the corresponding elapsed time in seconds; the final row reports the optimal VI and the total time to prove optimality.} \label{table:OPDT_LGBM_improvement} \\
  \toprule
  & \multicolumn{2}{c|}{OPDT with LGBM} & \multicolumn{2}{c}{OPDT Only} \\
  \cmidrule(lr){2-3} \cmidrule(lr){4-5}
  \multicolumn{1}{c}{Step} & VI & Time (sec) & VI & Time (sec) \\
  \endfirsthead

  \multicolumn{5}{c}{Table \thetable{} continued from previous page} \\
  \toprule
  & \multicolumn{2}{c|}{OPDT with LGBM} & \multicolumn{2}{c}{OPDT Only} \\
  \cmidrule(lr){2-3} \cmidrule(lr){4-5}
  \multicolumn{1}{c}{Step} & VI & Time (sec) & VI & Time (sec) \\
  \midrule
  \endhead

  \midrule
  \multicolumn{5}{r}{Continued on next page} \\
  \endfoot

  \bottomrule
  \endlastfoot
  \midrule
  1 &  0.0 & \hphantom{0}0 &  0.0 & \hphantom{0}0 \\
  2 &  1.0 & \hphantom{0}1 &  2.0 & \hphantom{0}1 \\
  3 &  2.0 & \hphantom{0}1 &  5.0 & \hphantom{0}2 \\
  4 &  4.0 & \hphantom{0}2 & 14.0 & \hphantom{0}5 \\
  5 &  5.0 & \hphantom{0}2 & 20.0 & 21 \\
  6 & 12.0 & \hphantom{0}3 & 41.0 & 23 \\
  7 & 41.0 & \hphantom{0}5 & ---  & --- \\
  \midrule
  \textbf{Optimal} & \textbf{41.0} & \textbf{\hphantom{0}90.8} & \textbf{41.0} & \textbf{446.4} \\
\end{xltabular}
\normalsize

\end{document}