%\documentclass{uai2025} % for initial submission
\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    % \renewcommand{\bibsection}{\subsubsection*{References}}
    \renewcommand{\bibsection}{\subsubsection{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{mdframed}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{graphicx} % Required for inserting images
\usepackage[table]{xcolor}
\usepackage{microtype}
% \usepackage{cite}
\usepackage{enumitem}
\usepackage{wrapfig}
\usepackage{subcaption}
\usepackage{multirow}
\usepackage[algoruled, linesnumbered, noend]{algorithm2e}
\SetKwIF{IfNot}{ElseIfNot}{}{if not}{then}{else if not}{}{}
\newcommand\mycommfont[1]{\footnotesize\ttfamily\textcolor{green!50!black}{#1}}
\SetCommentSty{mycommfont}
\usepackage[misc]{ifsym}

\usepackage{float}
\floatstyle{plaintop}
\restylefloat{table}

\usepackage{siunitx}

\usepackage{pgf}





\input{macros}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros

\title{Symbiotic Local Search for Small Decision Tree Policies in MDPs}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Roman~Andriushchenko}
\author[1]{\href{mailto:<ceskam@fit.vut.cz>?Subject=Your UAI 2025 paper}{Milan~\v{C}e\v{s}ka}{}}
\author[2]{Debraj~Chakraborty}
\author[3]{Sebastian~Junges}
\author[2,4]{Jan~K\v{r}etínský}
\author[1]{Filip~Macák}
% Add affiliations after the authors
\affil[1]{%
    Brno University of Technology, Czechia
}
\affil[2]{%
    Masaryk University, Czechia
}
\affil[3]{%
    Radboud University Nijmegen, the Netherlands
}
\affil[4]{%
    Technical University of Munich, Germany
}
  
\begin{document}
\maketitle

\begin{abstract}
We study decision-making policies in Markov decision processes (MDPs). Two key performance indicators of such policies are their value and their interpretability. On the one hand, policies that optimize value can be efficiently computed via a plethora of standard methods. 
However, the representation of these policies may prevent their interpretability.
On the other hand, policies with good interpretability, such as policies represented by a small decision tree, are computationally hard to obtain. 
This paper contributes a local search approach to find policies with good value, represented by small decision trees. Our local search symbiotically combines learning decision trees from value-optimal policies with symbolic approaches that optimize the size of the decision tree within a constrained neighborhood. 
Our empirical evaluation shows that this combination provides drastically smaller decision trees for MDPs that are significantly larger than what can be handled by optimal decision tree learners.
\end{abstract}


\begin{figure*}[t]
    \centering
    \includegraphics[width=0.9\linewidth]{figures/simple-scheme-single-column-with-pics.png}
    \caption{Our proposed algorithm \integration visualized.
}
    \label{fig:integration}
\end{figure*}




\section{Introduction}\label{sec:intro}
Markov decision processes are \emph{the} standard model for decision-making under uncertainty. Policies describe for every state which action to take. 
Their \emph{value} is the expected return of executing the policy on the MDP. Traditionally, the focus in MDPs is to compute value-optimal policies. However, computing policies is often part of a larger methodology, in which \emph{interpretability} of the policy by either a human or a machine is essential. The objectives to compute interpretable and high-value policies are conflicting and can be resolved in different ways. This paper contributes a local search that finds small decision tree (DT) representations for almost value-optimal policies.

(Value-)optimal policies can be computed efficiently using classical algorithms such as value iteration, policy iteration, or linear programming~\citep{Put94}.
Using these algorithms provides a \emph{tabular} representation of the policy which scales linearly with the number of states in the MDP.
In contrast, \emph{neural} policies represent the policy as a neural network~\citep{bertsekas1996neuro}. They can be learned efficiently due to their differentiability. However, while neural networks may be concise, reasoning about the behavior of a policy represented by a neural network is challenging for humans and for machines. 
Finally, rule-based policy representations~\citep{DBLP:conf/aaai/GuptaTB15,DBLP:conf/icml/VermaMSKC18,batz2024programmatic}, in particular policies represented as DTs~\citep{bastani2018verifiable,DBLP:conf/aaai/TopinMFV21,vos2023optimal}, are promising due to their interpretability and generalizability. However, such policies are not differentiable. Consequently, computing interpretable high-value policies is computationally intractable for complex MDPs~\citep{vos2023optimal,andriushchenko2025small}.

This paper studies a tractable local search towards learning interpretable yet high-value policies, represented as DTs. It symbiotically combines two main lines of research from the literature: (1) \emph{(Data-driven) policy mapping}, which employs DT learning heuristics to obtain trees that match a value-optimal policy in the most relevant decisions~\citep{DBLP:conf/hybrid/AshokJJKWZ20,ashok2021dtcontrol}, and (2) \emph{bounded-depth (policy) tree learning} using symbolic reasoning that computes potentially value-optimal small DTs~\citep{vos2023optimal,andriushchenko2025small}.  

The main approach to \emph{data-driven policy mapping}, as implemented, e.g., in \dtcontrol~\citep{ashok2021dtcontrol}, consists of two sequential stages. 
First, an (almost) optimal MDP policy is computed using off-the-shelf tools. Then, greedy algorithms are used to represent this policy as a DT.
However, despite various tweaks, the size of the final DT depends significantly on the initial policy and the computation of that policy is not incentivized to obtain policies which can be represented by a small DT.
The main approach to \emph{bounded-depth policy tree learning} is to construct and solve a constraint system. This works either as an MILP encoding~\citep{vos2023optimal} or by an iterative abstraction-refinement loop that avoids solving a monolithic MILP~\citep{andriushchenko2025small}. Both approaches work with trees of depth up to $8$, and the latter approach also handles MDPs with up to millions of states. However, this comes at a significant computational cost, and finding trees of depth $8$ for large MDPs is beyond their current abilities. Section~\ref{sec:sota} contains additional information about these approaches.

We propose the local search algorithm \integration that symbiotically combines the approaches above. Figure~\ref{fig:integration} illustrates the main steps.
\integration computes a value-optimal policy and represents it as a DT $\mathcal{T}$ with a data-driven policy learner. We then iteratively improve upon $\mathcal{T}$. We first create a neighborhood around $\mathcal{T}$ containing DTs that are close to $\mathcal{T}$. Specifically, these trees differ from $\mathcal{T}$ in one subtree only. As the neighborhood is much smaller than the space of all decision trees, searching this neighborhood using a bounded-depth tree learner is significantly faster than exploring the space of all trees. This local search converges to a local optimum. To escape such an optimum, we perturb the tree in different ways before pruning another subtree. For this perturbation, we can use data-driven learning, which now starts with a policy that was already incentivized to be concise. 
On the technical level, we combine three state-of-the-art tools: \dtcontrol~\citep{ashok2021dtcontrol} for policy mapping, \dtpaynt~\citep{andriushchenko2025small} for bounded-depth optimization, and the model checker \storm~\citep{STORM} to obtain the initial optimal policy and to evaluate the intermediate candidate policies.

Our experiments confirm that \integration combines the strengths of both approaches. It can handle MDPs (and DTs) that are orders of magnitude larger than what a purely symbolic approach can handle, while it constructs DTs that are significantly smaller than the mapped optimal policies at a cost of only a few percent relative error. In particular, in 9 out of 13 benchmarks and with a timeout of one hour, we have obtained trees of sizes at most 25 (in our opinion within reach of explainability), while previously they were larger than that.



\section{Preliminaries and Problem~Statement}
A \emph{distribution} over a countable set $A$ is a~function $\mu \colon A \rightarrow \unitinterval$ s.t.~$\sum_a \mu(a) {=} 1$.
$a \sim \mu$ denotes $\mu(a) > 0$.
The set $\Distr(A)$ contains all distributions over $A$.
\begin{definition}[MDP]
A \emph{Markov decision process (MDP)} is a tuple $M = \mdpT$ with a finite set $S$ of states, an initial state $\sinit \in S$, a finite (indexed) set $\Act$ of actions, a partial transition function $\mpm \colon S \times \Act \nrightarrow \Distr(S)$, a reward function $\mrm \colon S \times \Act \rightarrow \reals$, and a discount factor $\df\in [0,1)$.
\end{definition}

For an MDP $M$, we define the \emph{available actions} in  $s \in S$ as $\Act(s) \coloneqq$ $ \{ \act \in \Act \mid \mpm(s,\act) \neq \bot \}$;
we denote $\mpm(s,\act,s') \coloneqq \mpm(s,\act)(s')$.
An MDP with $|\Act(s)|=1$ for each $s \in S$ is a \emph{Markov chain (MC)};
we denote MCs as tuples \mcT.
We assume that the states in an MDP are factored, i.e., composed of multiple features. 
Each feature is defined by a bounded integer variable from the set of variables \variables.
\emph{State predicates} are inequalities of the form $v \leq b$ with $v \in \variables$ and $b \in \integers$; the set of such predicates is denoted $\predicates$.
A state $s$ \emph{satisfies a predicate} $v \leq b$ iff $s(v) \leq b$; we denote this with $s \models (v \leq b)$.


\begin{figure*}[t]
    \begin{subfigure}{0.26\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/mdp-maze-example.pdf}
        \caption{}
        \label{fig:example:mdp}
    \end{subfigure}
    \begin{subfigure}{0.31\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/dt-example-suboptimal.pdf}
        \caption{}
        \label{fig:example:dt-suboptimal}
    \end{subfigure}
    \begin{subfigure}{0.41\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figures/dt-example-optimal.pdf}
        \caption{}
        \label{fig:example:dt-optimal}
    \end{subfigure}
    \caption{(a) Grid-world MDP example. The agent first chooses in which ``green'' cell it starts and aims to reach the target state while minimizing the number of steps. (b) A DT that represents a policy that reaches the target state in 7 steps showcased by the blue arrow in (a). (c) A DT that represents the optimal policy which reaches the target state in 6 steps showcased by the orange arrow in (a).}
    \label{fig:example}
\end{figure*}

\begin{example}
\label{example:mdp}
    Take the grid-world environment from Fig.~\ref{fig:example:mdp}. The agent initially chooses from which ``green'' cell it starts. In the grid, the agent can move in any of the four cardinal directions; however, walls restrict some of its movements. The goal is to minimize the number of steps it takes to reach the target. The optimal solution is 6 steps (the path highlighted with the orange arrow). We can model this environment as an MDP $M$ with 26 states (one state per grid cell plus one initial state), with actions allowing the movement $\{\uparrow,\rightarrow,\downarrow,\leftarrow\}$ and the choice of the initial position $\{S_1,\dots,S_5\}$. There are three state variables in this MDP: $\variables = \{x,y,\text{init}\}$. The initial state is the only state where $\text{init}=1$; the target state, for example, corresponds to the variable assignment $x=5,y=2,\text{init}=0$.
\end{example}

\paragraph{Policies.}
A (deterministic, memoryless) \emph{policy} is a function $\sched \colon S \rightarrow \Act$.
The set $\schedulers$ contains the policies for MDP $M$.
A~policy $\sched \in \schedulers$ induces the MC~$\imc = \left(S,\sinit,\mpm^{\sched},\mrm,\df \right)$ where \mbox{$\mpm^{\sched}(s) = \mpm(s,\sched(s))$}.
To compactly represent policies, it is convenient to omit actions defined for states that are unreachable in \imc.
The \emph{partial restriction} of a policy $\sched$ is a partial function
$\parres[\sched] \colon S \nrightarrow \Act$ where $\parres[\sched](s) \neq \bot$ iff state $s$ is reachable (from \sinit) in \imc.
The goal in an MDP is to find an optimal policy $\schedopt$ that maximizes the expected cumulative reward~\citep{Put94} over time. 
Formally, for a policy $\sched$, we define its \emph{value} 
$\val_{M}(\sched) = \E\left[\sum_{t=0}^{\infty}\df^t\mrm(s_t,\sched(s_t))\right]$
where $s_0 = \sinit$ and $s_{t+1}\sim \mpm(s_t,\sched(s_t))$;
we omit subscript $M$ whenever the MDP is clear from the context.
The optimal policy $\schedopt$ is then $\schedopt \in \argmax_{\sched \in \schedulers} \val_{M}(\sched)$.
Our approach also supports \emph{maximal reachability probability} and other temporal properties~\citep{Baier2018}.


\paragraph{The random action. }
To concisely represent policies, it is convenient to allow a policy to take some dedicated \emph{arbitrary} action. We make this explicit by adding to every state an action $\actrandom$ that uniformly selects one of the (available) actions. Formally, we define $M' = (S,\sinit,\Act',\mpm',\mrm,\df)$ with $\Act'(s)=\Act(s)\cup\{\actrandom\}$ and $\mpm'(s,\actrandom,s') = \frac{1}{|\Act(s)|}\sum_{\act \in \Act(s)} \mpm(s,\act,s')$. 
Henceforth, we assume that MDP $M' = M$, i.e., that every MDP contains an action $\actrandom$.
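As a small illustration with hypothetical numbers (not taken from our benchmarks), consider a state $s$ with $\Act(s) = \{\act_1,\act_2\}$, where $\act_1$ moves to $s_1$ with probability $1$ and $\act_2$ moves to $s_1$ and $s_2$ with probability $\tfrac{1}{2}$ each. Averaging over the two original actions, the random action yields
\begin{align*}
\mpm'(s,\actrandom,s_1) &= \tfrac{1}{2}\left(1 + \tfrac{1}{2}\right) = \tfrac{3}{4}, \\
\mpm'(s,\actrandom,s_2) &= \tfrac{1}{2}\left(0 + \tfrac{1}{2}\right) = \tfrac{1}{4},
\end{align*}
which again sums to $1$, i.e., $\mpm'(s,\actrandom)$ is a distribution over $S$.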

\paragraph{Correctness and interpretability of the random action.} 
Adding the random action makes it explicit that a policy may randomly pick any available action. Sometimes, having this opportunity makes for more concise policies. A possible downside is that the policy in $M'$ may not reflect a memoryless deterministic policy in $M$. 
The addition of the random action does not affect the values achievable in the MDP, i.e., adding the random action is sound. For a formal proof, refer to~\cite{andriushchenko2025small}.


\paragraph{Trees.}
A (binary) \emph{tree} is a tuple 
$T = (n_0,N,L,l,r)$  with the \emph{root node} $n_0$, the set~$N$ of \emph{inner nodes}, the set $L$ of \emph{leaf nodes}, and functions $l,r \colon N \rightarrow N\cup L $ defining the \emph{left} and \emph{right successors} of the inner nodes, respectively. 
A \emph{path} (of length $k$) in a tree~$T$ is a sequence $\sched = n_0\ldots n_k$ of nodes s.t.\ $\forall 0 {<} i {\leq} k: n_i \in \{l(n_{i-1}),r(n_{i-1})\}$.
Path $\sched$ is \emph{complete} if it ends in a leaf node.
The \emph{depth} of $T$, denoted $\mathsf{Depth}(T)$, is the length of its longest path.
The \emph{size} of $T$ is the number of its inner nodes.
 
\begin{definition}[Decision tree]
Assume an MDP $M = \mdpT$ with state features defined over the set~\variables.
A \emph{decision tree} (DT) for $M$ is a tuple $\tree = \treeT$ where (i)~$T$ is a binary tree, (ii)~\emph{predicate function} $\innerlabel \colon N \rightarrow \predicates$ assigns to inner nodes a state predicate, and (iii)~\emph{action function} $\leaflabel \colon L \rightarrow \Act$ assigns to leaf nodes an action. 
\end{definition}
%
We lift the notions of inner and leaf nodes, paths, depths, and sizes of trees to DTs.

\begin{definition}[Corresponding states]
The set $\sts{\tree}{n}$ of states that corresponds to a node $n$ is recursively defined as follows: $\sts{\tree}{n_0} = S$, $\sts{\tree}{l(n)} = \{ s \in \sts{\tree}{n} \mid s \models \innerlabel(n) \}$, and $\sts{\tree}{r(n)} = \{ s \in \sts{\tree}{n} \mid s \not\models \innerlabel(n) \}$.
\end{definition}
Note that the sets $\{ \sts{\tree}{n} \mid n \in L\}$ represent a partition of $S$. Thus, we can define $\leaf(s)$ as the unique leaf node $n \in L$ such that $s \in \sts{\tree}{n}$.


\begin{example}
\label{example:dt}
   Fig.~\ref{fig:example:dt-suboptimal} contains an example of a decision tree for the MDP from Example~\ref{example:mdp}. The inner nodes are represented by the rounded shapes and contain state predicates, and the leaf nodes are represented by rectangles and contain the chosen action. For example, the root node contains the state predicate $\text{init}\leq0$, which separates the initial state from the states representing cells in the grid. The middle leaf node with the action $\downarrow$ corresponds to the states that represent the cells in the last column of the grid, as these are the states that satisfy the predicate $\text{init}\leq0$ and do not satisfy the predicate $x\leq4$.
\end{example}

%

\begin{definition}[Induced policy]
\label{def:induced-policy}
The DT $\tree$ \emph{induces policy} $\sched[\tree] \colon S \rightarrow \Act$ with $\sched[\tree](s) = \leaflabel(\leaf(s))$ if $\leaflabel(\leaf(s)) \in \Act(s)$ and $\sched[\tree](s) = \actrandom$ otherwise. The \emph{value} $V(\tree)$ of the DT \tree is defined as the value of $\sched[\tree]$.
\end{definition}

\paragraph{Fallback action interpretability.}
    Our approach can synthesize a DT that assigns an action that is not available at a given state, in which case we use the random action as a fallback. This setup relieves the DT from having to capture precisely when an action is available and thus allows for smaller DTs. 
    Note that the information about which actions are available is usually also accessible in another format (e.g., masks or shields heavily used in reinforcement learning settings). 

\paragraph{Admissible error.} Instead of learning a tree that induces the optimal policy~\schedopt, to give more room for the tree size reduction, we admit trees having near-optimal value. We define the \emph{normalized relative error} of a tree \tree as
\[\terror(\tree) \coloneqq \frac{V(\schedopt)-V(\tree)}{V(\schedopt)-V(\schedrand)},\]
%
where $\schedrand$ is a randomized policy with $\schedrand(s) = \actrandom$ for every $s \in S$.
We include $V(\schedrand)$ as our lower bound for a tree value to provide a clearer sense of how distant a candidate $\tree$ is from an arbitrary one.
Given a relative error \error, we call the tree \emph{admissible} if $\terror(\tree) \leq \error$. 
Using the value of the worst policy as the lower bound instead would make the error bound less tight. Similar metrics have also been used recently~\citep{vos2023optimal, andriushchenko2025small}.
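As a worked illustration with hypothetical values (not taken from our experiments), suppose $V(\schedopt) = 10$ and $V(\schedrand) = 2$. A tree $\tree_1$ with $V(\tree_1) = 9.6$ and a tree $\tree_2$ with $V(\tree_2) = 9$ then have
\begin{align*}
\terror(\tree_1) &= \frac{10 - 9.6}{10 - 2} = 0.05, \\
\terror(\tree_2) &= \frac{10 - 9}{10 - 2} = 0.125,
\end{align*}
so for $\error = 0.1$ the tree $\tree_1$ is admissible whereas $\tree_2$ is not.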

\begin{example}
    Consider the MDP from Example~\ref{example:mdp}. The optimal policy in this MDP reaches the target in 6 steps and follows the orange arrow in Fig.~\ref{fig:example:mdp}. The minimal decision tree representing this policy has 5 decision nodes and is shown in Fig.~\ref{fig:example:dt-optimal}. If we admit an error that permits policies taking 7 steps to reach the target, we can find a DT with just 2 decision nodes for such a policy. This policy follows the blue arrow in Fig.~\ref{fig:example:mdp} and the DT representing this policy is shown in Fig.~\ref{fig:example:dt-suboptimal}.
\end{example}

\subsubsection{Problem Statement.}
We want to find a (near-)optimal policy represented by a small tree (in terms of the number of inner nodes):
\begin{mdframed}
\textbf{Bounded-value policy tree learning}: Given MDP $M$ and relative error \error, find a smallest DT \tree with $\terror(\tree) \leq \error$.
\end{mdframed}
Typically, finding the smallest DT is computationally intractable for complex MDPs. Therefore, akin to~\cite{vos2023optimal,andriushchenko2025small}, we are interested in an anytime algorithm: the faster we find a small DT that is within the error bound, the better.



\section{State of the Art}
\label{sec:sota}
We briefly recap two types of methodologies used to learn policy trees that form the foundation of our local search. Other related approaches are discussed in the related work.


\paragraph{Policy mapping.}
Policy mapping takes a fixed, typically value-optimal, policy and maps it to a DT. Most prominent for MDPs are greedy heuristics as implemented in \mbox{\dtcontrol~\citep{ashok2021dtcontrol}} and \uppaal~\citep{david2015uppaal,DBLP:conf/qest/AshokKLCTW19}.  
The DTs are learned from a dataset representing the policy where the states in $S$ are the input and the suggested actions in $\Act$ are the output.
This uses algorithms like CART~\citep{cart} or ID3~\citep{id3}, which recursively split the dataset by evaluating candidate predicates with some impurity measure~\citep{mitchell97} (like information gain or Gini index) and then greedily picking the most promising one.
The result is an approximation or exact representation of the original policy. 
Generally, these tools favor scalability over minimality. 

\paragraph{Bounded-depth tree learning.}
The alternative to mapping a fixed policy to a small tree is to search the space of small-tree policies. We review two approaches that both provide an anytime algorithm which converges to the value-optimal tree within the space of trees with depth at most $d$.
The  \omdt approach~\citep{vos2023optimal} uses constraint solving via mixed-integer linear programs (MILPs) to do so: The MILP encodes both the structural constraints on the policy (being $d$-implementable) and the constraint that the policy achieves the optimal value (using the standard LP formulation for maximal discounted rewards).
The \dtpaynt approach~\citep{andriushchenko2025small} searches this space by iteratively constructing candidate policies in this space and mapping these policies into the smallest possible tree.
The candidate policies are constructed by value iteration on variants of the original MDP, while the policy mapping is done with constraint solving. On MDPs with more than 3000 decisions (i.e., state-action pairs), this iterative approach is significantly faster and produces better trade-offs between the DT size and quality~\citep{andriushchenko2025small}. 
Even though this approach performs well for the bounded-depth tree learning problem, \dtpaynt (and indeed \omdt) cannot cope with settings where a DT of bigger depth is needed to represent a good policy. While we use \dtpaynt, we embed it into a loop to mitigate this disadvantage.




\section{Variable Neighborhood Search}
This section outlines our approach, discusses the main ingredients, and then combines these ingredients into the algorithm called \integration.

Our approach is inspired by the principles of \emph{Variable Neighborhood Search (VNS)}~\citep{DBLP:journals/cor/MladenovicH97}. In the classical VNS, a candidate solution is associated with a sequence of different neighborhoods; these neighborhoods are gradually explored until the first improving solution is obtained.
In order to avoid local optima, an additional \emph{perturbation step} is performed.
VNS has proven to be useful in solving complex combinatorial optimization problems such as large MILPs or continuous non-linear programs~\citep{Hansen2019}. In our case, we are also solving a multi-layered optimization problem: on the one hand, we are looking for a near-optimal policy that can be represented by a small DT, and on the other hand, we are trying to represent this policy as compactly as possible.

Abstractly, \integration proceeds as follows.
In the initialization phase, we solve an MDP, obtaining an (arbitrary) optimal policy \schedopt, and use \dtcontrol to learn a tree $\tree_0$ that induces its partial restriction $\parres[\schedopt]$.
Afterwards, \integration iteratively improves the tree.
Every iteration of \integration (illustrated in Fig.~\ref{fig:one-iteration}) consists of a \emph{subtree neighborhood exploration (SNE)} and \emph{tree perturbation}.
During SNE, we explore different \emph{tree neighborhoods} of $\tree_i$ in search of a smaller admissible tree;
we define a tree neighborhood of $\tree_i$ as a set of trees differing from $\tree_i$ in one of its subtrees, and we use bounded-depth tree learning to search for a subtree having fewer nodes.
If no such tree can be found, we terminate the search; otherwise, let $\tree_i'$  denote the new tree.
In the tree perturbation step, we compute various transformations of $\tree_i'$ and proceed to the next iteration with the smallest one.
We detail SNE, the neighborhoods it considers, and perturbations below.


\begin{figure}[t]
    \centering
    \includegraphics[width=0.95\linewidth]{figures/one-iteration.pdf}
    \caption{Scheme of one iteration of our approach. An iteration starts with tree $\tree_i$, which corresponds to $\pi_{\tree_i}$. \dtpaynt performs subtree neighborhood exploration and finds $\tree_{i}'$. This tree is then translated to its corresponding policy $\pi_{\tree'_i}$. Two tree perturbations are performed from this policy: (i)~running \dtcontrol to obtain an alternative DT representation $\tree'_{i,\mathsf{alt}}$, and (ii)~repairing the policy using model checking in \storm to obtain $\pi^{*}_{\tree'_{i}}$ and running \dtcontrol on this new policy to obtain $\tree'_{i,\mathsf{fix}}$. At the end of the iteration, we have 3 candidate trees available: $\tree'_{i}$, $\tree'_{i,\mathsf{alt}}$, and $\tree'_{i,\mathsf{fix}}$.}
    \label{fig:one-iteration}
\end{figure}

\subsection{Subtree Neighborhood Exploration}
Subtree neighborhood exploration (SNE) explores modifications of the tree $\tree$ where one of its subtrees is replaced with a smaller one.
In particular, assume an inner node $n \in N$ of~$\tree$.
$\tree|_n$ denotes a subtree of $\tree$ with root $n$ having sets $N|_n$ and $L|_n$ of inner and leaf nodes, respectively.
The $(n,d)$-neighborhood of $\tree$ consists of all trees that coincide with $\tree$ everywhere except at the inner node $n$, where the subtree $\tree|_n$ is replaced by a subtree of depth $d$; the $n$-neighborhood of $\tree$ is the union of all $(n,d)$-neighborhoods with $d < \mathsf{Depth}(\tree|_n)$.

To efficiently explore the $n$-neighborhood of $\tree$, we use off-the-shelf bounded-depth tree learning implemented in \dtpaynt.
For this purpose, we construct a sub-MDP $M|_n^\tree$ where we fix the actions in all states outside $\sts{\tree}{n}$ according to $\sched[\tree]$. Formally, we define $M|_n^\tree = \left(S,\sinit,\Act|_n^\tree,\mpm|_n^\tree,\mrm,\df\right)$ where $\Act|_n^\tree(s) = \Act(s)$ if $s \in \sts{\tree}{n}$, and $\Act|_n^\tree(s) = \{\sched[\tree](s)\}$ otherwise; $\mpm|_n^\tree$ is the restriction of $\mpm$ to these actions.
\dtpaynt enables efficient learning of a depth-bounded tree for this sub-MDP where the states having only one available action are ignored.
If \dtpaynt learns an admissible tree, this tree replaces the subtree $\tree|_n$ in $\tree$; let $\tree'$ denote the new tree.
If \dtpaynt fails to obtain a smaller tree, we continue the search in the $n'$-neighborhood where $n'$ is another inner node (details follow in Sec.~\ref{sec:integration}).
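To illustrate only the sub-MDP construction on the running example (a sketch based on the DT described in Example~\ref{example:dt}), let $n$ be the inner node labeled with $x \leq 4$. Its corresponding states $\sts{\tree}{n}$ are exactly the states satisfying $\text{init} \leq 0$, i.e., the states representing grid cells, so the only state outside $\sts{\tree}{n}$ is $\sinit$. Hence,
\[
\Act|_n^\tree(s) =
\begin{cases}
\Act(s) & \text{if } s \models (\text{init} \leq 0),\\
\{\sched[\tree](\sinit)\} & \text{if } s = \sinit,
\end{cases}
\]
i.e., the choice of the starting cell is fixed by $\tree$, and only the decisions about the movement in the grid remain open for the bounded-depth tree learner.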


\subsection{Tree Perturbation}

While SNE can succeed in modifying and reducing the subtrees of $\tree$, it cannot, in principle, modify ``higher'' nodes of $\tree$ that lie closer to its root.
As a result, even repeated applications of SNE have a slim chance of obtaining a tree $\tree'$ that drastically differs from $\tree$.
Tree perturbation serves to avoid such local optima.
The aim here is to systematically perturb $\tree'$ in search of other (admissible) trees that differ in these higher nodes.
We consider two possible perturbations of $\tree'$ via \emph{tree reconfiguration} or \emph{policy repair}.
Both variants are illustrated in Fig.~\ref{fig:one-iteration} and are described below.

\paragraph{Tree reconfiguration.}
Let $\sched[\tree']$ be the policy induced by $\tree'$.
This policy can be translated back to a decision tree $\tree'_{\mathsf{alt}}$ using any data-driven tree learning approach, such as \dtcontrol. We note that there is no particular reason why the resulting $\tree'_{\mathsf{alt}}$ should coincide with the earlier $\tree'$.

\paragraph{Policy repair.}
Note that SNE may achieve a smaller tree $\tree'$ at the price of a reduced value (while still guaranteeing that $\tree'$ is admissible). We can \emph{repair} $\sched[\tree']$ to compensate for the new choices defined by the new subtree.
For this purpose, we construct a sub-MDP $M_{\mathsf{fix}}$ where we fix the actions in all states corresponding to $\tree'|_n$ according to $\sched[\tree']$. Formally, we define $M_{\mathsf{fix}} = \left(S,\sinit,\Act_{\mathsf{fix}},\mpm_{\mathsf{fix}},\mrm,\df\right)$ where $\Act_{\mathsf{fix}}(s) = \{\sched[\tree'](s)\}$ if $s \in \sts{\tree'}{n}$, and $\Act_{\mathsf{fix}}(s) = \Act(s)$ otherwise; $\mpm_{\mathsf{fix}}$ is the restriction of $\mpm$ to these actions.
We then use any MDP solver to obtain the maximizing policy $\sched[\tree']^*$ for $M_{\mathsf{fix}}$; note that $\sched[\tree']^*$ coincides with $\sched[\tree']$ for every state $s \in \sts{\tree'}{n}$. Finally, we use \dtcontrol to translate this policy into $\tree'_\mathsf{fix}$, another perturbation of $\tree'$.
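The effect of policy repair on the value can be sketched as follows, assuming that $M_{\mathsf{fix}}$ is solved optimally. The policy $\sched[\tree']$ is itself a policy of $M_{\mathsf{fix}}$, and every policy of $M_{\mathsf{fix}}$ is also a policy of $M$, so
\[
V(\sched[\tree']) \;\leq\; V(\sched[\tree']^*) \;\leq\; V(\schedopt).
\]
Thus, the repair step itself never decreases the value. Under the additional assumption that \dtcontrol represents $\sched[\tree']^*$ exactly on all reachable states, $\tree'_{\mathsf{fix}}$ is admissible whenever $\tree'$ is.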


\subsection{$\integration$: Our Local Search Approach}
\label{sec:integration}

\integration is our implementation of the variable neighborhood search with subtree neighborhood exploration and tree perturbations outlined above. We now clarify how we select the hyperparameters for SNE and how we pick the perturbed tree for the next iteration.

\paragraph{Depth of subtrees.} Before we start an SNE step, we must determine on which neighborhood to run it. First, we only consider subtrees with a fixed limited depth $d$. Based on our testing, we set $d=7$ as it produced the most consistent results. This allows us to focus on smaller neighborhoods where finding a good tree is easier, and it also means that all subtrees are disjoint, so we do not optimize over similar state spaces multiple times.
Moreover, in our benchmarks, the depth-7 subtrees typically cover a significant part of the DTs produced by \dtcontrol: the depth of these DTs is between 6 and 19 and the larger trees are typically not balanced.
Additionally, the bounded-depth approaches~\citep{vos2023optimal,andriushchenko2025small} work best for smaller depths, especially given the stricter timeout we discuss later.

\paragraph{Order of the neighborhoods.} We place these subtrees (or rather, their root nodes) into a queue ordered so that the most promising subtrees are improved first. Our early experiments have shown that the order of the subtrees is not very important; however, it can sometimes lead to smaller DTs more quickly.
Intuitively, subtrees that require a large number of nodes to represent the policy for a small number of corresponding states are good candidates for splitting. For this, we use the following values: (i)~the ratio $\frac{|\sts{\tree}{n}|}{|N_{n}|}$, where $N_{n}$ are the descendant nodes of $n$ in $\mathcal{T}$, and (ii)~the number of predicates in the node $n$ whose impurity~\citep{ashok2021dtcontrol} value is close to the impurity value of the chosen predicate in $n$.
Predicates with low impurity values offer a better chance of obtaining a small subtree when they are chosen. \dtcontrol chooses one predicate with the lowest impurity score; however, if there are many such predicates in a given node, there is a higher chance that \dtcontrol did not make the best choice, and we try to leverage this fact. The impurity scores are a byproduct of \dtcontrol and are, therefore, computationally cheap proxies.
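As a purely illustrative instance of the first criterion, with hypothetical numbers that are not taken from our benchmarks, consider two candidate roots $n_1$ and $n_2$, where $n_1$ has $30$ descendant nodes covering $60$ states and $n_2$ has $10$ descendant nodes covering $5000$ states:
\[
\frac{|\sts{\tree}{n_1}|}{|N_{n_1}|} = \frac{60}{30} = 2,
\qquad
\frac{|\sts{\tree}{n_2}|}{|N_{n_2}|} = \frac{5000}{10} = 500 .
\]
Under the intuition above, $n_1$ spends many nodes on few states and would therefore be regarded as the more promising candidate; how the two criteria are combined into a single queue order is not captured by this sketch.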

\paragraph{Timeout for SNE.} While we could run every SNE step exhaustively, we instead exploit the anytime nature of the SNE solvers and use a timeout of 60 seconds. This allows us to consider more neighborhoods, which our early experiments showed to matter more for the performance of the algorithm than optimizing individual subtrees more thoroughly. We call an SNE step successful if it finds a smaller subtree.

\paragraph{Closing the loop.} Once a successful SNE step occurs and a tree $\mathcal{T}^{'}_{i}$ is obtained, the tree perturbation is performed. The smallest tree from the set $\{\tree'_{i}, \tree'_{i,\mathsf{alt}}, \tree'_{i,\mathsf{fix}}\}$ is greedily picked as the new optimum for the next iteration. We also experimented with selection based on further factors, such as the current value of the tree or its depth, but this made no noticeable difference. If $\mathcal{T}_{i+1} = \mathcal{T}^{'}_{i}$, i.e., the smallest tree was obtained by the SNE step, we add the newly emerged subtrees of depth $d$ in $\mathcal{T}^{'}_{i}$ to the current subtree queue. If one of the other trees in TP was used, we fully recompute the subtree queue. The next iteration then begins with another SNE step on $\mathcal{T}_{i+1}$ and the new subtree queue.
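The greedy choice described above can be summarized as follows, where $|\cdot|$ is assumed to denote the number of nodes of a tree and ties are assumed to be broken arbitrarily (the tie-breaking rule is not fixed by the description above):
\[
\mathcal{T}_{i+1} \in \operatorname*{arg\,min}_{\mathcal{T} \in \{\tree'_{i},\, \tree'_{i,\mathsf{alt}},\, \tree'_{i,\mathsf{fix}}\}} |\mathcal{T}| .
\]
For instance, with hypothetical sizes $|\tree'_{i}| = 14$, $|\tree'_{i,\mathsf{alt}}| = 17$ and $|\tree'_{i,\mathsf{fix}}| = 11$, the repaired tree $\tree'_{i,\mathsf{fix}}$ would be selected and the subtree queue would be fully recomputed.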




\section{Experimental Evaluation}
We now investigate the performance of our approach. \integration is implemented in Python, on top of \dtpaynt~\citep{andriushchenko2025small}, \dtcontrol~\citep{ashok2021dtcontrol}, and \storm~\citep{STORM}, see also Sec.~\ref{sec:sota}. The implementation and all benchmarks are publicly available\footnote{\url{https://doi.org/10.5281/zenodo.15642002}}.
Our evaluation aims to answer the following four questions:
\begin{enumerate}[topsep=2pt,leftmargin=2em]
    \item[Q1:] \emph{Does \integration scale to larger MDPs than \dtpaynt, the state-of-the-art approach for learning small DTs?}
    \item[Q2:] \emph{Does \integration learn smaller trees than \dtcontrol on large MDPs?} 

  \item[Q3:]  \emph{Does \integration outperform standard DT pruning?}
    
    \item[Q4:]  \emph{What is the impact of tree perturbation steps?}
\end{enumerate}

\paragraph{Experimental setting.} All experiments were run on a machine with an AMD EPYC 9124 16-core CPU and 380GB RAM. Each experiment runs on a single core with a 64GB memory limit. The timeout for all experiments was 1 hour. All algorithms/tools considered in the experimental evaluation are deterministic and the timing variation is negligible.



\paragraph{Benchmarks.}
We consider two types of benchmarks: \textbf{(1)}~4~models from~\citep{andriushchenko2025small} including 2 models from~\citep{vos2023optimal}. On these models, neither \dtpaynt~\citep{andriushchenko2025small} nor \omdt~\citep{vos2023optimal} was able to find a DT that is smaller than the DT found by \dtcontrol and has a normalized value better than 0.75.
\textbf{(2)}~9 models from the standard MDP benchmarks from the QComp evaluation~\citep{budde2020correctness}. These models are parametrized, which allowed us to scale the size of the models. The left part of Tab.~\ref{tab:main-table} shows the size of the underlying MDPs and the number of state variables (more detailed information about the models is reported in Appendix~\ref{app:sec:models}).
For fairness, we equip all models with the action $\actrandom$ in all states, since \dtpaynt adds this action implicitly.

\begin{table*}[t]
\setlength{\tabcolsep}{2pt}
\centering
\input{tables/main-table-new}
\caption{Comparison between \dtpaynt, \dtcontrol and \integration. \dtcontrol
maps the optimal policy to a DT, while
\dtpaynt and \integration search for DTs with a normalized relative error of at most 0.05.
``-'' indicates that \dtpaynt reached the 1-hour timeout or the 64GB memory limit.
For \integration, we also report the number of iterations (the number of subtrees analyzed using \dtpaynt) and the relative size (i.e., number of nodes) compared to the smallest tree produced by the other two methods (smaller values are better for \integration). \integration always finishes its last iteration, which explains run times $>$1h.
}
\label{tab:main-table}
\end{table*}

\paragraph{Baselines.}
\dtpaynt searches for a DT with the smallest depth having the desired value. We set the maximum depth to 10, since \dtpaynt is not able to effectively explore deeper DTs~\citep{andriushchenko2025small}\footnote{For the \emph{cons-6-2} and \emph{ij-14-s} models requiring larger DTs, we tried to increase the maximal depth, but it did not help.}. \dtpaynt terminates as soon as the first DT is found; we report the time it takes or the timeout.  We run \dtcontrol with the same preprocessing that is part of \integration, i.e., we remove trivial choices and unreachable states. Note that this preprocessing is essential for the performance of \dtcontrol.





\paragraph{Results.}
For every benchmark, we run the different tools to obtain a resulting DT. We report the depth, the number of inner nodes, and the normalized relative error from the optimal value which captures the quality of the DT: 0 corresponds to no error (i.e. the DT represents an optimal policy), and 1 means the DT represents the uniform random policy, which can be represented by a 0-DT that chooses action $\actrandom$ in its only leaf node.
Tab.~\ref{tab:main-table} summarizes the experimental results for Q1 and Q2. The last column shows the relative size of the DTs produced by \integration compared to the smallest DT produced by \dtpaynt or \dtcontrol (smaller values are better for \integration, and values above 100\% indicate an increase in size).
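
To make the reported error concrete, write $V_{\mathcal{T}}$ for the value of a tree $\mathcal{T}$, $V_{\mathsf{opt}}$ for the optimal value, and $V_{\mathsf{rand}}$ for the value of the uniform random policy (this notation is introduced here only for illustration, and we assume $V_{\mathsf{opt}} \neq V_{\mathsf{rand}}$). The description above is then consistent with reading the normalized relative error as
\[
\mathsf{err}(\mathcal{T}) = \frac{|V_{\mathsf{opt}} - V_{\mathcal{T}}|}{|V_{\mathsf{opt}} - V_{\mathsf{rand}}|},
\]
which equals $0$ for $V_{\mathcal{T}} = V_{\mathsf{opt}}$ and $1$ for $V_{\mathcal{T}} = V_{\mathsf{rand}}$. For example, with the hypothetical values $V_{\mathsf{opt}} = 10$, $V_{\mathsf{rand}} = 2$ and $V_{\mathcal{T}} = 9.6$, we get $\mathsf{err}(\mathcal{T}) = 0.4/8 = 0.05$.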





\subsection*{Q1: $\integration$ vs. $\dtpaynt$}

In Tab.~\ref{tab:main-table}, we consider a bound of 0.05 on the normalized relative error (which is part of the input) and report the smallest DTs found by the tools.
We observe that in 8 out of 13 models, \dtpaynt is unable to find a DT with relative error less than 0.05; it reaches either the 1-hour timeout or the memory limit. In contrast, \integration is able to find DTs with the required value for all the models. For 3 out of the 5 models where \dtpaynt was able to find the desired tree, \integration finds a significantly smaller alternative. In the remaining 2 cases, two variants of the \emph{firew} model, DTs of depth 2 are sufficient to achieve the desired value. The incremental search strategy implemented in \dtpaynt was able to find these DTs after a few minutes, while \integration does not allow \dtpaynt to search that long on a single subtree. However,
for any benchmark that does not allow for a policy with a tiny tree, \integration scales significantly further than \dtpaynt (or \omdt).

 








\subsection*{Q2: $\integration$ vs. $\dtcontrol$}


Recall that \dtcontrol favors scalability over minimality of the resulting DTs; for the MDPs in this benchmark, it is therefore very fast. Note that \dtcontrol always maps the optimal policy to a DT, i.e., it has error 0.
The results in Tab.~\ref{tab:main-table} show that \integration produces significantly smaller DTs than \dtcontrol. In many cases, the reduction is substantial both in terms of the size and the depth: from hardly explainable DTs with over 60 or even 200 nodes, we get DTs with around 10 to 20 nodes that represent an explainable policy achieving almost optimal values.





To showcase the flexibility of our approach, we increased the acceptable error to 10\% to obtain even more explainable policies. In 10 out of 13 models, \integration finds smaller DTs than it found with the 5\% error bound. These DTs are on average 35\% smaller; the greatest decreases occurred on \emph{pacman-30}, from 23 to 10 nodes, and on \emph{wlan-4}, from 25 to 12 nodes. Complete results for the error threshold of 10\% can be found in Appendix~\ref{app:sec:error} in Tab.~\ref{tab:app-different-threshold}.

\subsection*{Q3: $\integration$ vs. tree pruning}

We also compare \integration to the standard DT pruning~\citep{mitchell97}, which would be the straightforward scalable solution assembled from the previously available components. 
Starting from an exact DT representation of an optimal policy (delivered by \dtcontrol), the idea is to iteratively merge leaf nodes and run a model-checker to ensure that the value of the candidate DTs is still above the given threshold (in this case, given as 5\% normalized relative error). 
This idea was implemented in \cite{CAV15} (together with learning the full tree using a heuristic weighing the decisions by their ``importance'').
However, due to the unavailability of that code, we use an alternative implementation of pruning available in \dtcontrol.
Tab.~\ref{tab:pruning-table} summarizes the results\footnote{The implementation does not support discounted rewards, and thus the table includes only a subset of the models.}; it reports the relative size reduction (in terms of the number of nodes) compared to the DT (provided by \dtcontrol) that represents the optimal policy. Note that \integration is able to perform a preprocessing step on the optimal policy that removes unreachable states. This preprocessing was not possible with the available implementation of pruning. Therefore, to compare only the effect of these techniques and not the effect of preprocessing, the column size\% reports the reduction from the preprocessed policy. We see that except for the \emph{firew-false} model,
\integration achieves a significantly better reduction than the pruning technique. These results clearly show that for more complicated policies, a simple pruning technique is not sufficient. Additionally, we include the column size$\star$\% which also includes the effect of preprocessing on the reduction for \integration.


\begin{table}[t]
\setlength{\tabcolsep}{2pt}
\centering
\input{tables/pruning-table-new}
\caption{Comparison between \integration and a tree pruning technique implemented in a new version of \dtcontrol. It reports the normalized relative error and the relative size of the resulting DTs compared to the DT (computed by \dtcontrol) that represents the optimal policy. size$\star$\% reports the size reduction including a preprocessing step on the initial optimal policy, which we were unable to perform for the pruning approach.}
\label{tab:pruning-table}
\end{table}


\subsection*{Q4: Ablation Study}
This section investigates the impact of the tree perturbations in \integration. We have implemented \integrationnaive, a simpler variant of \integration where \dtcontrol is called only once, to obtain the initial DT $\tree$ for the optimal policy $\schedopt$, and only SNEs are performed using \dtpaynt with no tree perturbations.
To study the impact of perturbations and the ability to escape local optima, we ran \dtcontrol with different parameters, which gave us, for each model, four different initial DTs with different topologies, sizes and impurity values.
Tab.~\ref{tab:robustness-table} summarizes the results: it reports the average relative size (over the 4 runs) of the resulting DTs produced by \integration and \integrationnaive, compared to the size of the initial DT. The complete results can be found in Appendix~\ref{app:sec:ablation}.


\begin{table}[t]
\setlength{\tabcolsep}{1pt}
\centering
\input{tables/robustness-table-2}
\caption{Comparison between \integration and \integrationnaive. It reports
the average relative size of the resulting DTs (achieved over 4 different initial DTs) compared to the size of the initial DT (again, lower numbers are better).}
\label{tab:robustness-table}
\end{table}

On most of the models, \integration provides slightly better results, showing that while \integrationnaive performs very well, the perturbations contribute to the effectiveness of the approach.
The most extreme examples are the models \emph{cons-6-2}, where in one case \integration finds a 10 times smaller tree than \integrationnaive, and \emph{csma-3-2}, where on the biggest initial tree of size 1856 \integration obtains a tree with just 8 nodes compared to the 603 nodes produced by \integrationnaive.
However, overall these results show that the more important part of \integration is the SNE.


\section{Related Work}




Decision trees have been used for a \emph{post-hoc} representation of policies with guarantees on $\varepsilon$-optimality since \citet{CAV15}.
This seminal paper (i) learns a DT using a heuristic assigning ``importance'' to each decision and then (ii) applies standard DT pruning (of the C4.5 algorithm \citep{mitchell97}) step by step until the given imprecision $\varepsilon$ is incurred.
In contrast, most of the subsequent work has focused on representing the policies \emph{exactly}, e.g., \citep{DBLP:conf/tacas/BrazdilCKT18,DBLP:conf/qest/AshokBCKLT19,DBLP:conf/qest/AshokKLCTW19,DBLP:journals/sttt/JungermannKW23,DBLP:conf/vecos/BuddeDH24}, notably the tool dtControl \citep{DBLP:conf/hybrid/AshokJJKWZ20,ashok2021dtcontrol}.
The precise representation is particularly useful for detecting bugs~\citep{CAV15,ashok2021dtcontrol} and validation of modeling~\citep{KushJonis}.
Binary decision diagrams (BDDs) can also be used for this purpose, with the caveats described in~\citep{JanHSCC25}, and their generalizations can even be used for computing optimal policies~\citep{DBLP:conf/uai/HoeySHB99,DBLP:conf/tacas/AlfaroKNPS00}.

DTs have also been used \emph{during} the search for optimal policies, both in classical dynamic programming \citep{DBLP:conf/ijcai/BoutilierDG95,DBLP:conf/icml/BoutilierD96} and in reinforcement learning \citep{pyeatt2001decision,DBLP:conf/aaai/GuptaTB15,DBLP:conf/aaai/TopinMFV21}, recently also to generalize policies on small instances of parametrized models to larger ones~\citep{DTstrat}.

Methods for post-hoc representation without guarantees on the value typically process a more complex teacher policy into a DT via imitation learning~\citep{bastani2018verifiable} or distillation~\citep{kohler2024interpretable}. Alternatively, the underlying value functions can be approximated using linear models~\citep{du2019good} or shallow neural networks~\citep{rusu2015policy,julian2016policy}.
However, these approaches have been applied primarily in the context of reinforcement learning and typically do not provide any optimality guarantees.




\section{Conclusion}
We have provided a new method to represent MDP policies approximately (with a given bound on the suboptimality) by much smaller trees than the state-of-the-art approaches.
In contrast to the tools providing exact representation, our smaller trees offer a more realistic approach to explainability.
In contrast to the methods providing the smallest possible trees, our method easily scales to million-state MDPs.
Finally, in contrast to scalable imprecise methods such as pruning, our method analyzes the trees with more care, achieving both smaller trees and better precision.
Altogether, the combination of a DT-learning tool, a model checker and an inductive SMT learner results in a balanced synergy.
In future work, we will aim at an even closer intertwining of the tools, so that the DT-learner provides even richer guidance to the inductive learner, assisted by more frequent but also more approximate feedback from the model checker.
Future extensions also include working with richer domain-specific predicates to improve tree compactness and explainability, 
and designing better perturbation approaches to escape local optima more efficiently.

\subsubsection{Acknowledgments.}
This work has been supported by the Czech Science Foundation grant \mbox{GA23-06963S} (VESCAA), the IGA VUT project FIT-S-23-8151, the NWO VENI Grant ProMiSe (222.147), the MUNI Award in Science and Humanities grant (MUNI/I/1757/2021), and
the ERC project \mbox{InOVationCS} (101171844).



% References
 % \clearpage

\bibliography{bibliography,references}

\newpage
\onecolumn

% \title{Title in Title Case\\(Supplementary Material)}
% \maketitle

\appendix
\include{appendix}



\end{document}
