\section{Introduction}

\paragraph{Deterministic Markov Decision Processes.}
Deterministic Markov Decision Processes (DMDPs)~\citep{Puterman94} are a mathematical framework for sequential decision-making in which an agent interacts with a fully deterministic environment. A DMDP is modeled as a graph, where in every step the controller chooses a successor vertex from the neighbors of the current vertex. Repeating this process generates an infinite sequence of vertices (called a run). A policy for the controller specifies the choice of successor at every vertex. A payoff function assigns a real value to every run. We consider a classical and well-studied function: the mean-payoff (or limit-average) payoff function~\citep{Puterman94,Filar12}. Every edge of the graph is assigned an integer weight, and the
payoff of a run is the long-run average of the weights of its edges.
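One common way to write the mean-payoff of a run is sketched below; the symbols $\rho$, $w$, and $\mathrm{MP}$ are notation introduced here for illustration only and may differ from the formal definitions used later. Assuming a run $\rho = v_0 v_1 v_2 \cdots$ and a weight function $w$ on edges,
\[
  \mathrm{MP}(\rho) \;=\; \liminf_{k \to \infty} \frac{1}{k} \sum_{i=0}^{k-1} w(v_i, v_{i+1}).
\]
When the controller follows a fixed policy that depends only on the current vertex, the run eventually repeats a single cycle, so the limit exists and equals the average weight of the edges on that cycle.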

\paragraph{Applications.}
This formalism is particularly relevant in settings where system behaviors are fully known, such as controlled robotic environments or algorithmic planning tasks~\citep{blondel2000survey}. DMDPs also appear in formal verification and synthesis, where deterministic transitions allow for tractable analysis of safety, liveness, and temporal logic specifications~\citep{Baier08}. For example, in autonomous systems, DMDPs are a basic model for synthesizing controllers that guarantee correct behaviors~\citep{Alur15}.
Besides the practical applications, DMDPs can demonstrate fundamental computational limits, e.g., NP-hardness of optimal planning~\citep{Littman97}. Furthermore, this model corresponds to classical directed weighted graphs, which have many applications such as network routing and travel planning~\citep{Cormen22}.

\paragraph{Motivation.} One of the main algorithms in the area of planning and sequential decision-making is Howard's policy iteration~\citep{Howard60}. \citet{Dasdan04} compared various algorithms and showed that Howard's algorithm performs well in practice compared to other algorithms. Furthermore, lower and upper bounds for policy iteration algorithms have deep theoretical impact, e.g., lower bounds for policy iteration lead to lower bounds for pivoting approaches in linear programming~\citep{Friedmann11}. Hence, a better theoretical understanding of Howard's algorithm is an interesting problem. Our work aims at the theoretical understanding of this fundamental algorithm for DMDPs with mean-payoff objectives. We first recall the previous results from the literature.

% \jakob{As additional motivation, we could add that there are still open questions regarding strategy iteration for which a better understanding of Howard's could be valuable -- and potentially circle back to that in the section/remark where we say that our example also works for discounted objective with $\lambda$ very close to 1.}\ali{I added more motivations.}



\paragraph{Previous Work on Howard's Policy Iteration.}
Howard's policy iteration has been extensively studied for general MDPs (not necessarily DMDPs) with discounted-sum and mean-payoff objectives. For MDPs with mean-payoff objectives, the best-known upper bound is exponential~\citep{Puterman94}. \citet{Fearnley10}, inspired by the work of \citet{Friedmann09}, showed that Howard's algorithm requires exponential time for a specific family of MDPs. For discounted-sum objectives, better upper bounds are known in special cases of the discount factor: \citet{Post15} established a strongly polynomial running time when the discount factor is constant or represented in unary, and \citet{hansen2013strategy} improved Post's bounds and extended these results to two-player settings. However, the exponential lower bound of \citet{Fearnley10} still holds for arbitrary discount factors. Related problems, such as MDPs with total-reward objectives (analogous to the stochastic shortest path), also exhibit exponential lower bounds for strategy iteration~\citep{Fearnley10}. While exponential lower bounds are established for stochastic models~\citep{Fearnley10}, game-theoretic models~\citep{Friedmann09}, and linear programming pivoting rules~\citep{Friedmann11}, identifying lower bounds specifically for Howard's policy iteration in simpler deterministic graph models has remained a compelling open question. Our contribution addresses this by presenting improved lower bounds for this simplest model, complementing the more general cases already explored in the literature.

\paragraph{Recent Results on Howard’s Policy Iteration.}
There are several recent advances on Howard's policy iteration in the literature, which highlight ongoing research in this direction. \citet{LoffSkomra24} demonstrated polynomial-time smoothed complexity for deterministic models (DMDPs and turn-based games with mean-payoff and discounted-sum objectives). However, \citet{ChristYannakakis23} presented a sub-exponential lower bound for the smoothed complexity of Howard's algorithm for stochastic MDPs with mean-payoff objectives. Moreover, \citet{AsadiLICS24} recently improved complexity results for two-player turn-based discounted games with unary weights through a new analysis of Howard's policy iteration.


\paragraph{Previous Work on DMDPs.}
DMDPs have been studied extensively in the literature~\citep{Arora12,Boone23,Castro20,Madani02,MadaniTZ09}. \citet{Karp78} presented an algorithm for solving DMDPs with mean-payoff objectives in $\calO(mn)$ time (where $m$ is the number of edges and $n$ is the number of vertices), while \citet{Young91} proposed an $\calO(mn + n^2 \log n)$-time algorithm that often performs better in practice despite its slightly worse time complexity. Although Howard's policy iteration works well in practice~\citep{Dasdan04}, the known theoretical upper and lower bounds for DMDPs with mean-payoff objectives are as follows: The best-known upper bound is exponential~\citep{Puterman94}.
% $\leftarrow$\jakob{Since the number of positional strategies is at most $n^n$ this result might just be a corollary of the correctness of Howard's}. 
Moreover, a better parametric bound of $\calO(n^3 W)$ on the number of iterations can be obtained, since by~\citet{Howard60}, the number of policy iteration steps is at most the number of value iteration steps, which by~\citet{zwick1996complexity} is bounded by $\calO(n^3 W)$, where $W$ is the maximum absolute weight.
\citet{hansen2010lower} presented a lower bound by constructing DMDPs with $2n$ vertices, $m$ edges, and edge weights of $\calO(n^{n^2})$, on which the algorithm requires $m-n+1$ iterations to find an optimal policy.
For example, (a)~with $m=\calO(n)$, this result shows that on inputs of size $\widetilde{\calO}(n^3)$, the algorithm requires $\Omega(n)$ iterations, and (b)~with $m=\calO(n^2)$, it shows that on inputs of size $\widetilde{\calO}(n^4)$, the algorithm requires $\Omega(n^2)$ iterations. In particular, given the input description of a DMDP with $I$ bits, the result shows that the lower bound on the number of iterations is $\widetilde{\Omega}(\sqrt{I})$. Establishing and improving lower bounds is a challenging task in computer science, and whether this lower bound can be improved is a fundamental problem, which we address in this work.
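
The arithmetic behind these estimates can be sketched as follows, assuming that every edge weight is encoded in binary and ignoring lower-order terms (the encoding is an assumption made here for illustration). Each weight of magnitude $\calO(n^{n^2})$ takes $\calO(n^2 \log n)$ bits, so
\[
  I \;=\; \calO\bigl(m \cdot \log n^{n^2}\bigr) \;=\; \calO\bigl(m\, n^2 \log n\bigr).
\]
With $m = \Theta(n^2)$ this gives $I = \calO(n^4 \log n)$, while the number of iterations is $m - n + 1 = \Theta(n^2)$, which is $\widetilde{\Omega}(\sqrt{I})$ once logarithmic factors are hidden.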


% Dasdan~\cite{Dasdan04} compared various algorithms including Howard’s policy iteration~\cite{Howard60} and demonstrated that Howard’s algorithm often outperforms Karp’s and is comparable with Young et al.’s in speed. However, the worst‐case complexity of Howard’s algorithm has been a mystery: Fearnley~\cite{Fearnley10}, inspired by the work of Friedmann~\cite{Friedmann09}, showed that Howard’s algorithm takes exponential time for a family of hand-crafted MDPs. Moreover, Hansen et al.~\cite{hansen2010lower} constructed a family of DMDPs, on which Howard's algorithm performs quadratic iterations. Upper bounds for Howard's algorithm have also been established. Post et al.~\cite{Post15} established a strongly polynomial running time for MDPs with discounted-sum objectives; Hansen et al.~\cite{hansen2013strategy} improved Post’s bounds and extended the results to 2‐player settings, but again this result relies heavily on discounting. Thus, none of these results can be extended to the DMDPs with mean-payoff objectives. 





\paragraph{Our Contributions.}
The above motivates the study of Howard's policy iteration for DMDPs with mean-payoff objectives. In this work, we construct a family of DMDPs with $2n$ vertices, $\calO(n^2)$ edges, and edge weights of $\calO(n^2)$, on which Howard's algorithm requires $\Omega(n^2)$ iterations to find an optimal policy. Hence, the improved lower bound is as follows: given the input description of a DMDP with $I$ bits, the required number of iterations is $\widetilde{\Omega}(I)$. Table~\ref{table:results-sum} summarizes the results.
	
	\begin{table*}[t]
		\centering
		\caption{%
			Comparison of lower bounds for Howard's policy iteration. $|\Vertices|$, $|\Edges|$, and $W$ correspond to the number of vertices, number of edges, and maximum absolute weight, respectively.
		}
		\label{table:results-sum}
		\begin{tabular}{|c|c|c|c|c|c|}
			\hline
			  & $|\Vertices|$ & $|\Edges|$ & $W$ & Size & \# Iterations\\ 
                \hline
                \hline
                \citet{hansen2010lower} & $2n$ & $m$ & $\calO(n^{n^2})$ & $\calO(mn^2\log n)$ & $m-n+1$\\
                \hline
                Ours & $2n$ & $\calO(n^2)$ & $\calO(n^2)$ & $\calO(n^2\log n)$ & $\Omega(n^2)$\\
                \hline
		\end{tabular}
	\end{table*}
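
As a rough sanity check of the $\widetilde{\Omega}(I)$ bound, again assuming a binary encoding of the weights (an assumption made here for illustration), $\calO(n^2)$ edges with weights of magnitude $\calO(n^2)$ give
\[
  I \;=\; \calO\bigl(n^2 \cdot \log n^2\bigr) \;=\; \calO\bigl(n^2 \log n\bigr),
  \qquad\text{hence}\qquad
  n^2 \;=\; \Omega\!\left(\frac{I}{\log I}\right).
\]
The $\Omega(n^2)$ iterations are therefore linear in the input size up to a logarithmic factor.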

\paragraph{Significance.} Compared to the work of~\citet{hansen2010lower}, the significance of our result is twofold. First, with respect to the input size $I$, we improve the lower bound from $\widetilde{\Omega}(\sqrt{I})$ to $\widetilde{\Omega}(I)$. Second, there is an important implication with respect to the $\calO(n^3W)$ parametric bound by \citet{zwick1996complexity}. For the family of examples considered in~\citet{hansen2010lower}, the best-known upper bound is exponential, as the weights are exponential. Thus, the examples of~\citet{hansen2010lower} belong to a class where the upper bound is exponential and a sub-linear lower bound is presented. In contrast, since the weights are polynomial for our class of examples, the upper bound on the number of iterations is polynomial (namely, quadratic with respect to the input size) and we present an almost-linear lower bound.


\paragraph{Technical Novelty.}
In the examples of \citet{hansen2010lower}, Howard's policy iteration goes through $\Theta(n^2)$ ``good'' cycles and, due to the structure of the graph, only ever finds a slightly better cycle. The weight of each cycle is determined by a ``key'' edge with the largest weight. To achieve the lower bound for policy iteration, the construction requires the next key edge weight to differ from the previous one by a multiplicative factor of $n$ (or the cycle length, which can be up to $n$), leading to exponential edge weights. In contrast, our examples avoid this, inspired by the results of~\citet{Friedmann09,Fearnley10}, by using a concept similar to their ``deceleration lane''. Our DMDPs have only $n$ ``good'' cycles, which do not overlap edgewise. Thus, each cycle only needs to have a weight of $1$ more than the previous one. The ``deceleration lane'' technique forces the algorithm to perform $\Omega(i)$ iterations to find the $i$th cycle after having found the $(i-1)$th cycle. This structure ensures that Howard's algorithm (a)~finds the cycles in the right order and (b)~performs $\Omega(n^2)$ iterations to find an optimal policy.
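
The quadratic total can be seen from a short calculation. As a sketch, assume a constant $c > 0$ (a hypothetical constant introduced here for illustration) such that finding the $i$th cycle takes at least $c\,i$ iterations. Summing over all $n$ cycles gives
\[
  \sum_{i=1}^{n} c\, i \;=\; c\,\frac{n(n+1)}{2} \;=\; \Omega(n^2).
\]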



% have only $\Theta(n)$ cycles. Thus, the cycles do not require to overlap edgewise, and each cycle only needs to have a weight of $1$ more than the previous one. In addition to that, instead of $1$, it takes PI roughly $i$ iterations to find the $i$th cycle in the sequence after having found the $i-1$th cycle. This is because of the bottom row of vertices (comparable to the 'deceleration lane' from Fearnley and Friedmann) and ensures (a) that PI 'finds' the cycles in the right order and (b) roughly needs $1+...+n\approx n^2/2$ iterations to do so. Like in Fearnley/Friedmans deceleration lane, the algorithm does many incremental increases in potential before finding an increase in value. For that, the edge weight of the deceleration edges needs to be only 1 more than the average weight of the 'best' cycle. Finally, we need to ensure that with those deceleration lane edges, no cycle better than the 'best' cycle can be formed. Since such a cycle would have at most $\Theta(n)$ edges and at least one edge of small (say, zero) weight, it suffices to have the weight of our best cycle be $\omega(n)$, e.g., $\Theta(n^2)$, leading to $\Theta(n^2)$ edge weights. [I'll think about that last sentence more tomorrow to make sure it's true but it appears to me as if actually also $\Theta(n \log(n))$ weights might suffice for our example.]  }

% \ali{We can write an intuition paragraph in section 4 to show why our example takes $\Omega(n^2)$ iterations.} 

% \jakob{Are we adding a brief paper outline here?}\ali{Since the paper is not very long. I don't think it's necessary to write the outline.}

% It is noteworthy that the absolute weights are bounded by $\calO(n^2)$, which result in a DMDP size of $\calO(n^2 \log n)$. In contrast, the construction of \cite{hansen2010lower} with the same number of vertices and edges requires edge weights of $\calO(n^{2n})$. This difference highlights that our approach achieves significantly smaller weight magnitudes and a DMDP size, thereby improving upon existing lower bounds in the literature.

