\section{Extensions}\label{Section:Extensions}
We discuss several extensions of \Cref{thm:main-result}. 

\paragraph{Policy Initialization.}
The number of iterations that Howard's Policy Iteration algorithm performs depends on the choice of the initial policy $\policy_0$. Two natural choices are for each vertex to use (a)~the edge to its lowest-index neighbor or (b)~its highest-weight outgoing edge (breaking ties by vertex indices). Option (b) is the most common in the literature and was used by~\citet{Howard60}. While our lower-bound proof above uses option (a) to obtain the initial policy, our lower bound also holds for option (b). In particular, we let \[
    \policy_{0}(\vertex) \defas \arg\max_{\otherver\in\Edges(\vertex)} \weight(\vertex, \otherver) =
    \begin{cases}
        t_1 & \vertex = b_1\\
        b_1 & \text{otherwise}
    \end{cases}\,.
\]
One iteration of Howard's algorithm on $\policy_0$ yields $\policy_{1,2}$. From there, the algorithm proceeds as described above, iterating over the policies in the sequence shown in \Cref{eq:howard-seq} starting at $\policy_{1,2}$. Compared to using $\initpol_{1}$ as the initial policy, Howard's algorithm with initial policy $\policy_0$ performs only one fewer iteration. Thus, the number of iterations for this policy initialization is still $\Omega(n^2)$.
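The two initialization rules can be written down in a few lines. The following Python sketch is only an illustration: it assumes a graph given as a dictionary mapping each vertex to a list of (successor, integer weight) pairs, the names \texttt{init\_lowest\_index} and \texttt{init\_max\_weight} are hypothetical, and breaking ties in option (b) towards the lower-index successor is our assumption, since the text only states that ties are broken by vertex indices.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: vertex -> list of (successor, integer weight) pairs.
Graph = Dict[int, List[Tuple[int, int]]]


def init_lowest_index(graph: Graph) -> Dict[int, int]:
    """Option (a): every vertex picks its lowest-index successor."""
    return {v: min(u for u, _ in succ) for v, succ in graph.items()}


def init_max_weight(graph: Graph) -> Dict[int, int]:
    """Option (b): every vertex picks a highest-weight outgoing edge.

    Assumption: ties are broken towards the lower-index successor.
    Sorting key (-weight, successor) puts the heaviest edge first and,
    among equally heavy edges, the one with the smallest successor index.
    """
    return {
        v: min(succ, key=lambda edge: (-edge[1], edge[0]))[0]
        for v, succ in graph.items()
    }


if __name__ == "__main__":
    g = {0: [(1, 3), (2, 5)], 1: [(0, 1), (2, 1)], 2: [(0, 0)]}
    print(init_lowest_index(g))  # {0: 1, 1: 0, 2: 0}
    print(init_max_weight(g))    # {0: 2, 1: 0, 2: 0}
```

In the small example, vertex~1 has two outgoing edges of equal weight, so option (b) falls back to the tie-breaking rule and selects successor~0.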

\paragraph{Discounted-sum Objectives.} In discounted-sum objectives, every edge is assigned an integer weight and the payoff is the discounted sum of these weights. Although \Cref{thm:main-result} is stated for mean-payoff objectives, it extends to discounted-sum objectives with a discount factor sufficiently close to 1 as a function of $n$: as the discount factor approaches 1, the discounted sum, multiplied by one minus the discount factor, converges to the mean-payoff value. Furthermore, by Blackwell optimality~\citep{Blackwell62}, some policy is optimal for all discount factors sufficiently close to 1, and every such policy is also optimal for the mean-payoff objective.
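The limit used in this argument can be checked numerically for a single positional policy. In a deterministic graph, following a positional policy from a start vertex traces a lasso, that is, a finite prefix followed by a cycle repeated forever, so the discounted sum has a closed form. The Python sketch below (the function names and the example graph are hypothetical) computes this closed form and shows that the discounted sum, scaled by one minus the discount factor, approaches the mean weight of the cycle. It illustrates only this limit, not Blackwell optimality itself.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: (v, u) -> integer weight of edge v -> u.
Weights = Dict[Tuple[int, int], int]


def lasso(policy: Dict[int, int], weight: Weights, start: int) -> Tuple[List[int], List[int]]:
    """Follow a positional policy from start; return (prefix weights, cycle weights)."""
    order: List[int] = []
    seen: Dict[int, int] = {}
    v = start
    while v not in seen:  # walk until a vertex repeats
        seen[v] = len(order)
        order.append(v)
        v = policy[v]
    k = seen[v]  # position at which the cycle starts
    ws = [weight[(x, policy[x])] for x in order]
    return ws[:k], ws[k:]


def mean_payoff(policy: Dict[int, int], weight: Weights, start: int) -> float:
    """Mean payoff of the play: the average weight of its cycle."""
    _, cycle = lasso(policy, weight, start)
    return sum(cycle) / len(cycle)


def discounted(policy: Dict[int, int], weight: Weights, start: int, lam: float) -> float:
    """Exact value of sum_i lam**i * w_i along the lasso (geometric series over the cycle)."""
    prefix, cycle = lasso(policy, weight, start)
    head = sum(lam**i * w for i, w in enumerate(prefix))
    loop = sum(lam**j * w for j, w in enumerate(cycle))
    return head + lam ** len(prefix) * loop / (1 - lam ** len(cycle))


if __name__ == "__main__":
    policy = {0: 1, 1: 2, 2: 1}
    weight = {(0, 1): 10, (1, 2): 1, (2, 1): 3}
    print(mean_payoff(policy, weight, 0))  # 2.0
    for lam in (0.9, 0.99, 0.999):
        # The normalized discounted value gets closer to 2.0 as lam -> 1.
        print(lam, (1 - lam) * discounted(policy, weight, 0, lam))
```

In the example, the heavy prefix edge of weight 10 dominates for small discount factors, and its influence on the normalized value vanishes as the discount factor approaches 1.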

It is an open question whether our techniques can be extended to obtain a similar lower bound for discounted-sum objectives with a constant discount factor. In this setting, a lower bound of $\Omega(n^2)$ iterations would match, up to a factor of $\log n$, the upper bound of~\citet{hansen2013strategy}, which applies not only to DMDPs but also to stochastic MDPs and to the 2-player setting.
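To experiment with the constant-discount setting on small instances, Howard's algorithm can be run directly on a discounted-sum objective. The following Python sketch is a minimal illustration under our own assumptions, not the construction from our lower bound: it evaluates a positional policy exactly with rational arithmetic (each play is a prefix followed by a repeated cycle), then switches every vertex that has a strictly improving edge to a best edge, breaking ties by the order of the successor list, and counts iterations until no vertex improves. The graph encoding and the names \texttt{evaluate} and \texttt{howard} are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

# Hypothetical encoding: vertex -> list of (successor, integer weight) pairs.
Graph = Dict[int, List[Tuple[int, int]]]


def evaluate(graph: Graph, policy: Dict[int, int], lam: Fraction) -> Dict[int, Fraction]:
    """Exact discounted value of every vertex under a positional policy."""
    w = {(v, u): c for v, succ in graph.items() for u, c in succ}
    values: Dict[int, Fraction] = {}
    for start in graph:
        order, seen, v = [], {}, start
        while v not in seen:  # trace the lasso from start
            seen[v] = len(order)
            order.append(v)
            v = policy[v]
        k = seen[v]  # cycle starts at position k
        ws = [w[(x, policy[x])] for x in order]
        head = sum((lam**i * c for i, c in enumerate(ws[:k])), Fraction(0))
        loop = sum((lam**j * c for j, c in enumerate(ws[k:])), Fraction(0))
        values[start] = head + lam**k * loop / (1 - lam ** (len(ws) - k))
    return values


def howard(graph: Graph, policy: Dict[int, int], lam: Fraction) -> Tuple[Dict[int, int], int]:
    """Howard's policy iteration: all improvable vertices switch simultaneously."""
    iterations = 0
    while True:
        val = evaluate(graph, policy, lam)
        new_policy = dict(policy)
        for v, succ in graph.items():
            # The current edge has value val[v]; keep it unless strictly beaten.
            best_u, best_q = policy[v], val[v]
            for u, c in succ:
                q = c + lam * val[u]
                if q > best_q:
                    best_u, best_q = u, q
            new_policy[v] = best_u
        if new_policy == policy:  # no strict improvement: policy is optimal
            return policy, iterations
        policy, iterations = new_policy, iterations + 1


if __name__ == "__main__":
    g = {0: [(1, 0), (0, 1)], 1: [(0, 3), (1, 0)]}
    print(howard(g, {0: 0, 1: 1}, Fraction(1, 2)))  # ({0: 0, 1: 0}, 1)
```

Exact \texttt{Fraction} arithmetic avoids spurious switches caused by floating-point ties between edges of equal value, at the cost of speed.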

