\section{Remarks on the experiments involving the algorithm from \cite{YabeHSIKFK18}}
\label{secappendix: yabe-et-al-issues}
We mention a few issues that we faced while implementing \PI\ using the details from \cite{YabeHSIKFK18}, and how we resolved them.
$(a)$ In Step $3$ of Algorithm $1$ in \cite{YabeHSIKFK18} (which is a subroutine for \PI), they iterate over all possible assignments to the parents of each node. Consequently, the running time of the algorithm is exponential in the in-degree of the reward node $Y$, and it runs efficiently only when $Y$ has a small number of parents. \SRM\ does not face this issue. To compare both algorithms, we therefore created instances where the in-degree of $Y$ was small.
$(b)$ Another issue concerns the inequality condition specified in Equation $4$ of \cite{YabeHSIKFK18}. We observe that this inequality is trivially satisfied unless the time period becomes very large (of the order of $10^{10}$ or more), even for their experiments given in Section $5$ of \cite{YabeHSIKFK18}. Since running the algorithms for such a long time period is not feasible, we run both algorithms until we see clear convergence of \SRM.
$(c)$ A third problem was in setting the time period range for our experiments. They use $T\in \{C, 2C, \ldots,9C\}$, but in Step $3$ of Algorithm $1$ and Step $4$ of Algorithm $2$ in \cite{YabeHSIKFK18}, they estimate probabilities using $T/(3C)$ samples. This would leave them with at most $3$ samples for each such estimation, which would give noisy and unreliable estimates. Instead of using this set of values for $T$, we use equally spaced points in a time range where we see clear convergence of \SRM.
$(d)$ Finally, \cite{YabeHSIKFK18} does not discuss how the optimization problem giving $\widehat{\eta}$ in Step $12$ of Algorithm $2$ is solved, and they use a fixed value for $\widehat{\eta}$ in their experiments. Since no technique is proposed to solve the optimization problem, we use the same fixed $\widehat{\eta}$ as they do.
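
To make the sample count in $(c)$ concrete, the following short calculation reads the sample size as $T$ divided by $3C$, which is our interpretation of the notation in \cite{YabeHSIKFK18}. For the time periods $T = kC$ with $k \in \{1, \ldots, 9\}$,
\[
\frac{T}{3C} \;=\; \frac{kC}{3C} \;=\; \frac{k}{3} \;\leq\; \frac{9}{3} \;=\; 3,
\]
so each probability is estimated from at most $3$ samples, and for $k \in \{1, 2\}$ the ratio is below $1$. Similarly, for $(a)$, if we assume purely for illustration that every variable is binary and that $Y$ has $d$ parents, then the subroutine enumerates
\[
2^{d}
\]
assignments to the parents of $Y$, for example $2^{20} = 1048576$ assignments when $d = 20$.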