\documentclass[11pt]{article}

% Core packages
\usepackage[margin=1in]{geometry}
\usepackage{microtype}
\usepackage{amsmath,amssymb,amsthm,mathtools}
\usepackage{bm}
\usepackage{dsfont}
\usepackage{enumitem}
\usepackage[colorlinks=true,linkcolor=blue,citecolor=blue,urlcolor=blue]{hyperref}
\usepackage[capitalize,noabbrev]{cleveref}

% Theorems
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

% Notation
\DeclareMathOperator{\KL}{KL}
\newcommand{\E}{\mathbb{E}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\Prb}{\mathbb{P}}
\newcommand{\Var}{\mathrm{Var}}
\newcommand{\Cov}{\mathrm{Cov}}
\newcommand{\Wtwo}{W_2}
\newcommand{\Wc}{W_c}
\newcommand{\nablax}{\nabla_{\mathbf{a}}}
\newcommand{\s}{\mathbf{s}}
\newcommand{\aVec}{\mathbf{a}}
\newcommand{\bVec}{\mathbf{b}}
\newcommand{\thv}{\bm{\theta}}
\newcommand{\pith}{\pi_{\thv}}

\title{WPO Theory Notes: Extensions and Improvements}
\author{Research Notes toward an Improved WPO Paper}
\date{\today}

\begin{document}
\maketitle

\section*{Expert Pass (Plain Text Notes)}
\begin{verbatim}
# Big picture (what you have & what to amplify)

* You’re doing “policy transport via Wasserstein flow + Fisher projection → natural step with a mixed derivative cross-term,” plus invariances, a clean Gaussian/SPD treatment, and a principled c-Wasserstein stabilizer.
* The most valuable angle is the projection-based framework: when it matches NPG (exponential families with affine-in-action stats), when it departs (mixtures, squashing), and why c-Wasserstein is a principled shrinkage of action-gradients.

# Top changes to make before you write anything else

1. Own the projection choice: say plainly you evolve with a W2 flow but project in the Fisher/KL inner product because it yields a covariant, estimable natural step on the parametric manifold. Briefly note what a W2-parametric projection would look like and when the two coincide (affine-in-action exponential families).
2. Handle the state weighting cleanly: carry the state weighting consistently through the PDE and energy identity; explicitly say you absorb it via per-state time reparametrization and you use the same weighting inside the projection metric, so the direction is unchanged.
3. Move the weak-form and boundary assumptions forward (where you first integrate by parts). That removes 90% of technical nitpicks.
4. Separate “equivalence” from “departure”: one short subsection where you (i) list conditions for equivalence to NPG and (ii) give one crisp non-Gaussian counterexample where directions genuinely differ (e.g., a simple two-component mixture).
5. Make c-Wasserstein concrete: name the primal penalty, state the monotonicity property you rely on, and spell the exact energy inequality you get in the frozen-critic regime.
6. Add a deterministic-limit note: as covariance → 0 for Gaussians, you recover deterministic policy gradients for the mean; covariance step vanishes.
7. Tighten the full-covariance block: explicitly state the symmetry you’re targeting (and the SPD-preserving line-search/backtracking rule for Cholesky).
8. Notation hygiene: pick one state-weight symbol; fix “Gâteaux”; keep boldface/roman consistent for vectors vs. scalars.

# Section-by-section guidance (what to flesh out)

Abstract/Intro

* Recast contributions as: (i) projection-based transport view producing a covariant natural step with a mixed derivative cross-term; (ii) equivalence to NPG under clear conditions and principled deviations otherwise; (iii) full-covariance SPD-preserving implementation; (iv) c-Wasserstein stability with a precise conjugate map; (v) fully spelled assumptions/weak-form.

Background

* Insert a compact, explicit “weak form & boundary” paragraph (continuity equation in weak form; no-flux/vanishing-tails), and define the L²(π) inner product for vector fields you use in energy identities.

Derivation (W2 → parametric update)

* Right after writing the flow, immediately state how state weighting affects the velocity and energy, and that you match the same weighting in the projection inner product.
* When you integrate by parts to get the cross-term, reference the weak-form assumptions inline (not only later).

Projection metric vs. W2

* Keep your Fisher projection as the primary path. Add two sentences on W2-parametric projection (via velocity potentials) and a one-liner on when Fisher and W2 projections coincide (affine-in-action exponential families). Note that elsewhere they differ by a positive scalar/operator and that Fisher is preferred for statistical reasons (covariance, simple estimators).

Energy dissipation

* State the identity with the correct state-weight scaling; then remark that after per-state time rescaling the expression reduces to the unweighted form people expect. Clarify: all of this is in the frozen-critic regime.

Gaussian updates

* Keep the mean/variance formulas as you have them. For full covariance:

  * Say exactly what shape each object has (mean vector, M matrix, symmetric target).
  * Add one sentence on complexity (O(d³)) and note that diagonal updates are O(d).

Cholesky (SPD preservation)

* Be explicit that the target increment is the symmetric part (and you’re realizing it exactly via the triangular Sylvester equation).
* Write one practical rule: use backtracking on the step size to keep the Cholesky diagonals positive; optionally mention a tiny floor on diagonals to avoid degeneracy.

c-Wasserstein

* Add the primal penalty name and the fact the conjugate gradient map is direction-preserving and monotone (hence the clean energy inequality).
* If you can spare two sentences, compare to gradient clipping: your map preserves variational structure and energy decay in the frozen-critic setting.

Main results

* Keep them, but add a small “scope” note: results hold with fixed critic/state weighting; when the critic and occupancy evolve, an extra residual term appears.

Theoretical discussion

* Explicitly list a non-Gaussian departure case (mixtures or tanh-squashed Gaussians). Say, in one sentence, that the mixed derivative term encodes policy-manifold curvature and can rotate the step relative to standard NPG.

Related work

* Add two quick pointers: (1) Stein variational policy gradients (score-based transport) and how your transport differs; (2) natural gradient on SPD manifolds (affine-invariant metric) and practical second-order approximations (e.g., K-FAC) to signal you know the literature landscape.

Limitations

* Make explicit: frozen-critic reliance for energy identities; lack of fully coupled guarantees; no analysis of estimator variance for the cross-term; W2 vs. Fisher projection gap beyond the exponential-family case.

# Minimal appendix blueprint (what “fully written” should include)

* Weak form & boundary conditions: precise statement and the L²(π) inner product definition for vector fields.
* Energy dissipation proofs: one for W2, one for c-Wasserstein, both in the frozen-critic regime, both referencing monotonicity (for c*).
* Fisher–Galerkin normal equations: derive orthogonality → normal equations cleanly; state the regularity you use to commute derivatives.
* Baseline invariance via constrained variation: short Lagrangian argument with per-state multiplier.
* Parameterization covariance: pullback Fisher and how the cross-term transforms; show the step transforms as a tensor.
* Gaussian blocks: explicit Fisher entries (1D and multivariate structure), cross-term components, and the induced steps.
* Cholesky SPD proof: linearization Σ = L Lᵀ, triangular Sylvester solve, and the simple condition that ensures positive diagonals after a step.
* Deterministic limit: continuity argument showing mean update → DPG direction, covariance update → 0 as Σ → 0.
* Fisher vs. W2 projections: short note on the W2-parametric projection via velocity potentials and a lemma for coincidence under affine-in-action stats; state the qualitative deviation otherwise.
* Estimator notes: how to compute the cross-term with autodiff (mixed partial/JVP), variance considerations, and control variates.
* Assumptions recap: smoothness, Lipschitz of ∇aQ, boundary behavior, semi-gradient scope.

# Practical implementation notes worth adding (one-liners)

* Cross-term estimation: use reparameterization and automatic differentiation to get ∇θ∇a log π, pair with a critic gradient; mention variance control (e.g., baseline-aligned control variates).
* Step-size: for Cholesky, backtrack until diagonals remain positive; for c-Wasserstein, you can tolerate larger steps due to the shrinkage map.
* Complexity: mean update O(d); full covariance O(d³); diagonal covariance O(d).

# If you add one small figure (optional but persuasive)

* Toy 2D ill-conditioned quadratic critic surface: compare one-step decrease vs. step size for PG, NPG, and your c-Wasserstein step. It visually motivates the stability claim.

---

If you implement the eight top changes and fill the appendix with the blueprint above, the paper will read as a confident, well-scoped expert treatment of “policy transport via projection,” rather than a tentative revisit.
\end{verbatim}

\section{Objectives and Checklist}
We delineate a mathematically rigorous plan to extend and clarify the theory in the WPO paper, keeping compute constraints out of the theoretical scope.
\begin{itemize}[leftmargin=*]
  \item Make assumptions explicit (regularity, boundary conditions) for all derivations.
  \item Formalize parameterization-(in)dependence of the WPO update; provide sufficient conditions and counterexamples.
  \item Justify c-Wasserstein ``squashing'' as a principled modification of the velocity field.
  \item Analyze the variance of sampled updates vs. classic policy gradient (PG); highlight regimes with variance advantages.
  \item Derive updates for non-Gaussian policies (mixtures; bounded/exponential families) consistent with WPO.
  \item Preserve exact cross-references to the main paper via labels (e.g., \texttt{eqn:wasserstein\_gradient\_flow}).
\end{itemize}

\section{Notation and Standing Assumptions}
We work on measurable spaces where actions $\aVec\in\R^n$ and states $\s\in\R^m$. Policies $\pi(\aVec\mid\s)$ are densities w.r.t. Lebesgue measure, sufficiently smooth in $\aVec$ and in parameters $\thv$.
\begin{assumption}[Regularity]\label{ass:regularity}
For any admissible policy $\pi$ and value functional $\mathcal{J}[\pi]$, the following hold:
\begin{enumerate}[label=(\roman*),leftmargin=*]
  \item The functional derivative $\delta\mathcal{J}/\delta\pi$ exists and is continuously differentiable in $\aVec$.
  \item For each fixed $\s$, $\pi(\cdot\mid\s)$ and $\nablax\frac{\delta\mathcal{J}}{\delta\pi}(\s,\cdot)$ have sufficient decay at infinity so that boundary terms vanish under integration by parts on $\R^n$.
  \item Interchanges of expectations, derivatives, and integrals are justified by dominated convergence (uniform integrable bounds exist).
  \item The parametric family $\pith$ is $\mathcal{C}^2$ in $\thv$ and $\aVec$, and its Fisher information $\mathcal{F}_{\thv\thv}$ is positive definite on the tangent space.
\end{enumerate}
\end{assumption}
\paragraph{Expectation convention.} Unless stated otherwise, expectations are over $\s\sim d^{\pi}$ and $\aVec\sim \pith(\cdot\mid\s)$. We write $\E_{\s,\aVec}[\cdot]$ for the expectation under this joint measure.

\paragraph{Frozen-critic scope.} Descent statements below hold for the proxy functional with $Q^{\pi}$ and $d^{\pi}$ treated as fixed during the inner flow/projection step (semi-gradient setting).

\section{Core Identities (for reference)}
The WPO paper’s key expressions (labels in parentheses) are recalled for context:
\begin{align}
  \frac{\partial\pi}{\partial t}
    &= -\nablax\cdot\Big(\pi\,\big(-\nablax\,\frac{\delta\mathcal{J}}{\delta\pi}\big)\Big)
    &&\text{(Wasserstein gradient flow, \texttt{eqn:wasserstein\_gradient\_flow})}
\end{align}
The static and dynamic characterizations of $\Wtwo$ appear as (\texttt{eqn:wasserstein\_distance}, \texttt{eqn:wasserstein\_distance\_dynamic}), respectively.
For RL, under a per-state time reparameterization and the frozen-critic convention, we adopt the rescaled functional derivative (cf. \texttt{eqn:functional\_derivative}):
\begin{equation}
  \frac{\delta\mathcal{J}}{\delta\pi}(\s,\aVec) 
  = -\,Q^{\pi}(\s,\aVec),\qquad \text{with } d^{\pi}(\s) \text{ carried only as an outer expectation.}
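\end{equation}
With this sign convention the flow velocity is $-\nablax\,\tfrac{\delta\mathcal{J}}{\delta\pi}=\nablax Q^{\pi}$, so probability mass moves toward actions with higher value.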

\section{Parametric Projection and the WPO Update}
Let $\pith$ be a parametric policy. Minimizing the local KL between the flow step and $\pith$ under a quadratic (Fisher) approximation yields the joint quadratic form with blocks $\mathcal{F}_{tt}$, $\mathcal{F}_{t\thv}$, and $\mathcal{F}_{\thv\thv}$ (cf. \texttt{eqn:fim\_block}). Plugging the flow into $\mathcal{F}_{t\thv}$ and integrating by parts (cf. \texttt{eqn:fttheta\_derivation}) gives
\begin{equation}
  \mathcal{F}_{t\thv} 
   = \E_{\aVec\sim\pi}\big[\nabla_{\thv}\nablax\log\pith(\aVec\mid\s)\,\nablax Q^{\pi}(\s,\aVec)\big].
\end{equation}
The (idealized) WPO update becomes
\begin{equation}\label{eq:pureWPO}
  \thv \leftarrow \thv + \mathcal{F}_{\thv\thv}^{-1} \E_{\pi}[\nabla_{\thv}\nablax\log\pith\,\nablax Q^{\pi}],\quad\text{(cf. \texttt{eqn:wpo\_pure\_update})}
\end{equation}
For Gaussian policies with full covariance $\Sigma$, the covariance update simplifies to $\Delta\Sigma = M+M^\top = 2\,\mathrm{sym}(M)$ with $M=\E[(\nablax Q)(\aVec-\mu)^\top]$.
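
\paragraph{Worked check (1D Gaussian).} As a sanity check of \eqref{eq:pureWPO}, consider a single state and $\pith=\mathcal{N}(\mu,\sigma^2)$ with $\thv=(\mu,\sigma)$; this direct parameterization is an assumption made here for illustration. Then
\begin{align*}
  \partial_a\log\pith &= -\frac{a-\mu}{\sigma^2}, &
  \partial_\mu\partial_a\log\pith &= \frac{1}{\sigma^2}, &
  \partial_\sigma\partial_a\log\pith &= \frac{2(a-\mu)}{\sigma^3},
\end{align*}
and the Fisher matrix is $\mathcal{F}_{\thv\thv}=\mathrm{diag}(\sigma^{-2},\,2\sigma^{-2})$. Substituting into \eqref{eq:pureWPO} gives
\begin{align*}
  \Delta\mu &= \sigma^2\cdot\frac{1}{\sigma^2}\,\E[\partial_a Q] = \E[\partial_a Q],\\
  \Delta\sigma &= \frac{\sigma^2}{2}\cdot\frac{2}{\sigma^3}\,\E[(a-\mu)\,\partial_a Q] = \frac{1}{\sigma}\,\E[(a-\mu)\,\partial_a Q].
\end{align*}
To first order, $\Delta(\sigma^2)=2\sigma\,\Delta\sigma=2\,\E[(a-\mu)\,\partial_a Q]$, which is the scalar case of $\Delta\Sigma=M+M^\top$.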

\begin{lemma}[Integration-by-parts identity]\label{lem:IBP_Ftth}
Under Assumption~\ref{ass:regularity},
\[
\mathcal{F}_{t\thv} = \int \nabla_{\thv}\log\pith(\aVec\mid\s)\,\partial_t \pith(\aVec\mid\s)\,d\aVec
\;=\; \E_{\aVec\sim\pi}\Big[\nabla_{\thv}\nablax\log\pith(\aVec\mid\s)\,\nablax Q^{\pi}(\s,\aVec)\Big].
\]
\end{lemma}
\begin{proof}
Insert the continuity equation for $\partial_t \pith$, expand the divergence, integrate by parts; boundary terms vanish by decay. The product rule recovers $\nabla_{\thv}\nablax\log\pith$.
\end{proof}
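\paragraph{Derivation in detail.} The steps behind Lemma~\ref{lem:IBP_Ftth} can be written out as follows. The sketch uses the frozen-critic velocity $\nablax Q^{\pi}$ and the convention, assumed here, that $\nabla_{\thv}\nablax\log\pith\in\R^{d\times n}$ has entries $\partial_{\theta_i}\partial_{a_j}\log\pith$:
\begin{align*}
  \mathcal{F}_{t\thv}
    &= \int \nabla_{\thv}\log\pith(\aVec\mid\s)\,\partial_t\pith(\aVec\mid\s)\,d\aVec \\
    &= -\int \nabla_{\thv}\log\pith(\aVec\mid\s)\;\nablax\cdot\big(\pith(\aVec\mid\s)\,\nablax Q^{\pi}(\s,\aVec)\big)\,d\aVec \\
    &= \int \pith(\aVec\mid\s)\,\big(\nabla_{\thv}\nablax\log\pith(\aVec\mid\s)\big)\,\nablax Q^{\pi}(\s,\aVec)\,d\aVec .
\end{align*}
The second line substitutes the continuity equation. The third integrates by parts in $\aVec$, with the boundary term dropped by Assumption~\ref{ass:regularity}(ii); the last integral is the expectation stated in the lemma.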
\begin{remark}[Baselines]
Per-state baselines $b(\s)$ leave the PDE velocity and the projected cross term unchanged (baseline invariance). Action-dependent adjustments $b(\s,\aVec)$ are not baselines in this sense; they alter $\nablax Q$ and change both the PDE and the projected update.
\end{remark}

\begin{proposition}[Parametric projection]
Under the standing assumptions and a Fisher quadratic approximation of the local KL, the steepest-ascent direction in parameter space induced by the Wasserstein flow equals \eqref{eq:pureWPO}.
\end{proposition}
\begin{proof}[Sketch]
Form the quadratic KL approximation, compute $\mathcal{F}_{t\thv}$, integrate by parts to move $\nablax$ onto $\log\pith$, and apply the product rule. Boundary terms vanish by Assumption~\ref{ass:regularity}.
\end{proof}

\section{Parameterization Independence}
\begin{definition}[Parameterization independence]
An update $\Delta\thv$ is \emph{parameterization independent} if, for any smooth, invertible reparameterization $\eta=g(\thv)$ with Jacobian $J_g=\partial\eta/\partial\thv$, the induced update satisfies $\Delta\eta = J_g\,\Delta\thv$ to first order.
\end{definition}

\begin{theorem}[Invariance under reparameterization]\label{thm:natgrad_invariance}
Let $\thv\mapsto\eta=g(\thv)$ be a $\mathcal{C}^1$ diffeomorphism with Jacobian $J_g(\thv)\in\R^{d\times d}$. Denote the Fisher metric in $\thv$ by $\mathcal{F}_{\thv\thv}$ and in $\eta$ by $\mathcal{F}_{\eta\eta}$. If the update direction is the natural gradient
\[
\Delta\thv\;=\;\mathcal{F}_{\thv\thv}^{-1}\,u_{\thv},\qquad
u_{\thv}\;=\;\E_{\pi}\big[\nabla_{\thv}\nablax\log\pith\,\nablax Q^{\pi}\big],
\]
then the induced update in $\eta$ satisfies $\Delta\eta = J_g\,\Delta\thv$ and can be written as $\Delta\eta = \mathcal{F}_{\eta\eta}^{-1} u_{\eta}$ with $u_{\eta}=\E[\nabla_{\eta}\nablax\log\pi_{\eta}\,\nablax Q^{\pi}]$. In particular, the direction is parameterization independent.
\end{theorem}
\begin{proof}
Since $\eta=g(\thv)$, the chain rule gives $\nabla_{\eta} = J_g^{-\top}\,\nabla_{\thv}$. The tensorial transformation law for the Fisher information (Amari's information geometry) therefore reads $\mathcal{F}_{\eta\eta} = J_g^{-\top}\,\mathcal{F}_{\thv\thv}\,J_g^{-1}$, so that $\mathcal{F}_{\eta\eta}^{-1} = J_g\,\mathcal{F}_{\thv\thv}^{-1}\,J_g^{\top}$. The same chain rule gives $\nabla_{\eta}\nablax\log\pi_{\eta} = J_g^{-\top}\,\nabla_{\thv}\nablax\log\pith$. Hence
\[
u_{\eta} = \E[\nabla_{\eta}\nablax\log\pi_{\eta}\,\nablax Q] = J_g^{-\top} u_{\thv}.
\]
Therefore $\mathcal{F}_{\eta\eta}^{-1} u_{\eta} = (J_g\,\mathcal{F}_{\thv\thv}^{-1} J_g^{\top})(J_g^{-\top} u_{\thv}) = J_g\,(\mathcal{F}_{\thv\thv}^{-1} u_{\thv}) = J_g\,\Delta\thv$. The claim follows.
\end{proof}
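\paragraph{Worked check (log-scale parameterization).} To illustrate Theorem~\ref{thm:natgrad_invariance}, take the scale $\sigma$ of a 1D Gaussian $\mathcal{N}(\mu,\sigma^2)$ and the map $\eta=g(\sigma)=\log\sigma$, so that $J_g=1/\sigma$; this map is chosen here for illustration only. With $\partial_\sigma\partial_a\log\pi=2(a-\mu)/\sigma^3$ and $\mathcal{F}_{\sigma\sigma}=2/\sigma^2$, the update in $\sigma$ is $\Delta\sigma=\sigma^{-1}\,\E[(a-\mu)\,\partial_a Q]$. In the new coordinate $\partial_\eta=\sigma\,\partial_\sigma$, hence
\[
\mathcal{F}_{\eta\eta}=\sigma^2\,\mathcal{F}_{\sigma\sigma}=2,\qquad
u_{\eta}=\sigma\,u_{\sigma}=\frac{2}{\sigma^2}\,\E[(a-\mu)\,\partial_a Q],\qquad
\Delta\eta=\mathcal{F}_{\eta\eta}^{-1}u_{\eta}=\frac{1}{\sigma^2}\,\E[(a-\mu)\,\partial_a Q]=J_g\,\Delta\sigma .
\]
The two coordinate systems therefore describe the same first-order change of the policy.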

\begin{proposition}[Critic gradient error and update bias]\label{prop:critic_bias}
Let $\widehat{Q}$ be a critic with gradient error $\varepsilon(\s,\aVec)=\nablax \widehat{Q}(\s,\aVec)-\nablax Q^{\pi}(\s,\aVec)$. Then
\[
\big\|\Delta\thv(\widehat{Q})-\Delta\thv(Q^{\pi})\big\| \;\le\; \big\|\mathcal{F}_{\thv\thv}^{-1}\big\|\,\E_{\pi}\Big[\big\|\nabla_{\thv}\nablax\log\pith\big\|\,\big\|\varepsilon\big\|\Big].
\]
\end{proposition}
\begin{proof}
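The update \eqref{eq:pureWPO} is linear in the critic gradient, so $\Delta\thv(\widehat{Q})-\Delta\thv(Q^{\pi})=\mathcal{F}_{\thv\thv}^{-1}\,\E_{\pi}\big[\nabla_{\thv}\nablax\log\pith\,\varepsilon\big]$. The bound follows from submultiplicativity of the operator norm and Jensen's inequality, $\|\E[X]\|\le\E\|X\|$.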
\end{proof}

\begin{remark}[Counterexample]
For an exponential policy on $\R_+$ with scale $\beta$, a reparameterized SVG(0) update generally differs from the WPO update (cf. \texttt{eqn:exp\_policy\_wpo}, \texttt{eqn:exp\_policy\_svg0}); see the paper’s example.
\end{remark}
\section{Fisher vs. $\Wtwo$ Parametric Projection}
\begin{proposition}[When Fisher and $\Wtwo$ projections agree]\label{prop:fisherW2}
Fix $\s$ and a parametric family whose log-density lies in an exponential family with sufficient statistics affine in $\aVec$, under a parameterization for which $\nabla_{\thv}\nablax\log\pi_{\thv}(\aVec\mid\s)$ spans the same directions as $\nablax\log\pi_{\thv}(\aVec\mid\s)$. If the Wasserstein velocity $\nablax Q^{\pi}(\s,\aVec)$ lies in the closure of this span, then the Fisher and $\Wtwo$ parametric projections of the density flow are equal (collinear) for that $\s$. Otherwise, they differ by a positive self-adjoint operator on the parametric tangent space.
\end{proposition}
\begin{proof}[Proof sketch]
Under the stated affine-in-$\aVec$ condition, the per-state $\Wtwo$ tangent coincides with the image of the parametric tangent under a velocity-potential map. The Fisher and $\Wtwo$ projectors reduce to orthogonal projectors in two inner products on the same finite-dimensional subspace, hence coincide when the velocity lies in the subspace and otherwise differ by a positive operator.
\end{proof}

\section{c-Wasserstein Gradient Flows and Squashing}
Let $c:\R^n\to\R$ be convex, and define the c-Wasserstein distance $\Wc$ (cf. \texttt{eqn:c-wasserstein\_distance}). The associated gradient flow reads (\texttt{eqn:c-wasserstein\_gradient\_flow}):
\begin{equation}
  \frac{\partial\pi}{\partial t} 
  = -\nablax\cdot\Big(\pi\,\nabla c^*\big(-\nablax\,\frac{\delta\mathcal{J}}{\delta\pi}\big)\Big),
\end{equation}
where $c^*$ is the convex conjugate of $c$.

\begin{proposition}[Principled squashing]\label{prop:cW_squash}
Let $c$ be a convex function with $\nabla c^*$ globally Lipschitz. The parametric projection of the c-Wasserstein flow yields the natural-gradient update
\[
\Delta\thv \;=\; \mathcal{F}_{\thv\thv}^{-1}\,\E_{\pi}\big[\nabla_{\thv}\nablax\log\pith\,\nabla c^*(\nablax Q^{\pi})\big],
\]
which reduces to \eqref{eq:pureWPO} when $c(\cdot)=\tfrac{1}{2}\|\cdot\|_2^2$.
\end{proposition}
\begin{proof}[Sketch]
Replace the velocity field $v=-\nablax\,\tfrac{\delta\mathcal{J}}{\delta\pi}$ by $\nabla c^*(v)$ in the continuity equation and repeat the Fisher quadratic projection. Lipschitz continuity of $\nabla c^*$ bounds the growth of the squashed velocity linearly in $\|v\|$, so the decay conditions of Assumption~\ref{ass:regularity} still make the boundary terms vanish in the integration by parts, as in the quadratic case.
\end{proof}

\begin{remark}[Design of squashing]
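Choosing $\nabla c^*(z)=\operatorname{sign}(z)\,|z|^{1/3}$ (the elementwise signed cube root) corresponds to a cost $c$ with quartic, hence superquadratic, growth; it damps large gradients while preserving monotonicity. This map is not Lipschitz at the origin, so Proposition~\ref{prop:cW_squash} covers it only after smoothing near zero (for instance, a linear segment on a small neighborhood of $0$). Other odd, monotone choices are possible and inherit a metric interpretation.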
\end{remark}
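\paragraph{Worked conjugate pair and energy inequality.} The cube-root example can be made explicit. The following is a sketch under the frozen-critic convention; the separable quartic cost is an assumption chosen to match the remark. For $c(x)=\tfrac14\sum_i x_i^4$,
\[
c^*(z)=\sup_{x}\big\{\langle z,x\rangle-c(x)\big\}=\tfrac34\sum_i |z_i|^{4/3},\qquad
[\nabla c^*(z)]_i=\operatorname{sign}(z_i)\,|z_i|^{1/3},
\]
since the supremum is attained at $x_i^3=z_i$. Consequently $\langle z,\nabla c^*(z)\rangle=\sum_i |z_i|^{4/3}\ge 0$. Inserting the c-Wasserstein flow with $\tfrac{\delta\mathcal{J}}{\delta\pi}=-Q^{\pi}$ and integrating by parts gives, for each fixed $\s$,
\[
\frac{d}{dt}\,\E_{\aVec\sim\pi}\big[Q^{\pi}\big]
=\int Q^{\pi}\,\partial_t\pi\,d\aVec
=\E_{\aVec\sim\pi}\big[\langle \nablax Q^{\pi},\,\nabla c^*(\nablax Q^{\pi})\rangle\big]
=\E_{\aVec\sim\pi}\Big[\sum_i \big|\partial_{a_i}Q^{\pi}\big|^{4/3}\Big]\;\ge\;0 .
\]
For $c=\tfrac12\|\cdot\|_2^2$ the same computation returns $\E\|\nablax Q^{\pi}\|_2^2$, the usual $\Wtwo$ dissipation rate.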

\section{Variance of Sampled Updates}
Mixed partials $\nabla_{\thv}\nablax\log\pi$ can be obtained efficiently via JVP/VJP in autodiff systems (one forward and one reverse pass). The critic action-gradient $\nablax Q$ is computed by differentiating the critic with respect to its action input. A score-aligned control variate subtracts $\lambda(\s)\,\nabla_{\thv}\log\pi(\aVec\mid\s)$ from the per-sample estimate $\widehat g_{\thv}$; the score has zero conditional mean, so the estimator stays unbiased. The variance-minimizing coefficient, applied componentwise, is
\begin{equation}\label{eq:cv_opt_notes}
  \lambda^{\star}(\s) = \frac{\mathrm{Cov}(\nabla_{\thv}\log\pi,\, \widehat g_{\thv}\,|\,\s)}{\Var(\nabla_{\thv}\log\pi\,|\,\s)},\qquad \widehat g_{\thv} = \nabla_{\thv}\nablax\log\pi\,\widehat{\nablax Q},
\end{equation}
which can be estimated online via running moments.
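
\paragraph{Why this coefficient.} The optimal coefficient follows from a one-line minimization. For a single component, write $X$ for the estimator entry and $Y$ for the matching score entry, with $\E[Y\mid\s]=0$; these scalar names are introduced here only for the derivation. Then
\[
\Var[X-\lambda Y\mid\s]=\Var[X\mid\s]-2\lambda\,\mathrm{Cov}(X,Y\mid\s)+\lambda^2\,\Var[Y\mid\s],
\]
which is minimized at $\lambda=\mathrm{Cov}(X,Y\mid\s)/\Var[Y\mid\s]$ and leaves the residual variance $\Var[X\mid\s]\,(1-\rho^2)$, where $\rho$ is the conditional correlation of $X$ and $Y$. The control variate therefore helps only to the extent that the score and the cross-term estimate are correlated.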

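For the Gaussian mean update (cf. \texttt{eqn:gaussian\_mean\_update}) the per-sample contribution is $\widehat g=\nabla_{\thv}\mu(\s)\,\nablax Q(\s,\aVec)$, with $\nabla_{\thv}\mu(\s)\in\R^{d\times n}$. By the law of total variance,
\begin{equation}
  \mathrm{Cov}[\widehat g] = \E_{\s}\big[\nabla_{\thv}\mu\;\mathrm{Cov}\big[\nablax Q(\s,\aVec)\mid\s\big]\;\nabla_{\thv}\mu^\top\big] + \mathrm{Cov}_{\s}\big[\nabla_{\thv}\mu\;\E[\nablax Q(\s,\aVec)\mid\s]\big],
\end{equation}
so action sampling enters only through the conditional covariance of $\nablax Q$.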
\begin{lemma}[Linear-$Q$ regime]\label{lem:var_linear}
If $Q(\s,\aVec)=w(\s)^\top \aVec$, then $\nablax Q(\s,\aVec)=w(\s)$ is action-independent and the Gaussian mean update has zero sampling variance for any fixed $\s$.
\end{lemma}
\begin{proof}
Each sample contributes $w(\s)\,\nabla_{\thv}\mu(\s)$ to the Monte Carlo estimator of $\Delta\thv$, so all per-sample contributions are identical for fixed $\s$ and their average has zero variance. When states are also sampled, only the between-state variability of $w(\s)\,\nabla_{\thv}\mu(\s)$ remains; the action-sampling contribution stays zero.
\end{proof}
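\paragraph{Comparison with the score-function estimator.} For contrast, here is a short calculation in the same linear-$Q$ setting. It assumes a 1D Gaussian $\mathcal{N}(\mu,\sigma^2)$, the direct parameterization $\thv=\mu$, $Q(a)=w\,a$, and no baseline; these choices are illustrative. Writing $a=\mu+\sigma\epsilon$ with $\epsilon\sim\mathcal{N}(0,1)$, the per-sample score-function (PG) estimate is
\[
Q(a)\,\partial_\mu\log\pi(a)=w\,(\mu+\sigma\epsilon)\,\frac{\epsilon}{\sigma}=w\Big(\frac{\mu}{\sigma}\,\epsilon+\epsilon^2\Big),
\]
with mean $w$ and variance
\[
w^2\Big(\frac{\mu^2}{\sigma^2}+2\Big),
\]
using $\Var[\epsilon]=1$, $\Var[\epsilon^2]=2$, and $\mathrm{Cov}(\epsilon,\epsilon^2)=0$. Subtracting the baseline $w\mu$ removes the first term and leaves $2w^2$. The WPO mean-update sample equals $w$ for every draw, so its action-sampling variance is zero, in line with Lemma~\ref{lem:var_linear}.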

\begin{proposition}[Quadratic-$Q$ case]\label{prop:var_quadratic}
Let $Q(\s,a)=b(\s)\,a-\tfrac{\kappa(\s)}{2}a^2$ in 1D, with a Gaussian policy $\mathcal{N}(\mu(\s),\sigma^2(\s))$. Then the per-state total variance (trace of the covariance) of the mean update equals
\[
\Var[\Delta\thv\mid\s]=\Var\big[(b-\kappa a)\,\nabla_{\thv}\mu\big]=\kappa^2\,\sigma^2\,\|\nabla_{\thv}\mu\|_2^2,
\]
which scales linearly with the action variance $\sigma^2(\s)$ and vanishes as $\sigma^2(\s)\to 0$.
\end{proposition}
\begin{proof}
$\E[a\mid\s]=\mu$, $\Var[a\mid\s]=\sigma^2$. Since $\nabla_{\thv}\mu$ is deterministic for fixed $\s$,
\(\Var[(b-\kappa a)\,\nabla_{\thv}\mu]=\Var[b-\kappa a]\,\|\nabla_{\thv}\mu\|^2=\kappa^2\Var[a]\,\|\nabla_{\thv}\mu\|^2=\kappa^2\sigma^2\,\|\nabla_{\thv}\mu\|^2.\)
\end{proof}

\section{Non-Gaussian Policies}
\subsection{Mixture of Gaussians}\label{subsec:mog}
Let $\pi(a)=\sum_i\rho_i\,\mathcal{N}(a\mid\mu_i,\sigma_i^2)$ in 1D, and let $\phi_i$ denote the component responsibilities for a sampled $a$. Then, omitting Fisher preconditioning details,
\begin{align}
  \Delta\mu_i &\propto \E\big[\phi_i(a)\,\partial_a Q(a)\big],\\
  \Delta\sigma_i &\propto \E\big[\phi_i(a)\,\frac{a-\mu_i}{\sigma_i}\,\partial_a Q(a)\big],\\
  \Delta\rho_i &\propto \E\big[\partial_a Q(a)\,\partial_a \log \pi(a)\,\partial_{\rho_i}\log \pi(a)\big].
\end{align}
Stability benefits from scaling heuristics inspired by the Gaussian Fisher (cf. main paper).

\paragraph{Details.} Responsibilities $\phi_i(a)=\frac{\rho_i\,\mathcal{N}(a\mid\mu_i,\sigma_i^2)}{\pi(a)}$. Gradients of the log mixture satisfy
\[
\partial_{\mu_i}\log\pi(a)=\phi_i(a)\,\frac{a-\mu_i}{\sigma_i^2},\qquad
\partial_{\sigma_i}\log\pi(a)=\phi_i(a)\,\Big(\frac{(a-\mu_i)^2}{\sigma_i^3}-\frac{1}{\sigma_i}\Big),
\]
and for weights either $\partial_{\rho_i}\log\pi(a)=\phi_i(a)/\rho_i$ (simplex with positivity constraint) or, using softmax parameters $w$, $\partial_{w_i}\log\pi(a)=\phi_i(a)-\rho_i$.
Diagonal-Fisher scaling analogous to the Gaussian case (using responsibilities) mitigates variance blow-up.
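
\paragraph{Mixed partial for a component mean.} The cross-term needs $\partial_a\partial_{\mu_i}\log\pi$, which can be computed from the gradients above; the reading of the mean update given at the end is an interpretation offered here, not a statement taken from the main paper. Since $\partial_a\log\pi(a)=-\sum_j\phi_j(a)\,\frac{a-\mu_j}{\sigma_j^2}$, the responsibilities satisfy
\[
\partial_a\phi_i(a)=\phi_i(a)\Big(\sum_j\phi_j(a)\,\frac{a-\mu_j}{\sigma_j^2}-\frac{a-\mu_i}{\sigma_i^2}\Big),
\]
and therefore
\[
\partial_a\partial_{\mu_i}\log\pi(a)=\frac{\phi_i(a)}{\sigma_i^2}+\frac{a-\mu_i}{\sigma_i^2}\,\partial_a\phi_i(a).
\]
The second term vanishes wherever $\phi_i(a)\in\{0,1\}$, that is, for well-separated components. Keeping only the first term and rescaling by $\sigma_i^2$ gives $\E[\phi_i(a)\,\partial_a Q(a)]$, the form of the $\Delta\mu_i$ update displayed above; for overlapping components the second term contributes an additional correction.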

\subsection{Bounded / Exponential Families}
\paragraph{Change-of-variable approach.}
For action constraints (e.g., box constraints), represent the action through an unconstrained base variable $z\in\R^n$ and a diffeomorphism $\tau:\R^n\to\mathcal{A}$ (e.g., tanh with scaling): set $a=\tau(z)$ and define $\pi(a\mid\s)=\pi_Z(z\mid\s)\,|\det \nabla\tau^{-1}(a)|$ with $z=\tau^{-1}(a)$. Then
\[
\nabla_a \log \pi(a\mid\s)= \big(\nabla_z\tau(z)\big)^{-\top}\,\nabla_z \log \pi_Z(z\mid\s) + \nabla_a \log|\det \nabla\tau^{-1}(a)|,
\]
and the WPO update can be computed in $z$-space and mapped back. Regularity of $\tau$ controls boundary terms.
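
\paragraph{Worked example (tanh in 1D).} As a concrete instance, assume $a=\tanh z$ with $\mathcal{A}=(-1,1)$ and no additional scaling; this specific squashing is an illustrative choice. Then $dz/da=1/(1-a^2)$, so $\log\pi(a\mid\s)=\log\pi_Z(z\mid\s)-\log(1-a^2)$, and differentiating directly gives
\[
\partial_a\log\pi(a\mid\s)=\frac{\partial_z\log\pi_Z(z\mid\s)}{1-a^2}+\frac{2a}{1-a^2}.
\]
Both terms carry a factor $(1-a^2)^{-1}$, which grows without bound as $|a|\to 1$, so the boundary terms in the integration by parts vanish only if $\pi_Z$ decays fast enough in $z$ to compensate.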

\begin{proposition}[Diagonal Fisher scaling for Gaussians]\label{prop:diag_fisher}
For Gaussian policies with diagonal covariance, redefining $\overline{\partial_{\mu_i}}\log\mathcal{N}=\sigma_i^2\,\partial_{\mu_i}\log\mathcal{N}$ and $\overline{\partial_{\sigma_i}}\log\mathcal{N}=\tfrac{1}{2}\sigma_i^2\,\partial_{\sigma_i}\log\mathcal{N}$ removes the $\sigma_i^{-2}$ blow-up in the WPO update as $\sigma_i\to 0$ while preserving the direction of ascent.
\end{proposition}
\begin{proof}[Idea]
Under this rescaling, the preconditioned update cancels the $\sigma^{-2}$ factors from $\nablax\log\pi$, keeping finite contributions in the small-variance limit; cf. the diagonal entries of the Gaussian Fisher.
\end{proof}

\section{Cross-References to Paper Labels}
Throughout, we use canonical labels from the WPO paper: \texttt{eqn:wasserstein\_gradient\_flow}, \texttt{eqn:wasserstein\_distance}, \texttt{eqn:wasserstein\_distance\_dynamic}, \texttt{eqn:cost\_to\_go\_derivative}, \texttt{eqn:fttheta\_derivation}, \texttt{eqn:wpo\_pure\_update}, \texttt{eqn:c-wasserstein\_distance}, \texttt{eqn:c-wasserstein\_gradient\_flow}, and others listed in the agent guide.
For a positive action with exponential policy $p(a\mid\beta)=\beta^{-1}\exp(-a/\beta)$, one finds a discrepancy between WPO and reparameterized SVG(0) (cf. \texttt{eqn:exp\_policy\_wpo}, \texttt{eqn:exp\_policy\_svg0}), showcasing parameterization effects when not using the natural metric.

\section{Implementation Heuristics (Theory-Aligned)}
\begin{itemize}[leftmargin=*]
  \item \textbf{Diagonal Fisher scaling:} cancel the blow-up of $\nablax\log\pi$ as variance shrinks (cf. Gaussian blocks \texttt{eqn:gaussian\_grads\_fisher}).
  \item \textbf{KL regularization:} trust-region behavior for stable training; compute KL gradient in the standard Fisher-Rao geometry.
  \item \textbf{Sampling:} off-policy states (from replay), on-policy actions (from current $\pi$) as in the paper.
  \item \textbf{c-Wasserstein squashing:} elementwise odd, monotone functions (e.g., cube root) interpreted via $\nabla c^*$.
\end{itemize}

\section{Experiments to Validate Claims (Mac-feasible)}
\begin{itemize}[leftmargin=*]
  \item \textbf{Variance demo:} linear-$Q$ toy shows near-zero variance for WPO mean update vs. PG.
  \item \textbf{Mixture dynamics:} 1D mixture exhibits qualitative differences vs. PG (cf. \texttt{fig:mog}).
  \item \textbf{Constrained actions (tanh):} tanh-squashed Gaussian experiments with diagonal Fisher preconditioning demonstrate stable $\sigma$ adaptation; see the \emph{Tanh Policy} evolution plots.
  \item \textbf{LQR-style dynamic:} simple one-step LQR surrogate with analytic $\nablax Q$ confirms stable behavior under constraints and agrees with the variance analysis (Lemma~\ref{lem:var_linear}, Prop.~\ref{prop:var_quadratic}).
  \item \textbf{Classic control:} small tasks (e.g., Pendulum) remain a future target once we introduce a learned critic; current results are analytic-toy consistent.
\end{itemize}

\section{Empirical Observations (Summary)}
\begin{itemize}[leftmargin=*]
  \item \textbf{Parameterization independence:} empirical invariance effects match Theorem~\ref{thm:natgrad_invariance} under smooth changes; constrained reparameterizations require care (cf. open problems).
  \item \textbf{c-Wasserstein squashing:} using cube-root squashing behaves as principled damping per Prop.~\ref{prop:cW_squash} and improves robustness to large $\nablax Q$.
  \item \textbf{Variance:} linear-$Q$ demo confirms near-zero variance for WPO vs. higher PG variance (Lemma~\ref{lem:var_linear}); quadratic trends align with Prop.~\ref{prop:var_quadratic}.
  \item \textbf{MoG vs PG:} mixture evolution differences are visible in means/variances; diagonal Fisher-like rescaling stabilizes training (\S\ref{subsec:mog}).
  \item \textbf{Constrained tanh:} diagonal Fisher preconditioning avoids $\sigma$ blow-up and supports steady adaptation; consistent with the scaling intuition in Prop.~\ref{prop:diag_fisher}.
\end{itemize}

\section{Open Problems and TODOs}
\begin{itemize}[leftmargin=*]
  \item Extend invariance results under constrained reparameterizations (manifolds with boundary).
  \item Quantify bias introduced by c-Wasserstein squashing for common $c$ (e.g., Huber/smooth-$\ell_p$) and its effect on convergence rates.
  \item Characterize landscape (critical points, stability) for mixture policies under WPO updates.
  \item Analyze partial observability and its impact on $\nablax Q$ estimation quality.
\end{itemize}

\end{document}
