\documentclass{article}
\usepackage{graphicx} % Required for inserting images

% Also
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors

\usepackage{comment}
\usepackage{bm}
\usepackage{mathtools}
\usepackage{amssymb}
\usepackage{enumitem} 
\usepackage{amsthm}
\usepackage{xfrac}
\usepackage{caption}
\usepackage{subcaption}
\newtheorem{theorem}{Theorem}
\theoremstyle{definition}
\newtheorem{definition}{Definition}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}[theorem]
\usepackage{breqn}

\newcommand{\SSigma}{\boldsymbol{\Sigma}}

\title{Differentiable?}

\begin{document}

\maketitle

\section{Convexity Revisited}

Are we certain that our optimization problem is non-convex?

Consider the task of computing $\theta_+$, i.e. the maximal ATE consistent with the data and our structural assumptions. 
First, we compute $\rho^* = \operatorname*{arg\,min}_\rho \lVert \bm{\gamma} \rVert_p$, which is guaranteed to exist and be unique for $p \geq 1$. This splits $\rho$ space into two intervals, $[-1, \rho^*)$ and $(\rho^*, 1]$, and we know that $\theta_+$ must live on the first. Recall that our computational sequence runs as follows:
\begin{align*}
    \rho \rightarrow \theta \rightarrow \lVert \bm{\gamma} \rVert_p \rightarrow \ell,
\end{align*}
where we define our loss function as $\ell := \big(\tau - \lVert \bm{\gamma} \rVert_p \big)^2$. 
This loss function follows from the fact that we aim to find the point at which $\tau = \lVert \bm{\gamma} \rVert_p$. 
If $\ell$ were convex in $\rho$, then we could solve the optimization problem by setting $\frac{\partial \ell}{\partial \rho} = 0$. 
By the chain rule, we have:
\begin{align*}
    \frac{\partial \ell}{\partial \rho} = \frac{\partial \ell}{\partial \lVert \bm{\gamma} \rVert_p} \times \frac{\partial \lVert \bm{\gamma} \rVert_p}{\partial \theta} \times \frac{\partial \theta}{\partial \rho}. 
\end{align*}
If each link in the chain is differentiable, then the partial derivative on the left-hand side exists. Working from right to left, we find that $\theta$ is quadratic in $\rho$ (recall our range restriction, so there is no sigmoid to contend with); $\lVert \bm{\gamma} \rVert_p$ is convex in $\theta$ for $p \geq 1$; and $\ell$ is quadratic in $\lVert \bm{\gamma} \rVert_p$. So... isn't the whole thing just convex? Am I missing something?
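
One observation that may bear on this question (a sketch, assuming $\lVert \bm{\gamma} \rVert_p$ is differentiable in $\rho$ at the points of interest): the first link is $\partial \ell / \partial \lVert \bm{\gamma} \rVert_p = -2 \big(\tau - \lVert \bm{\gamma} \rVert_p \big)$, so
\begin{align*}
    \frac{\partial \ell}{\partial \rho} = -2 \big(\tau - \lVert \bm{\gamma} \rVert_p \big) \times \frac{\partial \lVert \bm{\gamma} \rVert_p}{\partial \rho}.
\end{align*}
This vanishes where $\tau = \lVert \bm{\gamma} \rVert_p$ (the points we want), but also wherever $\partial \lVert \bm{\gamma} \rVert_p / \partial \rho = 0$, which includes $\rho^*$ if it lies in the interior of $[-1, 1]$. If $\tau$ exceeds the minimal norm, then $\ell > 0$ at $\rho^*$, so over the full range of $\rho$ the loss would have a stationary point that is not a global minimizer, which a differentiable convex function cannot have. This does not by itself settle the question on the restricted interval $[-1, \rho^*)$.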

Another way to think about the above convexity question: let's assume $\SSigma$ and $\tau$ are fixed. Then we can write $\theta = g(\rho)$ and $\lVert \bm{\gamma} \rVert_p = f(\theta)$, for some bijective functions $f, g$. Now the loss function can be rewritten $\ell = \Big(\tau - f\big(g( \rho )\big)\Big)^2$. If $f$ and $g$ are convex, then surely we can conclude that $\ell$ is convex in $\rho$. We know that $g$ is convex, since it is just quadratic (given the range restriction on $\rho$---recall that we are just looking for $\theta_+$). Then it only remains to show that $f$ is convex. I believe that for any norm with $p \geq 1$, this holds. Where precisely does the non-convexity live?
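
A toy example suggests where this argument can break (the functions here are illustrative stand-ins, not our actual $f$ and $g$). Composition preserves convexity when the outer function is convex and nondecreasing, but the outer map $u \mapsto (\tau - u)^2$ is decreasing for $u < \tau$. Take $g(\rho) = \rho^2$ and $f(\theta) = \theta$ on $\rho \in (0, 1)$, with $\tau = 1$. Both are convex and bijective there, yet
\begin{align*}
    \ell(\rho) = (1 - \rho^2)^2, \qquad \frac{\partial^2 \ell}{\partial \rho^2} = 12 \rho^2 - 4,
\end{align*}
which is negative for $\rho < 1/\sqrt{3}$, so $\ell$ is not convex on that range. If something similar happens in our problem, the non-convexity would enter through the squared outer loss rather than through $f$ or $g$.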

Note that if we knew the inverse function for $f$, this problem would be trivial. Our solution would just reduce to:
\begin{align*}
    \theta_+ = f^{-1}(\tau),
\end{align*}
which is guaranteed to be unique on the interval $[-1, \rho^*)$. Can we invert the function? Recall:
\begin{align*}
    \SSigma_{Zy} &= \SSigma_{Z} \bm{\gamma} + \SSigma_{Zx}\theta\\
    \SSigma_{Zx}\theta &= \SSigma_{Zy} - \SSigma_{Z} \bm{\gamma}\\
    \theta &= \SSigma_{Zx}^{+} (\SSigma_{Zy} - \SSigma_{Z} \bm{\gamma})
\end{align*}
where $\SSigma_{Zx}^{+} = (\SSigma_{Zx}^\top \SSigma_{Zx})^{-1} \SSigma_{Zx}^\top$ is the left pseudo-inverse, which reduces to an ordinary inverse only when $d_Z = 1$ (since $\SSigma_{Zx}$ is a $d_Z$-vector). But that doesn't really give us a map for $\lVert \bm{\gamma} \rVert_2 \mapsto \theta$...

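For what it's worth, since $\bm{\gamma} = \SSigma_{Z}^{-1} (\SSigma_{Zy} - \SSigma_{Zx}\theta)$, here's the derivative for $p = 2$:
\begin{align*}
    \frac{\partial \lVert \bm{\gamma} \rVert_2}{\partial \theta} = \frac{1}{2} \times \frac{\sum_{j} 2 \gamma_j \big[ -\SSigma_{Z}^{-1} \SSigma_{Zx} \big]_j}{\lVert \bm{\gamma} \rVert_2} = - \frac{\bm{\gamma}^\top \SSigma_{Z}^{-1} \SSigma_{Zx}}{\lVert \bm{\gamma} \rVert_2}
\end{align*}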

\section{Minimum Leakage}

Here's a crack for the one-d case:
\begin{align*}
    \rho^2 = \frac{- a \eta_x^2 - \sqrt{a^2 \eta_x^4 - 2 (b^2-a c) \eta_x^4}}{2 \eta_x^4}
\end{align*}
Which, exploiting $a = - \eta_x^2$, becomes:
\begin{align*}
    \rho^2 = \frac{a^2 - \sqrt{a^4 - 2 (b^2-a c) a^2}}{2 a^2}
\end{align*}
Or:
\begin{align*}
    \rho^2 = \frac{1}{2}\left(1 - \sqrt{1 - 2 \frac{b^2-ac}{a^2}}\right)
\end{align*}


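For $p \geq 1$, it turns out that $\lVert \bm{\gamma} \rVert_p$ is a differentiable function of $\rho$ (away from $\bm{\gamma} = \bm{0}$, and for $p = 1$ away from points where a component of $\bm{\gamma}$ vanishes). We can therefore compute the minimum possible information leakage by setting this derivative to zero. Let's take a look at the formula for general $p$; setting $p = 2$ gives the Euclidean case.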

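\begin{align*}
    \frac{\partial \lVert \bm{\gamma} \rVert_p}{\partial \rho} &= \frac{1}{p} \Big( \sum_{j=1}^{d_Z} |\gamma_j|^p \Big)^{(1/p - 1)} \times \\
    &\quad \sum_{j=1}^{d_Z} p |\gamma_j|^{(p-1)} ~\text{sgn}(\gamma_j) ~\Big[ \SSigma_{\mathbf{Z}}^{-1} \SSigma_{\mathbf{Zx}} \Big]_j ~\eta_x^2 \rho ~\Big( \frac{2 ~a13}{a7} - \frac{a11}{a13} \Big) \frac{\text{sgn}(\rho)}{a7 ~a5^2},
\end{align*}
where $a5$, $a7$, $a11$ and $a13$ are the scalars defined in the preliminaries below.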


Preliminaries:
\begin{align*}
    a1 &:= \SSigma_{xZ} \SSigma_{\mathbf Z}^{-1}\\
    a3 &:= a1 ~\SSigma_{Zx} - \sigma_x^2\\
    a4 &:= \eta_x^2\\
    a5 &:= a4 ~\rho^2\\
    a7 &:= a3/a5 + 1\\
    a8 &:= a1 ~\SSigma_{Zy}\\
    a11 &:= (a8 - \sigma_{xy})^2 - a3 ~(\SSigma_{yZ} \SSigma_{\mathbf Z}^{-1} \SSigma_{Zy} - \sigma^2_y)\\
    a13 &:= \sqrt{a7 ~a11}\\
    a14 &:= \text{sgn}(\rho)\\
    a20 &:= \SSigma_{\mathbf Z}^{-1} (\SSigma_{Zy} - \SSigma_{Zx} (a8 - (\sigma_{xy} + a14 ~a13/a7)) / a3)\\
    a21 &:= |a20|
\end{align*}

And now:
\begin{dmath*}
    \sum(a21^p)^{(1/p - 1)} \times \sum(p \times a21^{(p-1)} \times \text{sgn}(a20) \times \SSigma^{-1}_Z (a4 \times \rho \SSigma_{Zx} \times (2 \times (a13 / a7) - a11 / a13) \times a14 / (a7 \times a5^2))) / p
\end{dmath*}
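
A possible simplification of the scalar factor above (a sketch, assuming $\rho \neq 0$, $a7 \neq 0$, $a13 > 0$, and reading the outer sums as running over the $d_Z$ components). Since $a13^2 = a7 ~a11$, we have $a11/a13 = a13/a7$, and since $a5 = a4 ~\rho^2$ and $a14 = \text{sgn}(\rho)$:
\begin{align*}
    2 \times (a13/a7) - a11/a13 &= a13/a7\\
    a4 \times \rho \times (a13/a7) \times a14 / (a7 \times a5^2) &= \frac{a13}{a4 ~a7^2 ~|\rho|^3}
\end{align*}
If $a4 = \eta_x^2 > 0$, this factor is strictly positive and multiplies every term of the second sum equally, so it would not be what sends the derivative to zero. Under these assumptions the stationarity condition would reduce to
\begin{align*}
    0 = \sum_{j=1}^{d_Z} a21_j^{(p-1)} ~\text{sgn}(a20_j) ~\Big[ \SSigma_{\mathbf Z}^{-1} \SSigma_{Zx} \Big]_j,
\end{align*}
in which $\rho$ enters only through $a20$ and $a21$.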

For Wolfram, let's assign:
\begin{align*}
    a &:= a1\\
    b &:= a3\\
    c &:= a4\\
    d &:= a5\\
    f &:= a7\\
    g &:= a8\\
    h &:= a11\\
    j &:= a13\\
    k &:= a14\\
    m &:= a20\\
    q &:= a21\\
\end{align*}
Now we have:
\begin{dmath*}
    0 = \frac{1}{p} \times \sum(q^p)^{(1/p - 1)} \times \sum \bigg[p \times q^{(p-1)} \times \text{sgn}(m) \times \SSigma^{-1}_Z \Big( \eta_x^2 \times \rho \SSigma_{Zx} \times \big(2 (j / f) - h / j\big) \times k / (f d^2) \Big) \bigg]
\end{dmath*}
Note that $\rho$ appears in: $d, f, j, k, m, q$. We can drop the first factor (that's not what's making it zero!) and place $\rho$ back in everywhere it appears.
\begin{align*}
    m &:= \SSigma_{\mathbf Z}^{-1} \Big(\SSigma_{Zy} - \SSigma_{Zx} \Big(g - \Big(\sigma_{xy} + \text{sgn}(\rho) \sqrt{h \big(1 + b/(\eta_x^2 \rho^2)\big)} \,\big/\, \big(1 + b/(\eta_x^2 \rho^2)\big)\Big)\Big) / b\Big)
\end{align*}



\begin{dmath*}
    0 = \sum(q^p)^{(1/p - 1)} \times \sum \bigg[p \times q^{(p-1)} \times \text{sgn}(m) \times \SSigma^{-1}_Z \Big( \eta_x^2 \times \rho \SSigma_{Zx} \times \big(2 (j / f) - h / j\big) \times k / (f d^2) \Big) \bigg]
\end{dmath*}



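If this product is zero, then it must be the dependence on $\rho$ inside the second sum that's making it zero: the factor $\sum(q^p)^{(1/p - 1)}$ is strictly positive whenever $\bm{\gamma} \neq \bm{0}$. So we can drop it (along with the $1/p$ already removed) and just say: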
\begin{dmath*}
    0 = \sum \bigg[p \times q^{(p-1)} \times \text{sgn}(m) \times \SSigma^{-1}_Z \Big( \eta_x^2 \times \rho \SSigma_{Zx} \times \big(2 (j / f) - h / j\big) \times k / (f d^2) \Big) \bigg]
\end{dmath*}

Does that help? Let's see...
\begin{dmath*}
    \frac{\partial \lVert \bm{\gamma} \rVert_p}{\partial \rho} = 
\end{dmath*}

\end{document}