% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
 % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% for cross referencing the main text
% PLEASE ONLY USE xr IN THE SUPPLEMENTARY MATERIAL. 
% In the main paper, hard code any cross-reference to the supplementary material. 
\usepackage{xr-hyper} 
% \externaldocument{Guo_576}


%############################################
% packages added by authors 

%% HELPER CODE FOR DEALING WITH EXTERNAL REFERENCES

\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}

% %%% END HELPER CODE

% Put all the external documents here!
\myexternaldocument{Guo_576}
\myexternaldocument{table2}
\myexternaldocument{table3}

% packages for tables
\usepackage{array}
\usepackage{caption}
\usepackage{graphicx}
\usepackage{siunitx}
\usepackage[normalem]{ulem}
\usepackage{colortbl}
\usepackage{multirow}
\usepackage{hhline}
\usepackage{calc}
\usepackage{tabularx}
\usepackage{threeparttable}
\usepackage{wrapfig}
\usepackage{adjustbox}
\usepackage{hyperref}


\usepackage{amsmath, amssymb, amsfonts}
\usepackage{xcolor}
\usepackage{color}
%\usepackage{tikz}
\usetikzlibrary{tikzmark}
\usepackage{multirow}
\usepackage{bm}
\usepackage{graphicx}
\usepackage{enumerate}
\usepackage[makeroom]{cancel}
\usepackage{hyperref}
\usepackage{xcolor}

\usepackage{amsthm}
\newtheorem{cor}{Corollary}
\newtheorem{prop}{Proposition}
\newtheorem{lemma}{Lemma}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}[theorem]
\theoremstyle{remark}
\newtheorem*{remark}{Remark}
\newtheorem*{claim}{Claim}
\theoremstyle{definition}
\newtheorem{definition}{Definition}

\newcommand{\red}{\textcolor{red}}
\newcommand{\anna}{\textcolor{olive}}
\def\ci{\perp\!\!\!\perp}
\newcommand{\E}{\mathbb{E}}
\newcommand{\N}{\mathbb{N}}
\DeclareMathOperator{\pa}{pa} 

\allowdisplaybreaks

%############################################


%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

% \title{Identifiability and Estimation under Missing Not at Random Mechanisms\\ (Supplementary Material)}

% \title{Partial Identification and Semiparametric Estimation under Missing Not at Random Mechanisms\\ (Supplementary Material)}

\title{Sufficient Identification Conditions and Semiparametric Estimation under Missing Not at Random Mechanisms \\ (Supplementary Material)}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<anna.guo@emory.edu>?Subject=Your UAI 2023 paper}{Anna~Guo}{}}
\author[2]{Jiwei~Zhao}
\author[1]{Razieh~Nabi}
% Add affiliations after the authors
\affil[1]{%
    Dept. of Biostatistics and Bioinformatics\\
    Emory University\\
    Atlanta, Georgia, USA
}
\affil[2]{%
    Dept. of Biostatistics \& Medical Informatics\\
    University of Wisconsin\\
    Madison, Wisconsin, USA
}


\begin{document}
  
\onecolumn %% Turn this off if single column is desired for the supplement
\maketitle

The appendix is organized as follows. 
In Appendix~\ref{app:nonid-ex}, we provide a counterexample showing that the target law is not identified in the criss-cross MNAR model, using continuous variables with normal distributions. 
Appendix~\ref{app:id-proofs} contains our identification proofs for exponential family distributions: target law with univariate $X$ (\ref{app:sub-target-id-uni}), target law with multivariate $X$ (\ref{app:sub-target-id-multi}), and full law (\ref{app:sub-full-id}). 
In Appendix~\ref{app:parID-ex}, we include several examples of parametric identification for popular exponential family distributions. 
Appendix~\ref{app:est-proofs} contains our proofs regarding the asymptotic behavior of our suggested estimators based on the conditional likelihood with order statistics (\ref{app:sub-est-order}) and the generalized method of moments (\ref{app:sub-est-gmm}). In Appendix~\ref{app:est-additional}, we provide additional discussion of (non)parametric estimation approaches. Appendix~\ref{app:sims} contains additional experiments. 

\appendix

%##############################################
\section{Counterexample for lack of target law identification}\label{app:nonid-ex}

Consider two distinct distributions $p_1$ and $p_2$ defined over variables in $\{X, Y, R_x, R_y\}$  as follows: 

\textbf{Model 1:} $Y\sim \N(1,1),\, X\mid Y\sim \N(y,1),\, p_1(R_x=1\mid y)=\frac{\sqrt{5/6}}{\sqrt{5/6}+\exp \left[-\frac{1}{12}(y-1)^{2}\right]}$, and $$p_1(R_y=1\mid x, R_x) =
    \begin{cases}
      \phi(x),\text{ when }R_x=1\\
      \phi(\frac{x-5}{\sqrt{5}}),\text{ when }R_x=0
    \end{cases}   $$
    
\textbf{Model 2:} $Y\sim \N(1,\frac{6}{5}),\, X\mid Y\sim \N(y,1),\, p_2(R_x=1\mid y)=\frac{\exp \left[-\frac{1}{12}(y-1)^{2}\right]}{\sqrt{5/6}+\exp \left[-\frac{1}{12}(y-1)^{2}\right]}$, and $$p_2(R_y=1\mid x, R_x) =
    \begin{cases}
      \phi(x),\text{ when }R_x=1\\
      \exp (-\frac{8}{9})\times\sqrt{\frac{2}{5}}\,\phi(x-\frac{7}{3}),\text{ when }R_x=0.
    \end{cases}$$
\label{app:counter}%
 Here $\phi(\cdot)$ denotes the standard normal density, and $p_i(x,y,R_x,R_y)=p_i(y) \ p(x\mid y) \ p_i(R_x\mid y) \ p_i(R_y\mid x,R_x),\,i=1,2$. Note that $p_1 \neq p_2$. In what follows, we analyze the four missingness patterns one by one and show that the two models above map to exactly the same observed data distribution; thus, the target law is not identifiable as a unique function of the observed data law.   
\begin{enumerate}
    \item \underline{Missingness pattern $(R_x=1, R_y=1)$.} We need to prove  
    $$p_1(x, y, R_x = 1, R_y = 1) = p_2(x, y, R_x = 1, R_y = 1).$$ 
    This holds since 
    \begin{align*}
        &p_{1}(y) \ p( x\mid y) \ p_{1}(R_{x}=1 \mid y) \ p_{1}\left(R_{y}=1 \mid x, R_{x}=1\right)\\
        &=\frac{1}{\sqrt{2 \pi}} \exp \left\{-\frac{1}{2}(y-1)^{2}\right\}\times p(x \mid y) \times \frac{\sqrt{\frac{5}{6}}}{\sqrt{\frac{5}{6}}+\exp \left[-\frac{1}{12}(y-1)^{2}\right]} \times  \frac{1}{\sqrt{2 \pi}} \exp \left\{-\frac{1}{2} x^{2}\right\} \\
        &=\frac{1}{\sqrt{2 \pi}\sqrt{\frac{6}{5}}} \exp \left\{-\frac{1}{2\times\frac{6}{5}}(y-1)^{2}\right\}\times p(x \mid y) \times \frac{\exp \left[-\frac{1}{12}(y-1)^{2}\right]}{\sqrt{\frac{5}{6}}+\exp \left[-\frac{1}{12}(y-1)^{2}\right]} \times  \frac{1}{\sqrt{2 \pi}} \exp \left\{-\frac{1}{2} x^{2}\right\} \\
        % &=\frac{1}{\sqrt{2 \pi}\sqrt{\frac{6}{5}}} \exp \left\{-\frac{1}{2}(y-1)^{2}\right\} \cdot \frac{1}{\sqrt{\frac{5}{6}}+\exp \left[-\frac{1}{12}(y-1)^{2}\right]} \cdot  \frac{1}{\sqrt{2 \pi}} \exp \left\{-\frac{1}{2} x^{2}\right\} \cdot p(x \mid y)\\
        &=p_{2}(y) \ p( x\mid y) \ p_{2}(R_{x}=1 \mid y) \ p_{2}\left(R_{y}=1 \mid x, R_{x}=1\right). 
    \end{align*}
    
    \item \underline{Missingness pattern $(R_x=1, R_y=0)$.} We need to prove
    $$\int p_1(x, y, R_x = 1, R_y = 0) dy = \int p_2(x, y, R_x = 1, R_y = 0) dy.$$ 
    That is, 
    \begin{align*}
        &\int p_{1}(y) p{(x \mid y)} p_{1}\left(R_{x}=1 \mid y\right) p_{1}\left(R_{y}=0 \mid x, R_{x}=1\right) d y\\
        &\hspace{1cm}=\int p_{2}(y) p{(x \mid y)} p_{2}\left(R_{x}=1 \mid y\right) p_{2}\left(R_{y}=0 \mid x, R_{x}=1\right) d y.
     \end{align*}
     Or in other words: 
    \begin{align*}
        &\int p_{1}(y) p(x\mid y) p_{1}\left(R_{x}=1 \mid y\right) d y-\int p_{1}(y) p(x\mid y) p_{1}\left(R_{x}=1 \mid y\right) p_1(R_y=1\mid x, R_x=1)d y\\
        &\hspace{0.5cm}=\int p_{2}(y) p(x\mid y) p_{2}\left(R_{x}=1 \mid y\right) d y-\int p_{2}(y) p(x\mid y) p_{2}\left(R_{x}=1 \mid y\right) p_2(R_y=1\mid x, R_x=1)d y. 
    \end{align*}
    Since $\int p_{1}(y) p(x\mid y) p_{1}\left(R_{x}=1 \mid y\right) p_1(R_y=1\mid x, R_x=1)d y =\int p_{2}(y) p(x\mid y) p_{2}\left(R_{x}=1 \mid y\right) p_2(R_y=1\mid x, R_x=1)d y$ holds by the missingness pattern $(R_x = 1, R_y = 1)$, we only need to show $$\int p_{1}(y) p(x\mid y) p_{1}\left(R_{x}=1 \mid y\right) d y=\int p_{2}(y) p(x\mid y) p_{2}\left(R_{x}=1 \mid y\right) d y.$$
    We have: 
    \begin{align*}
        &p_{1}(y) \ p(x\mid y) \ p_{1}\left(R_{x}=1 \mid y\right)\\
        &\hspace{1cm}=\frac{1}{\sqrt{2 \pi}} \exp \left\{-\frac{1}{2}(y-1)^{2}\right\} \times p(x \mid y)\times \frac{\sqrt{\frac{5}{6}}}{\sqrt{\frac{5}{6}}+\exp \left[-\frac{1}{12}(y-1)^{2}\right]} \\
        &\hspace{1cm}=\frac{1}{\sqrt{2 \pi}\sqrt{\frac{6}{5}}} \exp \left\{-\frac{1}{2\times\frac{6}{5}}(y-1)^{2}\right\}  \times  p(x \mid y)\times \frac{\exp \left[-\frac{1}{12}(y-1)^{2}\right]}{\sqrt{\frac{5}{6}}+\exp \left[-\frac{1}{12}(y-1)^{2}\right]} \\
        &\hspace{1cm}=p_{2}(y) \ p(x\mid y) \ p_{2}\left(R_{x}=1 \mid y\right).
    \end{align*}
    \item \underline{Missingness pattern ($R_x=0,\, R_y=1$).} We need to prove 
    $$\int p_1(x, y, R_x = 0, R_y = 1) dx = \int p_2(x, y, R_x = 0, R_y = 1) dx.$$ 
    For any $\mu\text{ and }\sigma>0$, it is true that 
    $$\resizebox{\textwidth}{!}{\begin{aligned}
    &\int \phi(x-y)\times\phi(\frac{x-\mu}{\sigma})d x\\
    &=\int \frac{1}{\sqrt{2 \pi}} \exp \left\{-\frac{1}{2}(x-y)^{2}\right\} \times \frac{1}{\sqrt{2 \pi} \sigma} \exp \left\{-\frac{1}{2 \sigma^{2}}(x-\mu)^{2}\right\} d x\\
    &=\frac{1}{\sqrt{2 \pi}} \times \frac{1}{\sqrt{2 \pi} \sigma} \int \exp \left\{-\frac{1}{2} x^{2}+x y-\frac{1}{2} y^{2}-\frac{1}{2 \sigma^{2}} x^{2}+\frac{1}{\sigma^{2}} x \mu-\frac{1}{2 \sigma^{2}} \mu^{2}\right\} d x\\
    &=\frac{1}{\sqrt{2 \pi}} \frac{1}{\sqrt{2 \pi} \sigma} \times \int \exp \left\{-\frac{1}{2 \times \frac{\sigma^{2}}{\sigma^{2}+1}}\left[x^{2}-2 x\left(y+\frac{\mu}{\sigma^{2}}\right) \frac{\sigma^{2}}{\sigma^{2}+1}+\left(y+\frac{\mu}{\sigma^{2}}\right)^{2}\left(\frac{\sigma^{2}}{\sigma^{2}+1}\right)^{2}\right]\right\} \exp \left[-\frac{1}{2} y^{2}-\frac{1}{2 \sigma^{2}} \mu^{2}+\frac{1}{2 \frac{\sigma^{2}}{\sigma^{2}+1}} \times\left(y+\frac{\mu}{\sigma^{2}}\right)^{2}\left(\frac{\sigma^{2}}{\sigma^{2}+1}\right)^{2}\right] d x\\
    &=\frac{1}{\sqrt{2 \pi}} \times \sqrt{\frac{1}{1+\sigma^{2}}} \times \exp \left[-\frac{1}{2} \frac{1}{1+\sigma^{2}} y^{2}+\frac{1}{1+\sigma^{2}} \mu y-\frac{1}{2} \frac{\mu^{2}}{1+\sigma^{2}}\right]. 
    \end{aligned}}$$
    
    Thus, we have: 
    \begin{align*} 
    & p_{1}(y) p_{1}\left(R_{x}=0 \mid y\right) \int p(x \mid y) p_{1}\left(R_{y}=1 \mid x, R_{x}=0\right) d x \\ 
    &=\frac{1}{\sqrt{2 \pi}}\exp \left\{-\frac{1}{2}(y-1)^{2}\right\} \times \frac{\exp \left[-\frac{1}{12}(y-1)^{2}\right]}{\sqrt{\frac{5}{6}}+\exp \left[-\frac{1}{12}(y-1)^{2}\right]}\times \frac{1}{\sqrt{2 \pi}} \sqrt{\frac{1}{6}} \exp \left[-\frac{1}{12} y^{2}+\frac{5}{6} y-\frac{1}{2} \times \frac{25}{6}\right]\\
    &=\frac{1}{2 \pi} \sqrt{\frac{1}{6}} \frac{1}{\sqrt{\frac{5}{6}}+\exp \left[-\frac{1}{12}(y-1)^{2}\right]}\times \exp \left\{-\frac{7}{12}(y-1)^{2}-\frac{1}{12} y^{2}+\frac{5}{6} y-\frac{1}{2} \times \frac{25}{6}\right\}\\
    &=\frac{1}{2 \pi} \sqrt{\frac{1}{6}} \frac{1}{\sqrt{\frac{5}{6}}+\exp \left[-\frac{1}{12}(y-1)^{2}\right]}\times \exp \left\{-\frac{2}{3} y^{2}+2 y-\frac{8}{3}\right\}\\
    %&=  p_{2}(y) P_{2}\left(R_{x}=0 \mid y\right) \int P(x \mid y) p_{2}\left(R_{y}=1 \mid x, R_{x}=0\right) d x\\
    &=p_{2}(y) p_{2}\left(R_{x}=0 \mid y\right) \int p(x \mid y) p_{2}\left(R_{y}=1 \mid x, R_{x}=0\right) d x\\
    &=\frac{1}{\sqrt{2 \pi}}\exp \left\{-\frac{1}{2 \times \frac{6}{5}}(y-1)^{2}\right\} \times \frac{\sqrt{\frac{5}{6}}}{\sqrt{\frac{5}{6}}+\exp \left[-\frac{1}{12}(y-1)^{2}\right]}\times \exp(-\frac{8}{9})\sqrt{\frac{2}{5}}\frac{1}{\sqrt{2 \pi}} \sqrt{\frac{1}{2}} \exp\left[-\frac{1}{4} y^{2}+\frac{7}{6} y-\frac{49}{36}\right]\\
    &=\frac{1}{2 \pi} \sqrt{\frac{1}{6}}\exp(-\frac{8}{9})\frac{1}{\sqrt{\frac{5}{6}}+\exp \left[-\frac{1}{12}(y-1)^{2}\right]}\exp \left\{-\frac{5}{12}(y-1)^{2}-\frac{1}{4} y^{2}+\frac{7}{6} y-\frac{49}{36}\right\}\\
    &=\frac{1}{2 \pi} \sqrt{\frac{1}{6}}\frac{1}{\sqrt{\frac{5}{6}}+\exp \left[-\frac{1}{12}(y-1)^{2}\right]}\exp \left\{-\frac{2}{3} y^{2}+2 y-\frac{8}{3}\right\}.
    \end{align*}
    
    \item \underline{Missingness pattern ($R_x=0,\, R_y=0$).} We need to prove
    $$\int p_1(x, y, R_x=0,R_y=0) dxdy = \int p_2(x, y, R_x=0,R_y=0) dxdy,$$ 
    which is guaranteed to hold because the previous three missingness patterns yield the same observed data law and probabilities must integrate to one. 
\end{enumerate}
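
As a quick consistency check on the Gaussian integral used for the pattern $(R_x=0,\, R_y=1)$, note that its final line is the $\N(\mu, 1+\sigma^{2})$ density evaluated at $y$. The short derivation below is only a sketch that completes the square in the exponent; it uses no quantities beyond the $\mu$ and $\sigma$ already introduced above. 
\begin{align*}
    -\frac{1}{2} \frac{1}{1+\sigma^{2}} y^{2}+\frac{1}{1+\sigma^{2}} \mu y-\frac{1}{2} \frac{\mu^{2}}{1+\sigma^{2}} &= -\frac{(y-\mu)^{2}}{2(1+\sigma^{2})}, \\
    \frac{1}{\sqrt{2 \pi}} \times \sqrt{\frac{1}{1+\sigma^{2}}} \times \exp \left[-\frac{(y-\mu)^{2}}{2(1+\sigma^{2})}\right] &= \frac{1}{\sqrt{2 \pi (1+\sigma^{2})}} \exp \left[-\frac{(y-\mu)^{2}}{2(1+\sigma^{2})}\right]. 
\end{align*}
For instance, $\mu=5$ and $\sigma^{2}=5$ give the exponent $-\frac{1}{12}(y-5)^{2}=-\frac{1}{12} y^{2}+\frac{5}{6} y-\frac{25}{12}$, while $\mu=\frac{7}{3}$ and $\sigma^{2}=1$ give $-\frac{1}{4}\left(y-\frac{7}{3}\right)^{2}=-\frac{1}{4} y^{2}+\frac{7}{6} y-\frac{49}{36}$, which are the exponents appearing in the two models for this pattern. 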

This completes the argument that the target law is not identified in the criss-cross MNAR model. 

%##############################################
\newpage
\section{Identification Proofs}
\label{app:id-proofs} 

\subsection{Theorem~\ref{thm:id-par} \quad {\small (Target law parametric identification: univariate $X$)}}\label{app:sub-target-id-uni}

We have 
\vspace{-0.5cm}
\begin{align*}
    X &\sim \exp\left\{\frac{x\eta_x-b_x(\eta_x)}{\Phi_x}+c_x(x;\; \Phi_x)\right\}
    \\
    Y\mid X &\sim \exp\left\{\frac{y\eta-b(\eta)}{\Phi}+c(y;\; \Phi)\right\}, \quad g(\mu(\eta)) = \alpha + \beta x. 
\end{align*}

The parameters of interest are $\theta=(\alpha,\beta,\Phi,\eta_x,\Phi_x)$. Since $p(x\mid y)$ is nonparametrically (np)-identified, we can select two distinct values of $X$, say $x_1$ and $x_0$, and write
\begin{align*}
    \frac{p(x_1\mid y)}{p(x_0\mid y)}&={\frac{p(y\mid x_1)p(x_1)}{p(y)}} \div {\frac{p(y\mid x_0)p(x_0)}{p(y)}}
    =\frac{p(y\mid x_1)}{p(y\mid x_0)}\times\frac{p(x_1)}{p(x_0)}\\
    &=\exp\left\{ \frac{y(\eta_1-\eta_0)-[b(\eta_1)-b(\eta_0)]}{\Phi} \right\}\times\exp\left\{\frac{\eta_x(x_1-x_0)}{\Phi_x}+c_x(x_1;\;\Phi_x)-c_x(x_0;\;\Phi_x)\right\}. 
\end{align*}
Here $\eta_i$ denotes the natural parameter $\eta$ evaluated at $x=x_i$. We take the logarithm of both sides. The left-hand side is a function of $y$ only, and the right-hand side is linear in $y$. Suppose the coefficient of $y$ is $\phi_1$ and the intercept is $\zeta_1$. For ease of notation, define $\varphi=[g\circ \mu]^{-1}$ and $\zeta=b([g\circ \mu]^{-1})$. We can then write the following: 
$$\begin{aligned}
\phi_1(\theta) &= \frac{\eta_1-\eta_0}{\Phi}=\frac{[g\circ \mu]^{-1}(\alpha+x_1\beta)-[g\circ \mu]^{-1}(\alpha+x_0\beta)}{\Phi}=\frac{\varphi(\alpha+x_1\beta)-\varphi(\alpha+x_0\beta)}{\Phi}\\
\zeta_1(\theta) &= \left\{ \frac{-[b(\eta_1)-b(\eta_0)]}{\Phi}+\frac{\eta_x(x_1-x_0)}{\Phi_x}+c_x(x_1;\;\Phi_x)-c_x(x_0;\;\Phi_x)\right\}\\
&=\left\{ \frac{-\left[b\left([g\circ \mu]^{-1}(\alpha+x_1\beta)\right)-b\left([g\circ \mu]^{-1}(\alpha+x_0\beta)\right)\right]}{\Phi}+\frac{\eta_x(x_1-x_0)}{\Phi_x}+c_x(x_1;\;\Phi_x)-c_x(x_0;\;\Phi_x)\right\}\\
&=\left\{ \frac{-\zeta(\alpha+x_1\beta)+\zeta(\alpha+x_0\beta)}{\Phi}+\frac{\eta_x(x_1-x_0)}{\Phi_x}+c_x(x_1;\;\Phi_x)-c_x(x_0;\;\Phi_x)\right\}. 
\end{aligned}$$
Suppose we have $k+1$ distinct values of $x$. We can then create $2k$ equations like the ones above, say $\phi_i$ and $\zeta_i$ with $i=1,\dots, k$. The core of our identification proof relies on the \textit{implicit function theorem}. To apply this theorem, the above equations need to satisfy the following conditions: 
\begin{enumerate}
    \item There exists at least one solution $\theta_0$ that satisfies the above equations, 
    \item $\phi_i(\theta)$ and $\zeta_i(\theta)$ are continuous in $\Theta$, i.e., the parameter space with $\theta_0$ as an inner point, 
    \item $\phi_i(\theta)$ and $\zeta_i(\theta)$ are first order partially differentiable in $\Theta$, 
    \item Let $\Phi=\{\phi_1,\dots, \phi_k\}$ and $Z=\{\zeta_1,\dots,\zeta_k\}$. Define the Jacobian matrix $J$ as $J=\frac{\partial(\Phi,\,Z)}{\partial(\theta)}$, which is described below: 

    $$
    \resizebox{\textwidth}{!}{
    J=\begin{bmatrix}
    \varphi^{\prime}\left(\alpha+x_{1} \beta\right)-\varphi^{\prime}\left(\alpha+x_{0} \beta\right) & \varphi^{\prime}\left(\alpha+x_{1} \beta\right) x_{1}-\varphi^{\prime}\left(\alpha+x_{0} \beta\right) x_{0} & \varphi\left(\alpha+x_{1} \beta\right)-\varphi\left(\alpha+x_{0} \beta\right) & 0 & 0\\
    \vdots & \vdots & \vdots & \vdots & \vdots \\
    \varphi^{\prime}\left(\alpha+x_{k} \beta\right)-\varphi^{\prime}\left(\alpha+x_{0} \beta\right) & \varphi^{\prime}\left(\alpha+x_{k} \beta\right) x_{k}-\varphi^{\prime}\left(\alpha+x_{0} \beta\right) x_{0} & \varphi\left(\alpha+x_{k} \beta\right)-\varphi\left(\alpha+x_{0} \beta\right) & 0 & 0\\
    \zeta^{\prime}\left(\alpha+x_{1} \beta\right)-\zeta^{\prime}\left(\alpha+x_{0} \beta\right) & \zeta^{\prime}\left(\alpha+x_{1} \beta\right) x_{1}-\zeta^{\prime}\left(\alpha+x_{0} \beta\right) x_{0} & \zeta\left(\alpha+x_{1} \beta\right)-\zeta\left(\alpha+x_{0} \beta\right) & x_{1}-x_{0}& -\frac{\eta_{x}\left(x_{1}-x_{0}\right)}{\Phi^2_x}+\frac{\partial c_x\left(x_{1} , \Phi_{x}\right)}{\partial \Phi_x}-\frac{\partial c_x\left(x_{0} , \Phi_{x}\right)}{\partial \Phi_x}\\
    \vdots & \vdots & \vdots & \vdots & \vdots \\
    \zeta^{\prime}\left(\alpha+x_{k} \beta\right)-\zeta^{\prime}\left(\alpha+x_{0} \beta\right) & \zeta^{\prime}\left(\alpha+x_{k} \beta\right) x_{k}-\zeta^{\prime}\left(\alpha+x_{0} \beta\right) x_{0} & \zeta\left(\alpha+x_{k} \beta\right)-\zeta\left(\alpha+x_{0} \beta\right) & x_{k}-x_{0} & -\frac{\eta_{x}\left(x_{k}-x_{0}\right)}{\Phi^2_x}+\frac{\partial c_x\left(x_{k} , \Phi_{x}\right)}{\partial \Phi_x}-\frac{\partial c_x\left(x_{0} , \Phi_{x}\right)}{\partial \Phi_x}
    \end{bmatrix}
    }
    $$
    
    $J$ must be of full rank under $\left(\theta_0,\phi_i(\theta_0),\zeta_i(\theta_0)\right)$, 
    
    \item The number of equations must be greater than or equal to the number of unknown parameters, i.e.,  $2k\geq \dim(\theta)$.
    % (this  is not a condition for the implicit function theorem). 
\end{enumerate}

% \red{Under the above conditions, we can apply the implicit function theorem and find a uniquely defined mapping $g$ such that $$\theta=g\left(\phi_1(\theta),\dots,\phi_k(\theta),\zeta_1(\theta),\dots,\zeta_k(\theta)\right),$$ where $\theta\in \Theta$ and $\left(\phi_1(\theta),\dots,\phi_k(\theta),\zeta_1(\theta),\dots,\zeta_k(\theta)\right)\in U$ with $U=B\left(\phi_1(\theta_0),\dots,\phi_k(\theta_0),\zeta_1(\theta_0),\dots,\zeta_k(\theta_0)\right)$
% given that the $\left(\phi_1,\dots,\phi_k,\zeta_1,\dots,\zeta_k\right)$ we observed is actually under the true value $\theta_0$, which is $$\text{observed }\left(\phi_1,\dots,\phi_k,\zeta_1,\dots,\zeta_k\right)=\left(\phi_1(\theta_0),\dots,\phi_k(\theta_0),\zeta_1(\theta_0),\dots,\zeta_k(\theta_0)\right)$$
% so through $g$, we can uniquely find $\theta_0=g\left(\phi_1(\theta_0),\dots,\phi_k(\theta_0),\zeta_1(\theta_0),\dots,\zeta_k(\theta_0)\right).$
% }

Under the above conditions, there exist a neighborhood $U=B\left(\theta_0, \epsilon\right) \subset \Theta$ of the true parameter $\theta_0$, a neighborhood $V=B\left((\phi_1(\theta_0),\dots,\phi_k(\theta_0),\zeta_1(\theta_0),\dots,\zeta_k(\theta_0)),\delta\right)\subset \mathbb{R}^{2k}$ of $(\phi_i(\theta_0),\zeta_i(\theta_0))$, with $\epsilon, \delta>0$, and a uniquely defined function $g=\left(g_1, \ldots, g_{\dim(\theta)}\right)$ on $V$ such that each $g_i$ is continuously differentiable. We have $$\theta=g\left(\phi_1(\theta),\dots,\phi_k(\theta),\zeta_1(\theta),\dots,\zeta_k(\theta)\right),$$ where $\left(\phi_1(\theta),\dots,\phi_k(\theta),\zeta_1(\theta),\dots,\zeta_k(\theta)\right)\in V$ and $\theta\in U$. Since the observed values $\left(\phi_1,\dots,\phi_k,\zeta_1,\dots,\zeta_k\right)$ are generated under the true value $\theta_0$, that is,
$\left(\phi_1,\dots,\phi_k,\zeta_1,\dots,\zeta_k\right)=\left(\phi_1(\theta_0),\dots,\phi_k(\theta_0),\zeta_1(\theta_0),\dots,\zeta_k(\theta_0)\right)$, applying $g$ uniquely recovers $\theta_0=g\left(\phi_1(\theta_0),\dots,\phi_k(\theta_0),\zeta_1(\theta_0),\dots,\zeta_k(\theta_0)\right).$
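
To make the construction concrete, consider the following illustrative special case. It is only a sketch: the Poisson outcome model with the canonical log link is an assumption introduced here for illustration and is not part of the statement of the theorem. Suppose $Y\mid X\sim \operatorname{Poisson}(\lambda)$ with $\log \lambda=\alpha+\beta x$, so that $\eta=\log\lambda$, $b(\eta)=e^{\eta}$, $\Phi=1$ is known, and $g\circ \mu$ is the identity map. Then $\varphi(t)=t$ and $\zeta(t)=e^{t}$, and the equations above reduce to
\begin{align*}
    \phi_i(\theta) &= (x_i-x_0)\beta, \\
    \zeta_i(\theta) &= -\left[e^{\alpha+x_i\beta}-e^{\alpha+x_0\beta}\right]+\frac{\eta_x(x_i-x_0)}{\Phi_x}+c_x(x_i;\;\Phi_x)-c_x(x_0;\;\Phi_x), 
\end{align*}
for $i=1,\dots,k$. In this case $\beta=\phi_1/(x_1-x_0)$ is recovered directly from the slope in $y$, while the remaining parameters $(\alpha,\eta_x,\Phi_x)$ enter only through the intercepts $\zeta_i$. 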

%++++++++++++++++++++++++++++++++++++++
%\newpage
\vspace{0.5cm}
\subsection{Target law parametric identification: multivariate \ $\bf X$}\label{app:sub-target-id-multi}

\subsubsection{Multivariate normal \ $\bf X$}\label{proofs:target_id-multinormal} 

Suppose $$\begin{aligned}
    &X \sim \N_d(\mu, \Sigma)\\
    & Y\mid X \sim \exp\left\{\frac{y\eta-b(\eta)}{\Phi}+c(y;\; \Phi)\right\}, \quad g(\mu(\eta)) = \alpha + x^{T}\beta. 
\end{aligned}$$
Assume the nuisance parameter $\Sigma$ is known and $\theta=(\alpha, \beta, \Phi, \mu)$. We can write down the following equation: 
$$\resizebox{\textwidth}{!}{
  \begin{aligned}
  \frac{p\left(x_1 \mid y\right)}{p\left(x_0 \mid y\right)}&=\frac{p\left(y \mid x_1\right)}{p\left(y \mid x_0\right)} \times \frac{p\left(x_1\right)}{p\left(x_0\right)}\\
  &=\exp \left\{\frac{y\left(\eta_1-\eta_0\right)-\left[b\left(\eta_1\right)-b \left(\eta_0\right)\right]}{\Phi}\right\} \exp \left\{-\frac{1}{2}\left(x_1-\mu\right)^{T} \Sigma^{-1}\left(x_1-\mu\right)+\frac{1}{2}\left(x_0-\mu\right)^{T} \Sigma^{-1}\left(x_0-\mu\right)\right\}. 
  \end{aligned}
  }$$ 
  
Taking a log on both sides yields the following equation: 
$$\resizebox{\textwidth}{!}{
\begin{aligned}
\log \left[p\left(x_1 \mid y\right)\right]-\log\left[p\left(x_0 \mid y\right)\right]&=y \times \frac{\eta_1-\eta_0}{\Phi}-\frac{b\left(\eta_1\right)-b\left(\eta_0\right)}{\Phi}-\frac{1}{2}\left(x_1-\mu\right)^{T} \Sigma^{-1}\left(x_1-\mu\right)+\frac{1}{2}\left(x_0-\mu\right)^{T} \Sigma^{-1}\left(x_0-\mu\right). 
\end{aligned}
}$$

The left-hand side is only a function of $y$. Suppose the coefficient of $y$ is $\phi_1$ and the intercept is $\zeta_1$.  For ease of notation, define $\varphi=[g\circ \mu]^{-1}$ and $\zeta=b([g\circ \mu]^{-1})$. Then, we obtain the following equations: 
$$\begin{aligned}
\phi_1(\theta)&=\frac{\eta_1-\eta_0}{\Phi}=\frac{\left[g\circ \mu\right]^{-1}\left(\alpha+x_1^{T} \beta\right)-\left[g\circ \mu\right]^{-1}\left(\alpha+x_0^{T} \beta\right)}{\Phi}=\frac{\varphi\left(\alpha+x_1^{T} \beta\right)-\varphi\left(\alpha+x_0^{T} \beta\right)}{\Phi}\\
\zeta_1(\theta) &= -\frac{b\left(\eta_1\right)-b\left(\eta_0\right)}{\Phi}-\frac{1}{2}\left(x_1-\mu\right)^{T} \Sigma^{-1}\left(x_1-\mu\right)+\frac{1}{2}\left(x_0-\mu\right)^{T} \Sigma^{-1}\left(x_0-\mu\right)\\
& =  -\frac{\zeta(\alpha+x_1^{T} \beta)-\zeta(\alpha+x_0^{T} \beta)}{\Phi}-\frac{1}{2}\left(x_1-\mu\right)^{T} \Sigma^{-1}\left(x_1-\mu\right)+\frac{1}{2}\left(x_0-\mu\right)^{T} \Sigma^{-1}\left(x_0-\mu\right). 
\end{aligned}$$

Suppose we have $k+1$ distinct values of $x$. Thus, we can construct $2k$ equations, $\phi_i$ and $\zeta_i$ with $i=1,\dots, k$. To apply the implicit function theorem, the above equations
need to satisfy the following conditions:
\begin{enumerate}
    \item There exists at least one solution $\theta_0$ that satisfies the above equations, 
    \item $\phi_i(\theta)$ and $\zeta_i(\theta)$ are continuous on $\Theta$, i.e., the parameter space with $\theta_0$ as an inner point, 
    \item $\phi_i(\theta)$ and $\zeta_i(\theta)$ are first order partially differentiable on $\Theta$, 
    \item Let $\Phi=\{\phi_1,\dots, \phi_k\}$ and $Z=\{\zeta_1,\dots,\zeta_k\}$. Define the Jacobian matrix $J$ as $J=\frac{\partial(\Phi,\,Z)}{\partial(\theta)}$, described below: 
    
    $$
    \resizebox{\textwidth}{!}{
    J=\begin{bmatrix}
    \varphi^{\prime}\left(\alpha+x^{T}_1\beta\right)-\varphi^{\prime}\left(\alpha+x^{T}_0\beta\right) & \varphi^{\prime}\left(\alpha+x^{T}_1\beta\right) x^{T}_{1}-\varphi^{\prime}\left(\alpha+x^{T}_0\beta\right) x^{T}_{0} & \varphi\left(\alpha+x^{T}_1\beta\right)-\varphi\left(\alpha+x^{T}_0\beta\right) & 0\\
    \vdots & \vdots & \vdots & \vdots  \\
    \varphi^{\prime}\left(\alpha+x^{T}_{k} \beta\right)-\varphi^{\prime}\left(\alpha+x^{T}_0\beta\right) & \varphi^{\prime}\left(\alpha+x^{T}_{k} \beta\right) x^{T}_{k}-\varphi^{\prime}\left(\alpha+x^{T}_0\beta\right) x^{T}_{0} & \varphi\left(\alpha+x^{T}_{k} \beta\right)-\varphi\left(\alpha+x^{T}_0\beta\right) & 0 \\
    \zeta^{\prime}\left(\alpha+x^{T}_1\beta\right)-\zeta^{\prime}\left(\alpha+x^{T}_0\beta\right) &\zeta^{\prime}\left(\alpha+x^{T}_1\beta\right) x^{T}_{1}-\zeta^{\prime}\left(\alpha+x^{T}_0\beta\right) x^{T}_{0} & \zeta\left(\alpha+x^{T}_1\beta\right)-\zeta\left(\alpha+x^{T}_0\beta\right) & \left(x_1-x_0\right)^{T} \Sigma^{-1}\\
    \vdots & \vdots & \vdots & \vdots \\
    \zeta^{\prime}\left(\alpha+x^{T}_{k} \beta\right)-\zeta^{\prime}\left(\alpha+x^{T}_0\beta\right) & \zeta^{\prime}\left(\alpha+x^{T}_{k} \beta\right) x^{T}_{k}-\zeta^{\prime}\left(\alpha+x^{T}_0\beta\right) x^{T}_{0}& \zeta\left(\alpha+x^{T}_k\beta\right)-\zeta\left(\alpha+x^{T}_0\beta\right) & \left(x_k-x_0\right)^{T} \Sigma^{-1}
    \end{bmatrix}
    }
    $$

    $J$ must be of full rank under $\left(\theta_0,\phi_i(\theta_0),\zeta_i(\theta_0)\right)$, 
    
    \item The number of equations must be greater than or equal to the number of unknown parameters, i.e.,  $2k\geq \dim(\theta)$.
\end{enumerate}

Under the special case where $Y\mid X\sim \N\left(\alpha+x^{T}\beta,\Phi\right)$, we have: 
$$\begin{aligned}
&\phi_i(\theta)=\frac{\left(x_i-x_0\right)^{T} \beta}{\Phi}\\
&\zeta_i(\theta)=-\frac{\left(\alpha+x_i^{T} \beta\right)^2-\left(\alpha+x_0^{T} \beta\right)^2}{2 \Phi}-\frac{1}{2}\left(x_i-\mu\right)^{T} \Sigma^{-1}\left(x_i-\mu\right)+\frac{1}{2}\left(x_0-\mu\right)^{T} \Sigma^{-1}\left(x_0-\mu\right),\\
&\text{where } i\in \{1,\ldots,k\}, \text{ and }
\end{aligned}$$

$$
\resizebox{\textwidth}{!}{
    J=\begin{bmatrix}
0 & \frac{(x_{1}-x_{0})^{T}}{\Phi} & -\frac{(x_{1}-x_{0})^{T}\beta}{\Phi^2} & 0\\
\vdots & \vdots & \vdots & \vdots  \\
0 & \frac{(x_{k}-x_{0})^{T}}{\Phi} & -\frac{(x_{k}-x_{0})^{T}\beta}{\Phi^2} & 0\\
-\frac{(x_{1}-x_{0})^{T}\beta}{\Phi} & -\frac{\alpha(x_{1}-x_{0})^{T}+\beta^{T}(x_1x^{T}_1-x_0x^{T}_0)}{\Phi} & \frac{(\alpha+x^{T}_1\beta)^2-(\alpha+x^{T}_0\beta)^2}{2\Phi^2} & \left(x_{1}-x_{0}\right)^{T} \Sigma^{-1}\\
\vdots & \vdots & \vdots & \vdots \\
-\frac{(x_{k}-x_{0})^{T}\beta}{\Phi} & -\frac{\alpha(x_{k}-x_{0})^{T}+\beta^{T}(x_kx^{T}_k-x_0x^{T}_0)}{\Phi} & \frac{(\alpha+x^{T}_k\beta)^2-(\alpha+x^{T}_0\beta)^2}{2\Phi^2} & \left(x_{k}-x_{0}\right)^{T} \Sigma^{-1}\\
\end{bmatrix}
    }
    $$
After performing some rank-preserving modifications to this matrix, we have
    $$
    \resizebox{\textwidth}{!}{
        J=\begin{bmatrix}
    0 & (x_{1}-x_{0})^{T} & -(x_{1}-x_{0})^{T}\beta & 0\\
    \vdots & \vdots & \vdots & \vdots  \\
    0 & (x_{k}-x_{0})^{T} & -(x_{k}-x_{0})^{T}\beta & 0\\
    (x_{1}-x_{0})^{T}\beta & -\left[\alpha(x_{1}-x_{0})^{T}+\beta^{T}(x_1x^{T}_1-x_0x^{T}_0)\right] & \frac{(\alpha+x^{T}_1\beta)^2-(\alpha+x^{T}_0\beta)^2}{2} & \left(x_{1}-x_{0}\right)^{T} \Sigma^{-1}\\
    \vdots & \vdots & \vdots & \vdots \\
    (x_{k}-x_{0})^{T}\beta & -\left[\alpha(x_{k}-x_{0})^{T}+\beta^{T}(x_kx^{T}_k-x_0x^{T}_0)\right] & \frac{(\alpha+x^{T}_k\beta)^2-(\alpha+x^{T}_0\beta)^2}{2} & \left(x_{k}-x_{0}\right)^{T} \Sigma^{-1}\\
    \end{bmatrix}
        }
    $$

The matrix $J$ has dimension $2k\times (2+2d)$. Assume $2k\geq 2+2d$. A sufficient condition for $J$ to have full rank is that at least $\alpha$ is known. 
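
One way to see why $\alpha$ cannot be left free is the following sketch, which uses only the blocks of the rescaled matrix above. Let $D$ denote the $k\times d$ matrix with rows $(x_i-x_0)^{T}$, $i=1,\dots,k$ (a shorthand introduced here for illustration), and write $J_{\alpha}$ for the first column of $J$ and $J_{\mu}$ for its last $d$ columns. Then
$$
J_{\alpha}=\begin{bmatrix} 0 \\ D\beta \end{bmatrix}, \qquad
J_{\mu}=\begin{bmatrix} 0 \\ D\Sigma^{-1} \end{bmatrix}, \qquad \text{so that} \qquad
J_{\mu}\,\Sigma\beta=\begin{bmatrix} 0 \\ D\Sigma^{-1}\Sigma\beta \end{bmatrix}=J_{\alpha}. 
$$
Hence the column associated with $\alpha$ is a linear combination of the columns associated with $\mu$, and $J$ cannot have full column rank while both $\alpha$ and $\mu$ are unknown. Treating $\alpha$ as known removes this column. 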

Note that in this example \textbf{$p(X\mid Y)$ is in the exponential family}, since:
$$
\resizebox{\textwidth}{!}{
\begin{aligned}
    p(x \mid y)&=\frac{p(y \mid x) p(x)}{p(y)}\\
    &=\exp \left\{-\frac{\left[y-\left(\alpha+x^{T} \beta\right)\right]^2}{2 \Phi}+\log \frac{1}{\sqrt{2 \pi \Phi}}-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)+\log \frac{1}{\sqrt{(2 \pi)^d|\Sigma|}}-\log p(y)\right\}\\
    &=\exp\bigg\{-\frac{(y-\alpha)^2}{2 \Phi}+\frac{(y \beta-\alpha \beta)^{T}}{\Phi} x-\frac{\operatorname{tr}\left(\beta \beta^{T} x x^{T}\right)}{2 \Phi}+\log \frac{1}{\sqrt{2 \pi \Phi}}
    +\mu^{T} \Sigma^{-1} x-\frac{1}{2} x^{T} \Sigma^{-1} x-\frac{1}{2} \mu^{T} \Sigma^{-1} \mu+\log \frac{1}{\sqrt{(2 \pi)^{d}|\Sigma|}}-\log p(y)\bigg\}\\
    &=\exp\bigg\{ \left[\frac{(y \beta-\alpha \beta)^{T}}{\Phi}+\mu^{T} \Sigma^{-1}, -\frac{\operatorname{vec}\left(\beta \beta^{T}\right)^{T}}{2 \Phi}\right]\left(\begin{array}{c}
x \\
\operatorname{vec}\left(x x^{T}\right)
\end{array}\right)-\frac{(y-\alpha)^2}{2 \Phi}+\log \frac{1}{\sqrt{2 \pi \Phi}}
    -\frac{1}{2} x^{T} \Sigma^{-1} x-\frac{1}{2} \mu^{T} \Sigma^{-1} \mu+\log \frac{1}{\sqrt{(2 \pi)^{d}|\Sigma|}}-\log p(y)\bigg\}.
\end{aligned}}$$

Here $\operatorname{tr}(\cdot)$ denotes the trace of the input matrix and $\operatorname{vec}(\cdot)$ denotes the vectorization of the input matrix, e.g., $A_{n\times m}$, obtained by stacking the rows of the matrix one by one to form a long column vector of size $nm\times 1$, i.e., 
$$\operatorname{vec}[A]=\operatorname{vec}\left[\left(\begin{array}{ccc}
a_{11} & \cdots & a_{1 m} \\
\vdots & \ddots & \vdots \\
a_{n 1} & \cdots & a_{n m}
\end{array}\right)\right]=\left(\begin{array}{c}
a_{11} \\
\vdots \\
a_{1 m} \\
\vdots \\
a_{n m}
\end{array}\right).$$
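
As a small illustration of this notation, the trace term in the display above can be checked by hand for $d=2$. The following is only a sketch; the component names $\beta=(\beta_1,\beta_2)^{T}$ and $x=(u,v)^{T}$ are introduced here for illustration. With row-wise stacking, 
$$\operatorname{vec}\left(\beta\beta^{T}\right)=\left(\beta_1^2,\ \beta_1\beta_2,\ \beta_2\beta_1,\ \beta_2^2\right)^{T}, \qquad \operatorname{vec}\left(xx^{T}\right)=\left(u^2,\ uv,\ vu,\ v^2\right)^{T},$$
so that 
$$\operatorname{vec}\left(\beta\beta^{T}\right)^{T}\operatorname{vec}\left(xx^{T}\right)=\beta_1^2u^2+2\beta_1\beta_2uv+\beta_2^2v^2=\left(\beta_1u+\beta_2v\right)^2=\left(x^{T}\beta\right)^2=\operatorname{tr}\left(\beta\beta^{T}xx^{T}\right).$$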

\vspace{0.5cm}
\subsubsection{Multinomial \ $\bf X$}\label{proofs:target_id-multinomial}

Suppose $$\begin{aligned}
&X\sim \operatorname{Multinomial}_d(n,p),\\
&Y\mid X \sim \exp\left\{\frac{y\eta-b(\eta)}{\Phi}+c(y;\; \Phi)\right\}, \quad g(\mu(\eta)) = \alpha + x^{T}\beta,
\end{aligned}$$

where $p=(p_1,\ldots, p_d)$ is the vector of event probabilities, and $n$ is the number of trials. We can write $p(x)=\exp [x^{T} \eta+c(x)]$ with $\eta=\left(\log p_1, \ldots, \log p_d\right)$ and $c(x)=\log \frac{n !}{x_{1} ! \cdots x_{d} !}$. Assume the nuisance parameter $n$ is known and $\theta=(\alpha, \beta, \Phi, \eta)$. We can write down the following:  
\begin{align*}
\frac{p\left(x_1 \mid y\right)}{p\left(x_0 \mid y\right)}&=\frac{p\left(y \mid x_1\right)}{p\left(y \mid x_0\right)} \times\frac{p\left(x_1\right)}{p\left(x_0\right)} \\
& =\exp \left\{\frac{y\left(\eta_1-\eta_0\right)-\left[b\left(\eta_1\right)-b\left(\eta_0\right)\right]}{\Phi}\right\} \exp \left\{\left(x_1-x_0\right)^{T} \eta+c\left(x_1\right)-c\left(x_0\right)\right\}. 
\end{align*}
Taking a $\log$ on both sides yields the following: 
$$\log \left[p\left(x_1 \mid y\right)\right]-\log \left[p\left(x_0 \mid y\right)\right]=y \frac{\eta_1-\eta_0}{\Phi}-\frac{b\left(\eta_1\right)-b\left(\eta_0\right)}{\Phi}+\left(x_1-x_0\right)^{T}\eta+c\left(x_1\right)-c\left(x_0\right)$$
  The left-hand side is only a function of $y$. Suppose the coefficient of $y$ is $\phi_1$ and the intercept is $\zeta_1$.  For ease of notation, define $\varphi=[g\circ \mu]^{-1}$ and $\zeta=b([g\circ \mu]^{-1})$. Thus, we obtain the following: 
$$\begin{aligned}
\phi_1(\theta)&=\frac{\eta_1-\eta_0}{\Phi}=\frac{\left[g\circ \mu\right]^{-1}\left(\alpha+x_1^{T} \beta\right)-\left[g\circ \mu\right]^{-1}\left(\alpha+x_0^{T} \beta\right)}{\Phi}=\frac{\varphi\left(\alpha+x_1^{T} \beta\right)-\varphi\left(\alpha+x_0^{T} \beta\right)}{\Phi}\\
\zeta_1(\theta) &= -\frac{b\left(\eta_1\right)-b\left(\eta_0\right)}{\Phi}+\left(x_1-x_0\right)^{T}\eta+c\left(x_1\right)-c\left(x_0\right)\\
& =  -\frac{\zeta(\alpha+x_1^{T} \beta)-\zeta(\alpha+x_0^{T} \beta)}{\Phi}+\left(x_1-x_0\right)^{T} \eta+c\left(x_1\right)-c\left(x_0\right). 
\end{aligned}$$

Suppose we have $k+1$ distinct values of $x$. Thus, we can construct $2k$ equations, $\phi_i$ and $\zeta_i$ with $i=1,\dots, k$. To apply the implicit function theorem, the equations need to satisfy the following conditions: 
\begin{enumerate}
    \item There exists at least one solution $\theta_0$ that satisfies the above equations, 
    \item $\phi_i(\theta)$ and $\zeta_i(\theta)$ are continuous on $\Theta$, i.e., the parameter space with $\theta_0$ as an inner point, 
    \item $\phi_i(\theta)$ and $\zeta_i(\theta)$ are first order partially differentiable on $\Theta$, 
    \item Let $\Phi=\{\phi_1,\dots, \phi_k\}$ and $Z=\{\zeta_1,\dots,\zeta_k\}$. Define the Jacobian matrix $J$ as $J=\frac{\partial(\Phi,\,Z)}{\partial(\theta)}$, described below: 
    $$
    \resizebox{\textwidth}{!}{
    J=\begin{bmatrix}
\varphi^{\prime}\left(\alpha+x^{T}_1\beta\right)-\varphi^{\prime}\left(\alpha+x^{T}_0\beta\right) & \varphi^{\prime}\left(\alpha+x^{T}_1\beta\right) x^{T}_{1}-\varphi^{\prime}\left(\alpha+x^{T}_0\beta\right) x^{T}_{0} & \varphi\left(\alpha+x^{T}_1\beta\right)-\varphi\left(\alpha+x^{T}_0\beta\right) & 0\\
\vdots & \vdots & \vdots & \vdots  \\
\varphi^{\prime}\left(\alpha+x^{T}_{k} \beta\right)-\varphi^{\prime}\left(\alpha+x^{T}_0\beta\right) & \varphi^{\prime}\left(\alpha+x^{T}_{k} \beta\right) x^{T}_{k}-\varphi^{\prime}\left(\alpha+x^{T}_0\beta\right) x^{T}_{0} & \varphi\left(\alpha+x^{T}_{k} \beta\right)-\varphi\left(\alpha+x^{T}_0\beta\right) & 0 \\
\zeta^{\prime}\left(\alpha+x^{T}_1\beta\right)-\zeta^{\prime}\left(\alpha+x^{T}_0\beta\right) & \zeta^{\prime}\left(\alpha+x^{T}_1\beta\right) x^{T}_{1}-\zeta^{\prime}\left(\alpha+x^{T}_0\beta\right) x^{T}_{0} & \zeta\left(\alpha+x^{T}_1\beta\right)-\zeta\left(\alpha+x^{T}_0\beta\right) & \left(x_1-x_0\right)^{T} M\\
\vdots & \vdots & \vdots & \vdots \\
\zeta^{\prime}\left(\alpha+x^{T}_{k} \beta\right)-\zeta^{\prime}\left(\alpha+x^{T}_0\beta\right) & \zeta^{\prime}\left(\alpha+x^{T}_{k} \beta\right) x^{T}_{k}-\zeta^{\prime}\left(\alpha+x^{T}_0\beta\right) x^{T}_{0}& \zeta\left(\alpha+x^{T}_k\beta\right)-\zeta\left(\alpha+x^{T}_0\beta\right) & \left(x_k-x_0\right)^{T} M
\end{bmatrix}
    }
    $$
    where $M_{d\times d-1}=\begin{bmatrix}
           \mathbb{I}_{d-1\times d-1} \\
           (-1,-1,\cdots,-1)_{1\times d-1} \\
         \end{bmatrix},\, \mathbb{I}\text{ is the identity matrix}.$ 

    The Jacobian matrix $J$ must be of full rank under $\left(\theta_0,\phi_i(\theta_0),\zeta_i(\theta_0)\right)$. 
    
         
    \item The number of equations must be greater than or equal to the number of unknown parameters, i.e.,  $2k\geq dim(\theta)$.
\end{enumerate}

Under the special case where $Y\mid X\sim \N\left(\alpha+x^{T}\beta,\Phi\right)$, we have: 
$$\begin{aligned}
&\phi_i(\theta)=\frac{\left(x_i-x_0\right)^{T} \beta}{\Phi}\\
&\zeta_i(\theta)=-\frac{\left(\alpha+x_i^{T} \beta\right)^2-\left(\alpha+x_0^{T} \beta\right)^2}{2 \Phi}+\left(x_i-x_0\right)^{T} \eta+c\left(x_i\right)-c\left(x_0\right), \quad i\in (1,2,\cdots,k), 
\end{aligned}$$

$$
\resizebox{\textwidth}{!}{
    J=\begin{bmatrix}
0 & \frac{(x_{1}-x_{0})^{T}}{\Phi} & -\frac{(x_{1}-x_{0})^{T}\beta}{\Phi^2} & 0\\
\vdots & \vdots & \vdots & \vdots  \\
0 & \frac{(x_{k}-x_{0})^{T}}{\Phi} & -\frac{(x_{k}-x_{0})^{T}\beta}{\Phi^2} & 0\\
-\frac{(x_{1}-x_{0})^{T}\beta}{\Phi} & -\frac{\alpha(x_{1}-x_{0})^{T}+\beta^{T}(x_1x^{T}_1-x_0x^{T}_0)}{\Phi} & \frac{(\alpha+x^{T}_1\beta)^2-(\alpha+x^{T}_0\beta)^2}{2\Phi^2} & \left(x_{1}-x_{0}\right)^{T}M\\
\vdots & \vdots & \vdots & \vdots \\
-\frac{(x_{k}-x_{0})^{T}\beta}{\Phi} & -\frac{\alpha(x_{k}-x_{0})^{T}+\beta^{T}(x_kx^{T}_k-x_0x^{T}_0)}{\Phi} & \frac{(\alpha+x^{T}_k\beta)^2-(\alpha+x^{T}_0\beta)^2}{2\Phi^2} & \left(x_{k}-x_{0}\right)^{T}M\\
\end{bmatrix}
    }
    $$
After performing some rank-preserving modifications to this matrix, we get: 
$$
\resizebox{\textwidth}{!}{
    J=\begin{bmatrix}
0 & (x_{1}-x_{0})^{T} & -(x_{1}-x_{0})^{T}\beta & 0\\
\vdots & \vdots & \vdots & \vdots  \\
0 & (x_{k}-x_{0})^{T} & -(x_{k}-x_{0})^{T}\beta & 0\\
(x_{1}-x_{0})^{T}\beta & -\left[\alpha(x_{1}-x_{0})^{T}+\beta^{T}(x_1x^{T}_1-x_0x^{T}_0)\right] & \frac{(\alpha+x^{T}_1\beta)^2-(\alpha+x^{T}_0\beta)^2}{2} & \left(x_{1}-x_{0}\right)^{T}M\\
\vdots & \vdots & \vdots & \vdots \\
(x_{k}-x_{0})^{T}\beta & -\left[\alpha(x_{k}-x_{0})^{T}+\beta^{T}(x_kx^{T}_k-x_0x^{T}_0)\right] & \frac{(\alpha+x^{T}_k\beta)^2-(\alpha+x^{T}_0\beta)^2}{2} & \left(x_{k}-x_{0}\right)^{T}M\\
\end{bmatrix}
    }
    $$

The dimension of $J$ is $dim(J)=2k\times (1+2d)$. Assume $2k\geq (1+2d)$. A sufficient condition to make $J$ full rank is knowing $\alpha$ or at least one element of $\eta$. 

Note that in this example, \textbf{$p(X \mid Y)$ is in the exponential family}, since:
\begin{align*}
    p(x \mid y)&=\frac{p(y \mid x) p(x)}{p(y)}\\
    &=\exp \left\{-\frac{\left[y-\left(\alpha+x^{T} \beta\right)\right]^2}{2 \Phi}+\log \frac{1}{\sqrt{2 \pi \Phi}}+x^{T} \eta+c(x)-\log p(y)\right\}\\
    &=\exp\left\{\left[\frac{(y \beta-\alpha \beta)^{T}}{\Phi}+\eta^{T}, -\frac{\operatorname{vec}\left(\beta \beta^{T}\right)^{T}}{2 \Phi}\right]\left(\begin{array}{c}
x \\
\operatorname{vec}\left(x x^{T}\right)
\end{array}\right)-\frac{(y-\alpha)^2}{2 \Phi}+\log \frac{1}{\sqrt{2 \pi \Phi}}+c(x)-\log p(y)\right\}. 
\end{align*}   
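The last equality relies on an elementary expansion of the squared term, spelled out here as a sketch for completeness. Since $\beta\beta^{T}$ and $xx^{T}$ are symmetric, 
\begin{align*}
\left(x^{T}\beta\right)^2=\beta^{T}xx^{T}\beta=\operatorname{tr}\left(\beta\beta^{T}xx^{T}\right)=\operatorname{vec}\left(\beta\beta^{T}\right)^{T}\operatorname{vec}\left(xx^{T}\right),
\end{align*}
and therefore 
\begin{align*}
-\frac{\left[y-\left(\alpha+x^{T}\beta\right)\right]^2}{2\Phi}=-\frac{(y-\alpha)^2}{2\Phi}+\frac{(y\beta-\alpha\beta)^{T}x}{\Phi}-\frac{\operatorname{vec}\left(\beta\beta^{T}\right)^{T}\operatorname{vec}\left(xx^{T}\right)}{2\Phi}.
\end{align*}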

\vspace{0.5cm}
\subsection{Lemma~\ref{lemma:full_law} \quad {\small (Full law identification)}}\label{app:sub-full-id}

Using the DAG factorization we have
$$p\left(X, Y, R_x=1, R_y=1\right)= p(X, Y) \ p(R_x=1 \mid Y) \ p(R_y=1 \mid X, R_x=1).$$

Given the above relation and the fact that the target law $p(X,Y)$ is identified, it is straightforward to conclude that 
 $p(R_x\mid Y)$ is also identified. We now prove that, under the completeness condition, $p(R_y\mid X,R_x)$ is also identified, and therefore the full law is identified.
The full observed data law can be written down as follows: 
\begin{align*}
	\mathcal{L}_\text{full}(Z_\text{obs}, R; \theta, \psi) 
	&= \prod_{R_x=1, R_y=1} p(X, Y, R_x=1, R_y=1) \times \prod_{R_x=1, R_y=0}  \int p(X, Y, R_x=1, R_y=0) dy \\
	&\ \times \prod_{R_x=0, R_y=1} \int p(X, Y, R_x=0, R_y=1) dx \times \prod_{R_x=0, R_y=0}  \int p(X, Y, R_x=0, R_y=0) dxdy.
 \end{align*}
Given the fact that $p(X,Y)$, $p(R_x=1\mid Y)$, and $p(R_x=0,R_y=0)$ are all identified, the following would stay the same across different models: 

{\small 
\begin{align*}
    &\prod_{R_x=1, R_y=1} p(X, Y, R_x=1, R_y=1) \times \prod_{R_x=1, R_y=0}  \int p(X, Y, R_x=1, R_y=0) dy \times \prod_{R_x=0, R_y=0}  \int p(X, Y, R_x=0, R_y=0) dxdy. 
\end{align*}
}
 
Suppose there exist $p_1(R_y\mid X, R_x)$ and $p_2(R_y\mid X, R_x)$ such that 
$$\int p(X, Y)p(R_x=0\mid Y)p_1(R_y=1\mid R_x=0,X) dx=\int p(X, Y)p(R_x=0\mid Y)p_2(R_y=1\mid R_x=0,X) dx.$$
Letting $g(X)=p_1(R_y=1\mid R_x=0,X)-p_2(R_y=1\mid R_x=0,X)$, we have
$$p(R_x=0\mid Y = y)\ p(Y = y)\int p(x \mid Y = y) \ g(x) \ dx=0,\, \forall y.$$
This implies that $E[g(X)\mid Y = y]=0,\, \forall y$. 
In our case, $g(X)$ is bounded and thus has a finite mean. By the completeness condition, $g(X)=0$ almost surely, which implies $p_1(R_y \mid X, R_x) = p_2(R_y\mid X, R_x)$ almost surely. We conclude that the full law is indeed identified.


%##############################################
\newpage
\section{Examples from the exponential family distributions}\label{app:parID-ex}

In order to better illustrate the implications of Theorem~\ref{thm:id-par}, we provide explicit sufficient identification conditions in a variety of examples in the class of exponential family distributions. In all subsequent examples, we assume that if $X$ is continuous, a sufficient number of unique $X$ values have been observed such that the first condition in Theorem~\ref{thm:id-par}, namely that $k \geq dim(\theta)$, is satisfied. If $X$ is discrete, it is assumed that every category of $X$ is observed in the sample. 

%%%%%%%%% Example 1: Bivariate Normal

\subsection{$X$ and $Y$ are bivariate normal}\label{app:parID-ex-bivar}

Suppose  
$$\left(\begin{array}{l} Y \\ X\end{array}\right) \quad \sim \quad \N\left[\left(\begin{array}{l}\mu_{1} \\ \mu_{2}\end{array}\right), 
\left(\begin{array}{cc}\sigma_{1}^{2} & \rho \sigma_{1} \sigma_{2} \\ \rho \sigma_{1} \sigma_{2} & \sigma_{2}^{2} \end{array}\right)\right].$$

According to Theorem~\ref{thm:id-par}, $p(X,Y)$ is identifiable if at least one of $\mu_1$ and $\mu_2$ is known, in addition to at least one more parameter in $\{\sigma_1, \sigma_2, \rho\}$. As special cases, when the marginal distribution of either $X$ or $Y$ is known, we can identify $p(X,Y)$.

The above claim can be proven as follows. First, we note that $p(X \mid Y)$ also follows a normal distribution: 
\begin{align*}
    X \mid Y \sim \N\left[\mu_{2}+\rho \frac{\sigma_{2}}{\sigma_{1}}\left(y-\mu_{1}\right),\left(1-\rho^{2}\right) \sigma_{2}^{2}\right]. 
\end{align*}
Since $p(X \mid Y)$ is nonparametrically identified, it means the mean and variance are both identifiable, i.e., $\mu_{2}+\rho \frac{\sigma_{2}}{\sigma_{1}}\left(y-\mu_{1}\right)$ and $\left(1-\rho^{2}\right) \sigma_{2}^{2}$. Thus the following three parameters are identified: 
\begin{align*}
    \mu_{2}-\rho \frac{\sigma_{2}}{\sigma_{1}}\mu_{1},\quad\rho \frac{\sigma_{2}}{\sigma_{1}},\quad \left(1-\rho^{2}\right) \sigma_{2}^{2}
\end{align*}
Let $\theta=(\mu_1,\mu_2,\sigma_1,\sigma_2,\rho)$. Taking derivatives of these three quantities with respect to $\theta$, we obtain the following Jacobian matrix: 
\begin{align*}
    J=\left[\begin{array}{ccccc}-\rho \frac{\sigma_{2}}{\sigma_{1}} & 1 & \rho \frac{\sigma_{2}}{\sigma_{1}^{2}} \mu_{1} & -\rho \frac{1}{\sigma_{1}} \mu_{1} & -\frac{\sigma_{2}}{\sigma_{1}} \mu_{1} \\ 0 & 0 & -\rho \frac{\sigma_{2}}{\sigma_{1}^{2}} & \rho \frac{1}{\sigma_{1}} & \frac{\sigma_{2}}{\sigma_{1}} \\ 0 & 0 & 0 & 2\left(1-\rho^{2}\right) \sigma_{2} & -2 \rho \sigma_{2}^{2}\end{array}\right]
\end{align*}
The number of unknown parameters is greater than the number of equations. To establish target law identification, we need to assume two of the five parameters are known. However, not every pair of parameters will be useful in establishing identification. We go over different options one by one: ($|J|$ denotes the determinant of matrix $J$.)
\begin{enumerate}
    \item Assume $\mu_1, \mu_2$ are known, then $|J|= 0\Longrightarrow \text{target law \textbf{is not} identified}$ (the first row is $-\mu_1$ times the second row)
    $$J=\left[\begin{array}{ccc} \rho \frac{\sigma_{2}}{\sigma_{1}^{2}} \mu_{1} & -\rho \frac{1}{\sigma_{1}} \mu_{1} & -\frac{\sigma_{2}}{\sigma_{1}} \mu_{1} \\  -\rho \frac{\sigma_{2}}{\sigma_{1}^{2}} & \rho \frac{1}{\sigma_{1}} & \frac{\sigma_{2}}{\sigma_{1}} \\  0 & 2\left(1-\rho^{2}\right) \sigma_{2} & -2 \rho \sigma_{2}^{2}\end{array}\right]$$
    
    \item Assume $\mu_1, \sigma_1$ are known, then $|J|\neq 0\Longrightarrow \text{target law is identified}$
    $$J=\left[\begin{array}{ccc} 1 &  -\rho \frac{1}{\sigma_{1}} \mu_{1} & -\frac{\sigma_{2}}{\sigma_{1}} \mu_{1} \\  0 &  \rho \frac{1}{\sigma_{1}} & \frac{\sigma_{2}}{\sigma_{1}} \\  0 &  2\left(1-\rho^{2}\right) \sigma_{2} & -2 \rho \sigma_{2}^{2}\end{array}\right]$$
    
    \item Assume $\mu_1, \sigma_2$ are known, then $|J|\neq 0\Longrightarrow \text{target law is identified}$
    $$J=\left[\begin{array}{ccc} 1 & \rho \frac{\sigma_{2}}{\sigma_{1}^{2}} \mu_{1} &  -\frac{\sigma_{2}}{\sigma_{1}} \mu_{1} \\  0 & -\rho \frac{\sigma_{2}}{\sigma_{1}^{2}} &  \frac{\sigma_{2}}{\sigma_{1}} \\ 0 & 0 &  -2 \rho \sigma_{2}^{2}\end{array}\right]$$
    
    \item Assume $\mu_1, \rho$ are known, then $|J|\neq 0\Longrightarrow \text{target law is identified}$
    $$J=\left[\begin{array}{ccc} 1 & \rho \frac{\sigma_{2}}{\sigma_{1}^{2}} \mu_{1} & -\rho \frac{1}{\sigma_{1}} \mu_{1}  \\  0 & -\rho \frac{\sigma_{2}}{\sigma_{1}^{2}} & \rho \frac{1}{\sigma_{1}} \\  0 & 0 & 2\left(1-\rho^{2}\right) \sigma_{2} \end{array}\right]$$
    
    \item Assume $\mu_2, \sigma_1$ are known, then $|J|\neq 0\Longrightarrow \text{target law is identified}$
    $$J=\left[\begin{array}{ccc} -\rho \frac{\sigma_{2}}{\sigma_{1}} &  -\rho \frac{1}{\sigma_{1}} \mu_{1} & -\frac{\sigma_{2}}{\sigma_{1}} \mu_{1} \\ 0 &  \rho \frac{1}{\sigma_{1}} & \frac{\sigma_{2}}{\sigma_{1}} \\ 0 & 2\left(1-\rho^{2}\right) \sigma_{2} & -2 \rho \sigma_{2}^{2}\end{array}\right]$$
    
    \item Assume $\mu_2, \sigma_2$ are known, then $|J|\neq 0\Longrightarrow \text{target law is identified}$ 
    $$J=\left[\begin{array}{ccc}-\rho \frac{\sigma_{2}}{\sigma_{1}} &  \rho \frac{\sigma_{2}}{\sigma_{1}^{2}} \mu_{1}  & -\frac{\sigma_{2}}{\sigma_{1}} \mu_{1} \\ 0 &  -\rho \frac{\sigma_{2}}{\sigma_{1}^{2}}  & \frac{\sigma_{2}}{\sigma_{1}} \\ 0 &  0 &  -2 \rho \sigma_{2}^{2}\end{array}\right]$$

    This recovers the case studied in \cite{zhao2015semiparametric}. 
    
    \item Assume $\mu_2, \rho$ are known, then $|J|\neq 0\Longrightarrow \text{target law is identified}$
    $$J=\left[\begin{array}{ccc}-\rho \frac{\sigma_{2}}{\sigma_{1}} &  \rho \frac{\sigma_{2}}{\sigma_{1}^{2}} \mu_{1} & -\rho \frac{1}{\sigma_{1}} \mu_{1}  \\ 0 &  -\rho \frac{\sigma_{2}}{\sigma_{1}^{2}} & \rho \frac{1}{\sigma_{1}}  \\ 0 &  0 & 2\left(1-\rho^{2}\right) \sigma_{2} \end{array}\right]$$
    
    \item Assume $\sigma_1, \sigma_2$ are known, then $|J|= 0\Longrightarrow \text{target law is \textbf{not} identified}$
    $$J=\left[\begin{array}{ccc}-\rho \frac{\sigma_{2}}{\sigma_{1}} & 1 &  -\frac{\sigma_{2}}{\sigma_{1}} \mu_{1} \\ 0 & 0 &  \frac{\sigma_{2}}{\sigma_{1}} \\ 0 & 0 &  -2 \rho \sigma_{2}^{2}\end{array}\right]$$
    
    \item Assume $\sigma_1, \rho$ are known, then $|J|= 0\Longrightarrow \text{target law \textbf{is not} identified}$
    $$J=\left[\begin{array}{ccc}-\rho \frac{\sigma_{2}}{\sigma_{1}} & 1 & -\rho \frac{1}{\sigma_{1}} \mu_{1} \\ 0 & 0 &  \rho \frac{1}{\sigma_{1}}  \\ 0 & 0  & 2\left(1-\rho^{2}\right) \sigma_{2} \end{array}\right]$$
    
    \item Assume $\sigma_2, \rho$ are known, then $|J|= 0\Longrightarrow \text{target law \textbf{is not} identified}$
    $$J=\left[\begin{array}{ccc}-\rho \frac{\sigma_{2}}{\sigma_{1}} & 1 &\rho \frac{\sigma_{2}}{\sigma_{1}^{2}} \mu_{1} \\ 0 & 0 & -\rho \frac{\sigma_{2}}{\sigma_{1}^{2}} \\ 0 & 0 & 0 \end{array}\right]$$
\end{enumerate}

This concludes that under the bivariate normal distribution, the target law is identified if either $\mu_1$ or $\mu_2$ is known, in addition to knowing at least one more parameter in $\{\sigma_1, \sigma_2, \rho\}$.
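
To make one of the identified cases concrete, consider the sixth case above, where $\mu_2$ and $\sigma_2$ are known. The symbols $c$, $s$, and $v$ below are shorthand introduced only for this sketch; they denote the three identified quantities, and we assume $s\neq 0$. Then 
\begin{align*}
    c=\mu_{2}-\rho \frac{\sigma_{2}}{\sigma_{1}}\mu_{1},\quad s=\rho \frac{\sigma_{2}}{\sigma_{1}},\quad v=\left(1-\rho^{2}\right) \sigma_{2}^{2}
    \quad\Longrightarrow\quad
    \rho^{2}=1-\frac{v}{\sigma_{2}^{2}},\quad \sigma_{1}=\frac{\rho\,\sigma_{2}}{s},\quad \mu_{1}=\frac{\mu_{2}-c}{s},
\end{align*}
where the sign of $\rho$ equals the sign of $s$ because $\sigma_1,\sigma_2>0$. For example, $\mu_2=2$, $\sigma_2=1$, $c=1$, $s=0.5$, and $v=0.75$ give $\rho=0.5$, $\sigma_1=1$, and $\mu_1=2$. In contrast, when only $\sigma_1$ and $\sigma_2$ are known, $\rho=s\sigma_1/\sigma_2$ is recovered, but $\mu_1$ and $\mu_2$ enter only through the single equation $c=\mu_2-s\mu_1$ and cannot be separated.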

It is straightforward to show that \textbf{$p(X\mid Y)$ lies in the exponential family}.


%%%%%%%%% Example 2: Normal inverse link

\subsection{$X$ and $Y\mid X$ are normal under inverse link}\label{app:parID-ex-normal-inverse}

Suppose 
\begin{align*}
    X \sim \N\left(\mu, \phi_x\right), \qquad 
Y \mid X \sim \N\left((\alpha+\beta x)^{-1}, \phi\right). 
\end{align*} 

According to Theorem~\ref{thm:id-par}, $p(X,Y)$ is identifiable without any additional assumptions on the unknown parameter vector $\theta=(\alpha,\beta,\phi,\mu,\phi_x)$. This can be proven as follows: based on Theorem~\ref{thm:id-par}, we have the following equations,
\begin{align*}
\phi_i(\theta) &=\frac{\left(\alpha+\beta x_{i}\right)^{-1}-\left(\alpha+\beta x_{0}\right)^{-1}}{\phi} \\ 
\zeta_i(\theta) &=\left\{-\frac{b\left[\left(\alpha+\beta x_{i}\right)^{-1}\right]-b\left[\left(\alpha+\beta x_{0}\right)^{-1}\right]}{\phi}+\frac{\mu\left(x_{i}-x_{0}\right)}{\phi_{x}}+c\left(x_{i}, \phi_{x}\right)-c\left(x_{0}, \phi_{x}\right)\right\} \\ 
&=-\frac{\left(\alpha+\beta x_{i}\right)^{-2}-\left(\alpha+\beta x_{0}\right)^{-2}}{2 \phi}+\frac{\mu\left(x_{i}-x_{0}\right)}{\phi_{x}}-\frac{x_{i}^{2}-x_{0}^{2}}{2 \phi_{x}}, \quad 
\text{where } i\in (1,\ldots, k). 
\end{align*} 

The Jacobian matrix is as follows: 

$$\begin{bmatrix}
-\frac{\left(\alpha+\beta x_{1}\right)^{-2}-\left(\alpha+\beta x_{0}\right)^{-2}}{\phi} & -\frac{\left(\alpha+\beta x_{1}\right)^{-2} x_{1}-\left(\alpha+\beta x_{0}\right)^{-2} x_{0}}{\phi} & -\frac{\left(\alpha+\beta x_{1}\right)^{-1}-\left(\alpha+\beta x_{0}\right)^{-1}}{\phi^{2}} & 0 & 0\\
% -\frac{\left(\alpha+\beta x_{2}\right)^{-2}-\left(\alpha+\beta x_{0}\right)^{-2}}{\phi} & -\frac{\left(\alpha+\beta x_{2}\right)^{-2} x_{2}-\left(\alpha+\beta x_{0}\right)^{-2} x_{0}}{\phi} & -\frac{\left(\alpha+\beta x_{2}\right)^{-1}-\left(\alpha+\beta x_{0}\right)^{-1}}{\phi^{2}} & 0 & 0\\
\vdots & & & & \vdots\\
-\frac{\left(\alpha+\beta x_{k}\right)^{-2}-\left(\alpha+\beta x_{0}\right)^{-2}}{\phi} & -\frac{\left(\alpha+\beta x_{k}\right)^{-2} x_{k}-\left(\alpha+\beta x_{0}\right)^{-2} x_{0}}{\phi} & -\frac{\left(\alpha+\beta x_{k}\right)^{-1}-\left(\alpha+\beta x_{0}\right)^{-1}}{\phi^{2}} & 0 & 0\\
2 \frac{\left(\alpha+\beta x_{1}\right)^{-3}-\left(\alpha+\beta x_{0}\right)^{-3}}{2 \phi} & 2 \frac{\left(\alpha+\beta x_{1}\right)^{-3} x_{1}-\left(\alpha+\beta x_{0}\right)^{-3} x_{0}}{2 \phi} & \frac{\left(\alpha+\beta x_{1}\right)^{-2}-\left(\alpha+\beta x_{0}\right)^{-2}}{2 \phi^2} & \frac{x_{1}-x_{0}}{\phi_{x}} & \frac{\left(x_{1}-x_{0}\right)\left(x_{1}+x_{0}-2 \mu\right)}{2 \phi_{x}^{2}}\\
% 2 \frac{\left(\alpha+\beta x_{2}\right)^{-3}-\left(\alpha+\beta x_{0}\right)^{-3}}{2 \phi} & 2 \frac{\left(\alpha+\beta x_{2}\right)^{-3} x_{2}-\left(\alpha+\beta x_{0}\right)^{-3} x_{0}}{2 \phi} & \frac{\left(\alpha+\beta x_{2}\right)^{-2}-\left(\alpha+\beta x_{0}\right)^{2}}{2 \phi^2} & \frac{x_{2}-x_{0}}{\phi_{x}} & \frac{\left(x_{2}-x_{0}\right)\left(x_{2}+x_{0}-2 \mu\right)}{2 \phi_{x}^{2}}\\
\vdots & & & & \vdots\\
2 \frac{\left(\alpha+\beta x_{k}\right)^{-3}-\left(\alpha+\beta x_{0}\right)^{-3}}{2 \phi} & 2 \frac{\left(\alpha+\beta x_{k}\right)^{-3} x_{k}-\left(\alpha+\beta x_{0}\right)^{-3} x_{0}}{2 \phi} & \frac{\left(\alpha+\beta x_{k}\right)^{-2}-\left(\alpha+\beta x_{0}\right)^{-2}}{2 \phi^2} & \frac{x_{k}-x_{0}}{\phi_{x}} & \frac{\left(x_{k}-x_{0}\right)\left(x_{k}+x_{0}-2 \mu\right)}{2 \phi_{x}^{2}}
\end{bmatrix}$$

After performing some rank-preserving modifications to this matrix, we get: 

$$\resizebox{\textwidth}{!}{\begin{bmatrix}
\left(\alpha+\beta x_{1}\right)^{-2}-\left(\alpha+\beta x_{0}\right)^{-2} & \left(\alpha+\beta x_{1}\right)^{-2} x_{1}-\left(\alpha+\beta x_{0}\right)^{-2} x_{0} &\left(\alpha+\beta x_{1}\right)^{-1}-\left(\alpha+\beta x_{0}\right)^{-1} & 0 & 0\\
% \left(\alpha+\beta x_{2}\right)^{-2}-\left(\alpha+\beta x_{0}\right)^{-2} & \left(\alpha+\beta x_{2}\right)^{-2} x_{2}-\left(\alpha+\beta x_{0}\right)^{-2} x_{0} &\left(\alpha+\beta x_{2}\right)^{-1}-\left(\alpha+\beta x_{0}\right)^{-1} & 0 & 0\\
\vdots & & & & \vdots\\
\left(\alpha+\beta x_{k}\right)^{-2}-\left(\alpha+\beta x_{0}\right)^{-2} & \left(\alpha+\beta x_{k}\right)^{-2} x_{k}-\left(\alpha+\beta x_{0}\right)^{-2} x_{0} &\left(\alpha+\beta x_{k}\right)^{-1}-\left(\alpha+\beta x_{0}\right)^{-1} & 0 & 0\\
\left(\alpha+\beta x_{1}\right)^{-3}-\left(\alpha+\beta x_{0}\right)^{-3} & \left(\alpha+\beta x_{1}\right)^{-3} x_1 -\left(\alpha+\beta x_{0}\right)^{-3} x_0 & \frac{1}{2}\left[\left(\alpha+\beta x_{1}\right)^{-2}-\left(\alpha+\beta x_{0}\right)^{-2}\right] & x_1-x_0 & \left(x_{1}-x_{0}\right)\left(x_{1}+x_{0}-2 \mu\right)\\
% \left(\alpha+\beta x_{2}\right)^{-3}-\left(\alpha+\beta x_{0}\right)^{-3} & \left(\alpha+\beta x_{2}\right)^{-3} x_2 -\left(\alpha+\beta x_{0}\right)^{-3} x_0 & \frac{1}{2}\left[\left(\alpha+\beta x_{2}\right)^{-2}-\left(\alpha+\beta x_{0}\right)^{-2}\right] & x_2-x_0 & \left(x_{2}-x_{0}\right)\left(x_{2}+x_{0}-2 \mu\right)\\
\vdots & & & & \vdots\\
\left(\alpha+\beta x_{k}\right)^{-3}-\left(\alpha+\beta x_{0}\right)^{-3} & \left(\alpha+\beta x_{k}\right)^{-3} x_k -\left(\alpha+\beta x_{0}\right)^{-3} x_0 & \frac{1}{2}\left[\left(\alpha+\beta x_{k}\right)^{-2}-\left(\alpha+\beta x_{0}\right)^{-2}\right] & x_k-x_0 & \left(x_{k}-x_{0}\right)\left(x_{k}+x_{0}-2 \mu\right)
\end{bmatrix}}$$
which is of full rank. 

It is worth pointing out that, unlike the example in Section~\ref{app:parID-ex-bivar}, \textbf{$p(X\mid Y)$ in this example is not in the exponential family}, since: 
\begin{align*}
    p(x \mid y)&=\frac{p(y \mid x) p(x)}{p(y)}=\frac{\N\left((\alpha+\beta x)^{-1}, \phi\right) \N\left(\mu, \phi_{x}\right)}{p(y)}\\
    &=\exp \left\{-\frac{\left(y-\frac{1}{\alpha+\beta x}\right)^{2}}{2\phi}+\log \frac{1}{\sqrt{2 \pi \phi}}-\frac{(x-\mu)^{2}}{2 \phi_{x}}+\log \frac{1}{\sqrt{2 \pi \phi_{x}}}-\log p(y)\right\}\\
    &=\exp \left\{-\frac{\frac{1}{(\alpha+\beta x)^{2}}-\frac{2 y}{\alpha+\beta x}+y^{2}}{2\phi}+\log \frac{1}{\sqrt{2 \pi \phi}}-\frac{(x-\mu)^{2}}{2 \phi_{x}}+\log \frac{1}{\sqrt{2 \pi \phi_{x}}}-\log p(y)\right\}. 
\end{align*}
The terms in $(\alpha+\beta x)^{-1}$ and $(\alpha+\beta x)^{-2}$ do not separate into a product of a function of $(y,\theta)$ and a statistic of $x$ alone.


%%%%%%%%% Example 3: Binary

\subsection{$X$ and $Y$ are binary}\label{app:parID-ex-binary}

Suppose $p(X = 0, Y = 1) = p_1$, $p(X = 1, Y = 0) = p_2$, $p(X = 0, Y = 0) = p_3$, and $p(X = 1, Y = 1) = p_4$, where $\sum_{i = 1}^4 p_i=1, p_i \neq 0$. 
%$$\begin{array}{ll|l}X & Y & P \\ \hline 0 & 1 & p_{1} \\ 1 & 0 & p_{2} \\ 0 & 0 & p_{3} \\ 1 & 1 & p_{4}\end{array}$$
The unknown parameters of interest are $\theta=(p_1,p_2,p_3,p_4)$. 

In this binary case, $X$ takes at most two distinct values, $0$ and $1$. According to Theorem~\ref{thm:id-par}, $p(X,Y)$ is identifiable if any one of the $p_i$ is known, or if the marginal distribution of either $X$ or $Y$ is known. 

In order to prove the above claim, we look at two distinct parameterizations of $p(X, Y)$. 

\subsubsection{Parameterization 1}
Suppose $p_1 = p(X = 0, Y = 1)$, $p_2 = p(X = 1, Y = 0)$, $p_3 = p(X = 0, Y = 0)$, $p_4 = p(X = 1, Y = 1)$, $p_i \neq 0,\, i=1,\ldots, 4$. 
% $$\begin{array}{ll|l}X & Y & P \\ \hline 0 & 1 & p_{1} \\ 1 & 0 & p_{2} \\ 0 & 0 & p_{3} \\ 1 & 1 & p_{4}\end{array}$$

Since $p(X \mid Y)$ is nonparametrically identified, we obtain the following three equations with four unknowns: 
\begin{align*}
p(X=1 \mid Y=1) =\frac{p_{4}}{p_{1}+p_{4}},  \qquad  
p(X=1 \mid Y=0) =\frac{p_{2}}{p_{2}+p_{3}}, \qquad
\sum_{i=1}^{4} p_{i} =1
\end{align*}

In order to possibly achieve identification, we need to assume one parameter is known. We consider the four different scenarios one by one. 
\begin{enumerate}
    \item Assume $p_1$ is known, then $|J|\neq 0 \Longrightarrow \text{target law is identified}$
    $$J=\left[\begin{array}{ccc}0 & 0 & \frac{p_{1}}{\left(p_{1}+p_{4}\right)^{2}} \\ \frac{p_{3}}{\left(p_{2}+p_{3}\right)^{2}} & \frac{-p_{2}}{\left(p_{2}+p_{3}\right)^{2}} & 0 \\ 1 & 1 & 1\end{array}\right]$$
    
    \item Assume $p_2$ is known, then $|J|\neq 0 \Longrightarrow \text{target law is identified}$
    $$J=\left[\begin{array}{ccc}\frac{-p_{4}}{\left(p_{1}+p_{4}\right)^{2}} & 0 & \frac{p_{1}}{\left(p_{1}+p_{4}\right)^{2}} \\ 0 & \frac{-p_{2}}{\left(p_{2}+p_{3}\right)^{2}} & 0 \\ 1 & 1 & 1\end{array}\right]$$
    
    \item Assume $p_3$ is known, then $|J|\neq 0 \Longrightarrow \text{target law is identified}$
    $$J=\left[\begin{array}{ccc}\frac{-p_{4}}{\left(p_{1}+p_{4}\right)^{2}} & 0 & \frac{p_{1}}{\left(p_{1}+p_{4}\right)^{2}} \\ 0 & \frac{p_{3}}{\left(p_{2}+p_{3}\right)^{2}} & 0 \\ 1 & 1 & 1\end{array}\right]$$
    
    \item Assume $p_4$ is known, then $|J|\neq 0 \Longrightarrow \text{target law is identified}$
    $$J=\left[\begin{array}{ccc}\frac{-p_{4}}{\left(p_{1}+p_{4}\right)^{2}} & 0 & 0 \\ 0 & \frac{p_{3}}{\left(p_{2}+p_{3}\right)^{2}} & \frac{-p_{2}}{\left(p_{2}+p_{3}\right)^{2}} \\ 1 & 1 & 1\end{array}\right]$$
\end{enumerate}
In the binary case, it is also useful to consider the setting where a marginal distribution is known:
\begin{enumerate}
    \item Assume $p(Y=1)=p_1+p_4$ is known, then $|J|\neq 0 \Longrightarrow \text{target law is identified}$
    $$J=\left[\begin{array}{cccc}-\frac{p_{4}}{\left(p_{1}+p_{4}\right)^{2}} & 0 & 0 & \frac{p_{1}}{\left(p_{1}+p_{4}\right)^{2}} \\ 0 & \frac{p_{3}}{\left(p_{2}+p_{3}\right)^{2}} & -\frac{p_{2}}{\left(p_{2}+p_{3}\right)^{2}} & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1\end{array}\right]$$
   
    \item Assume $p(X=1)=p_2+p_4$ is known, then $|J|\neq 0 \Longrightarrow \text{target law is identified}$
    $$J=\left[\begin{array}{cccc}-\frac{p_{4}}{\left(p_{1}+p_{4}\right)^{2}} & 0 & 0 & \frac{p_{1}}{\left(p_{1}+p_{4}\right)^{2}} \\ 0 & \frac{p_{3}}{\left(p_{2}+p_{3}\right)^{2}} & -\frac{p_{2}}{\left(p_{2}+p_{3}\right)^{2}} & 0 \\ 1 & 1 & 1 & 1 \\ 0 & 1 & 0 & 1\end{array}\right]$$
\end{enumerate}
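
As an illustrative sketch of the first scenario, write $q_1=p(X=1 \mid Y=1)$ and $q_0=p(X=1 \mid Y=0)$ for the two identified conditional probabilities (this shorthand is introduced here and is not used elsewhere). If $p_1$ is known, the three equations can be solved in closed form: 
\begin{align*}
p_4=\frac{p_1\, q_1}{1-q_1}, \qquad p_2=q_0\left(1-p_1-p_4\right), \qquad p_3=\left(1-q_0\right)\left(1-p_1-p_4\right).
\end{align*}
For instance, $p_1=0.2$, $q_1=0.5$, and $q_0=1/3$ give $p_4=0.2$, $p_2=0.2$, and $p_3=0.4$, which sum to one together with $p_1$.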

\subsubsection{Parameterization 2}
We can also adopt another parameterization. Suppose 
\begin{align*}
     X \sim \operatorname{Bern}(p),  \qquad Y \mid  X\sim \operatorname{Bern}(a+bX) 
\end{align*}%
More specifically,
\begin{align*}
p(x) &=\exp \left\{x \log \frac{p}{1-p}+\log (1-p)\right\} =\exp \left\{x \cdot \eta_{x}-\log \left(1+e^{\eta_ x}\right)\right\} \quad \text{ where } \eta_x=\log \frac{p}{1-p}  \\ 
p(y\mid x)&=(a+b x)^{y}(1-a-b x)^{1-y} \\ &
=\exp \left\{y \log \frac{a+b x}{1-(a+b x)}+\log [1-(a+b x)]\right\}
\end{align*}
The parameter vector of interest is $\theta=(a,b,\eta_x)$. Based on Theorem~\ref{thm:id-par}, we have the following equations. Note that since $X$ is binary, there are at most two distinct values of $X$. Therefore, we have the following two equations:
\begin{align*}
& \phi_1(\theta)=\log \frac{a+b x_1}{1-\left(a+b x_1\right)}-\log \frac{a+b x_0}{1-\left(a+b x_0\right)} \\
& \zeta_1(\theta)=\log \left[1-\left(a+b x_1\right)\right]-\log \left[1-\left(a+b x_0\right)\right]+\left(x_1-x_0\right) \eta_x,\text{ where }x_1=1,x_0=0.
\end{align*}
The resulting Jacobian matrix is: 
$$J=\begin{bmatrix}
\frac{1}{(a+b)[1-(a+b)]}-\frac{1}{a(1-a)} & \frac{1}{(a+b)[1-(a+b)]} & 0\\
\frac{-1}{1-(a+b)}+\frac{1}{1-a} & \frac{-1}{1-(a+b)} & x_1-x_0
\end{bmatrix}$$
% After performing some rank-preserving modifications to this matrix, we get: 
% $$\begin{bmatrix}
% \frac{1}{1-(a+b)}-\frac{1}{1-a}+\frac{1}{a+b}-\frac{1}{a} & \frac{1}{1-(a+b)} & 0\\
% \frac{1}{1-(a+b)}-\frac{1}{1-a} & \frac{1}{1-(a+b)} & x_0-x_1
% \end{bmatrix}$$
This concludes that in order to establish target law identification, we need to know at least one parameter in $\{a, b, \eta_x\}$.
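
The two parameterizations are linked by the following correspondence, given here as a short sketch that uses only the definitions above: 
\begin{align*}
p=p_2+p_4, \qquad a=p(Y=1\mid X=0)=\frac{p_1}{p_1+p_3}, \qquad a+b=p(Y=1\mid X=1)=\frac{p_4}{p_2+p_4}.
\end{align*}
Under this correspondence, $\phi_1(\theta)$ is the log odds ratio between $X$ and $Y$, 
\begin{align*}
\phi_1(\theta)=\log \frac{a+b}{1-(a+b)}-\log \frac{a}{1-a}=\log \frac{p_3\, p_4}{p_1\, p_2},
\end{align*}
so the two equations involve three unknowns in either parameterization.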

It is straightforward to show that \textbf{$p(X\mid Y)$ lies in the exponential family}.


%%%%%%%%%%%% Example 4: X binary, Y normal under canonical link

\subsection{$X$ is binary and $Y\mid X$ is normal under canonical link}\label{app:parID-ex-binary_normal}

Suppose 
\begin{align*}
X \sim \operatorname{Bern}(p), \qquad Y \mid X \sim \N\left(a+b X, \sigma_{y}^{2}\right).  
\end{align*}
More specifically, 
\begin{align*}
&p(x) = p^{x}(1-p)^{1-x}=\exp \left\{x \cdot \log \frac{p}{1-p}+\log (1-p)\right\}=\exp \left\{x \cdot \eta-\log \left(1+e^{\eta}\right)\right\},\text{ where } \eta=\log \frac{p}{1-p} \\ 
&p(y\mid x) =\exp \left\{\frac{y(a+b x)-\frac{1}{2}(a+b x)^{2}}{\phi}+\left[-\frac{y^{2}}{2 \phi}-\frac{1}{2} \log \left(2 \pi \phi\right)\right]\right\}, \text{ where } \phi=\sigma_y^2.  
\end{align*}
The unknown parameter vector of interest is $\theta=(a, b, \phi, \eta)$. According to Theorem~\ref{thm:id-par}, $p(X,Y)$ is identifiable if either $a$ or $\eta$ is known, in addition to knowing one extra parameter in $\theta$. Knowing $\eta$ is equivalent to knowing $p$. 

In order to prove the above claim, we can construct the following equations: (note that when $X$ is binary, we only have at most two distinct values)
$$\begin{aligned}
  &\phi_{1}(\theta)=\frac{\left(a+b x_{1}\right)-\left(a+b x_{0}\right)}{\phi}=\frac{b\left(x_{1}-x_{0}\right)}{\phi}\\
  &\zeta_{1}(\theta)=-\frac{\left(a+b x_{1}\right)^{2}-\left(a+b x_{0}\right)^{2}}{2 \phi}+\eta\left(x_{1}-x_{0}\right),\text{ where }x_1=1,x_0=0. 
\end{aligned}$$
The Jacobian matrix is: 
$$J=\begin{bmatrix}
0 & \frac{x_{1}-x_{0}}{\phi} & -\frac{b\left( x_{1}-x_{0}\right)}{\phi^{2}} & 0\\
-\frac{b\left(x_{1}-x_{0}\right)}{\phi} & -\frac{a\left(x_{1}-x_{0}\right)+b\left(x_{1}^{2}-x_{0}^{2}\right)}{\phi} & \frac{\left(a+b x_{1}\right)^{2}-\left(a+b x_{0}\right)^{2}}{2 \phi^{2}} & x_{1}-x_{0}
\end{bmatrix}.$$
After some rank-preserving operations, we get: 
$$\begin{bmatrix}
0 & x_1-x_0 & x_1-x_0 & 0\\
1 & a\left(x_{1}-x_{0}\right)+b\left(x_{1}^{2}-x_{0}^{2}\right) & a\left(x_{1}-x_{0}\right)+\frac{b}{2}\left(x_{1}^{2}-x_{0}^{2}\right) & 1
\end{bmatrix}.$$
This concludes the claim that a sufficient set of assumptions for target law identification is knowing either $a$ or $\eta$, in addition to knowing one more parameter in $\theta$. 
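
To see where this requirement comes from, substitute $x_1=1$ and $x_0=0$ into the two equations. This is only a sketch; $\phi_1^{\ast}$ and $\zeta_1^{\ast}$ are notation introduced here for the identified values of $\phi_1(\theta)$ and $\zeta_1(\theta)$: 
\begin{align*}
\phi_1^{\ast}=\frac{b}{\phi}, \qquad \zeta_1^{\ast}=-\frac{2ab+b^{2}}{2\phi}+\eta.
\end{align*}
These are two equations in four unknowns. If, for example, $a$ and $\phi$ are known, then 
\begin{align*}
b=\phi\,\phi_1^{\ast}, \qquad \eta=\zeta_1^{\ast}+\frac{2ab+b^{2}}{2\phi}.
\end{align*}
If instead only $b$ and $\phi$ are known, the first equation carries no further information and the second determines only the combination $\eta-ab/\phi$, so $a$ and $\eta$ cannot be separated.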

Note that in this example, \textbf{$p(X \mid Y)$ is in the exponential family} since: 
$$\resizebox{\textwidth}{!}{\begin{aligned}
    p(x \mid y)&=\frac{p(y \mid x) p(x)}{p(y)}=\frac{N_{y}\left(a+b x, \sigma_{y}^{2}\right) p^{x}(1-p)^{1-x}}{p(y)}\\
    &=\exp \left\{-\frac{1}{2}\left(\frac{y-(a+b x)}{\sigma_{y}}\right)^{2}+\log \frac{1}{\sqrt{2 \pi} \sigma_{y}}+x \log p+(1-x) \log (1-p)-\log \left[p(y)\right]\right\} \\
    &=\exp \left\{-\frac{1}{2} \frac{\left(x, x^{2}\right)\left(2 a b-2 b y, b^{2}\right)^{T}+(a-y)^{2}}{\sigma_{y}^{2}}+\log \frac{1}{\sqrt{2 \pi} \sigma_{y}}+x \log p+(1-x) \log (1-p)-\log \left[p(y) \right]\right\} \\
    &=\exp \left\{\left(x, x^{2}\right)\left(-\frac{a b-b y}{\sigma_{y}^{2}}+\log\left(\frac{p}{1-p}\right), -\frac{b^{2}}{2 \sigma_{y}^{2}}\right)^{T}-\frac{(a-y)^{2}}{2 \sigma_{y}^{2}}+\log \frac{1}{\sqrt{2 \pi} \sigma_{y}}+\log(1-p)-\log \left[p(y)\right]\right\}.
\end{aligned}}$$


%%%%%%%%%%%% Example 5: X Poisson, Y normal under canonical link

\subsection{$X$ is Poisson and $Y \mid X$ is normal under canonical link}\label{app:parID-ex-poisson_normal}

Suppose 
\begin{align*}
    X \sim \operatorname{ Poisson }(\lambda), \qquad Y\mid X \sim \N\left(a+b x, \sigma_{y}^{2}\right). 
\end{align*}
More specifically, 
\begin{align*}
    &p(y\mid x)=\exp \left\{\frac{y(a+b x)-\frac{1}{2}(a+b x)^{2}}{\phi}+\left[-\frac{y^{2}}{2 \phi}-\frac{1}{2} \log \left(2 \pi \phi\right)\right]\right\}, \text{ where } \phi=\sigma_y^2 \\
    &p(x=k)=\frac{\lambda^{k} e^{-\lambda}}{k !} =\exp \{k \log \lambda-\lambda-\log k !\} =\exp \left\{k \eta_{x}-e^{\eta_{x}}-\log k !\right\},\text{ where } \eta_{x}=\log \lambda
\end{align*}
The unknown parameter vector of interest is $\theta=\left(a, b, \sigma_y^2, \lambda\right)$. According to Theorem~\ref{thm:id-par}, $p(X,Y)$ is identifiable if either $a$ or $\lambda$ is known. 

In order to prove the above claim, we can construct the following equations: 
\begin{align*}
    &\phi_{i}(\theta)=\frac{\left(a+b x_{i}\right)-\left(a+b x_{0}\right)}{\phi}=\frac{b\left(x_{i}-x_{0}\right)}{\phi} \\ 
    &\zeta_{i}(\theta)=-\frac{\left(a+b x_{i}\right)^{2}-\left(a+b x_{0}\right)^{2}}{2 \phi}+\eta_{x}\left(x_{i}-x_{0}\right)+\left(-\log x_{i} !+\log x_{0} !\right), \quad \text{where } i\in (1,\ldots,k)
\end{align*}
% Note when $X$ follows Poisson distribution, we can take infinite distinct $x$ values.
The Jacobian matrix is then as follows:
$$J=\begin{bmatrix}
0 & \frac{x_{1}-x_{0}}{\phi} & -\frac{b\left(x_{1}-x_{0}\right)}{\phi^{2}} & 0\\
0 & \frac{x_{2}-x_{0}}{\phi} & -\frac{b\left(x_{2}-x_{0}\right)}{\phi^{2}} & 0\\
\vdots & & & \vdots\\
0 & \frac{x_{k}-x_{0}}{\phi} & -\frac{b\left(x_{k}-x_{0}\right)}{\phi^{2}} & 0\\
-\frac{b\left(x_{1}-x_{0}\right)}{\phi} & -\frac{a\left(x_{1}-x_{0}\right)+b\left(x_{1}^{2}-x_{0}^{2}\right)}{\phi} & \frac{\left(a+b x_{1}\right)^{2}-\left(a+b x_{0}\right)^{2}}{2 \phi^{2}} & x_1-x_0\\
-\frac{b\left(x_{2}-x_{0}\right)}{\phi} & -\frac{a\left(x_{2}-x_{0}\right)+b\left(x_{2}^{2}-x_{0}^{2}\right)}{\phi} & \frac{\left(a+b x_{2}\right)^{2}-\left(a+b x_{0}\right)^{2}}{2 \phi^{2}} & x_2-x_0\\
\vdots & & & \vdots\\
-\frac{b\left(x_{k}-x_{0}\right)}{\phi} & -\frac{a\left(x_{k}-x_{0}\right)+b\left(x_{k}^{2}-x_{0}^{2}\right)}{\phi} & \frac{\left(a+b x_{k}\right)^{2}-\left(a+b x_{0}\right)^{2}}{2 \phi^{2}} & x_k-x_0\\
\end{bmatrix}. $$
After some rank-preserving operations, we get: 
$$\begin{bmatrix}
0 & x_1-x_0 & x_1-x_0 & 0\\
0 & x_2-x_0 & x_2-x_0 & 0\\
\vdots & & & \vdots\\
0 & x_k-x_0 & x_k-x_0 & 0\\
x_{1}-x_{0} & a\left(x_{1}-x_{0}\right)+b\left(x_{1}^{2}-x_{0}^{2}\right) & a\left(x_{1}-x_{0}\right)+\frac{b}{2}\left(x_{1}^{2}-x_{0}^{2}\right) & x_{1}-x_{0}\\
x_{2}-x_{0} & a\left(x_{2}-x_{0}\right)+b\left(x_{2}^{2}-x_{0}^{2}\right) & a\left(x_{2}-x_{0}\right)+\frac{b}{2}\left(x_{2}^{2}-x_{0}^{2}\right) & x_{2}-x_{0}\\
\vdots & & & \vdots\\
x_{k}-x_{0} & a\left(x_{k}-x_{0}\right)+b\left(x_{k}^{2}-x_{0}^{2}\right) & a\left(x_{k}-x_{0}\right)+\frac{b}{2}\left(x_{k}^{2}-x_{0}^{2}\right) & x_{k}-x_{0}\\
\end{bmatrix}. $$
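The first and last columns of the reduced matrix coincide, so the Jacobian cannot have full column rank when all four parameters are unknown. We therefore need to know either $a$ or $\eta_x$ (equivalently, $\lambda$) to establish identifiability.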
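
\noindent As an illustrative check (the specific support points below are our own choice and are not part of the argument above), take $k=2$ and $(x_0,x_1,x_2)=(0,1,2)$. The reduced matrix then becomes
\[
\begin{bmatrix}
0 & 1 & 1 & 0\\
0 & 2 & 2 & 0\\
1 & a+b & a+\frac{b}{2} & 1\\
2 & 2a+4b & 2a+2b & 2
\end{bmatrix}.
\]
The first and fourth columns are identical, so the rank is at most three. Dropping the fourth column (i.e., treating $\eta_x$ as known) and subtracting the third column from the second leaves the columns $(0,0,1,2)^T$, $(0,0,\tfrac{b}{2},2b)^T$, and $(1,2,a+\tfrac{b}{2},2a+2b)^T$, which are linearly independent whenever $b\neq 0$.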

Note that in this example, \textbf{$p(X\mid Y)$ is in the exponential family} since: 
{\small 
\begin{align*}
    p(x \mid y)&=\frac{p(y \mid x) p(x)}{p(y)}=\frac{N_{y}\left(a+b x, \sigma_{y}^{2}\right) \frac{\lambda^{x} e^{-\lambda}}{x !}}{p(y)}\\
    &=\exp \left\{-\frac{1}{2}\left(\frac{y-(a+b x)}{\sigma_{y}}\right)^{2}+\log \frac{1}{\sqrt{2 \pi} \sigma_{y}}+x \log \lambda-\lambda-\log x !-\log p(y)\right\}\\
    &=\frac{1}{x !} \exp \left\{-\frac{1}{2} \frac{\left(x, x^{2}\right)\left(2 a b-2 b y-2 \sigma_{y}^{2} \log \lambda, b^{2}\right)^{T}+(a-y)^{2}}{\sigma_{y}^{2}}+\log \frac{1}{\sqrt{2 \pi} \sigma_{y}}-\lambda-\log p(y)\right\}\\
    &=\frac{1}{x !} \exp \left\{\left(x, x^{2}\right)\left(-\frac{a b-b y-\sigma_{y}^{2} \log \lambda}{\sigma_{y}^{2}},-\frac{b^{2}}{2 \sigma_{y}^{2}}\right)^{T}-\frac{(a-y)^{2}}{2 \sigma_{y}^{2}}+\log \frac{1}{\sqrt{2 \pi} \sigma_{y}}-\lambda-\log p(y)\right\}. 
\end{align*}
}

%%%%%%%%%%%% Example 6: X exponential, Y normal under canonical link

\subsection{$X$ is exponential and $Y \mid X$ is normal under canonical link}\label{app:parID-ex-exponential_normal}

Suppose 
\begin{align*}
    X \sim \operatorname{exponential} (\lambda), \qquad Y\mid X \sim \N\left(a+b x, \sigma_{y}^{2}\right). 
\end{align*}
More specifically, 
\begin{align*}
    p(x) &=\lambda e^{-\lambda x}=\exp \{-\lambda x+\log \lambda\} \\
    p(y\mid x) &=\exp \left\{\frac{y(a+b x)-\frac{1}{2}(a+b x)^{2}}{\phi}+\left[-\frac{y^{2}}{2 \phi}-\frac{1}{2} \log \left(2 \pi \phi\right)\right]\right\}\quad \text{where } \phi=\sigma_y^2 
\end{align*}

The unknown vector of parameters is $\theta=\left(a, b, \phi, \lambda\right)$. According to Theorem~\ref{thm:id-par}, $p(X,Y)$ is identifiable if either $a$ or $\lambda$ is known. 

In order to prove the above claim, we can construct the following equations: 
\begin{align*}
&\phi_{i}(\theta)=\frac{b\left(x_{i}-x_{0}\right)}{\phi} \\ 
&\zeta_{i}(\theta)=-\frac{\left(a+b x_{i}\right)^{2}-\left(a+b x_{0}\right)^{2}}{2 \phi}-\lambda\left(x_{i}-x_{0}\right), \quad \text{where } i\in \{1,\ldots,k\}.
\end{align*}
The Jacobian matrix is
$$J=\begin{bmatrix}
0 & \frac{x_{1}-x_{0}}{\phi} & -\frac{b\left(x_{1}-x_{0}\right)}{\phi^{2}} & 0\\
% 0 & \frac{x_{2}-x_{0}}{\phi} & -\frac{b\left(x_{2}-x_{0}\right)}{\phi^{2}} & 0\\
\vdots & & & \vdots\\
0 & \frac{x_{k}-x_{0}}{\phi} & -\frac{b\left(x_{k}-x_{0}\right)}{\phi^{2}} & 0\\
-\frac{b\left(x_{1}-x_{0}\right)}{\phi} & -\frac{a\left(x_{1}-x_{0}\right)+b\left(x_{1}^{2}-x_{0}^{2}\right)}{\phi} & \frac{\left(a+b x_{1}\right)^{2}-\left(a+b x_{0}\right)^{2}}{2 \phi^{2}} & -(x_1-x_0)\\
% -\frac{b\left(x_{2}-x_{0}\right)}{\phi} & -\frac{a\left(x_{2}-x_{0}\right)+b\left(x_{2}^{2}-x_{0}^{2}\right)}{\phi} & \frac{\left(a+b x_{2}\right)^{2}-\left(a+b x_{0}\right)^{2}}{2 \phi^{2}} & -(x_2-x_0)\\
\vdots & & & \vdots\\
-\frac{b\left(x_{k}-x_{0}\right)}{\phi} & -\frac{a\left(x_{k}-x_{0}\right)+b\left(x_{k}^{2}-x_{0}^{2}\right)}{\phi} & \frac{\left(a+b x_{k}\right)^{2}-\left(a+b x_{0}\right)^{2}}{2 \phi^{2}} & -(x_k-x_0)\\
\end{bmatrix}.$$
After some rank-preserving operations, we get: 
$$\begin{bmatrix}
0 & x_1-x_0 & x_1-x_0 & 0\\
% 0 & x_2-x_0 & x_2-x_0 & 0\\
\vdots & & & \vdots\\
0 & x_k-x_0 & x_k-x_0 & 0\\
x_{1}-x_{0} & -\left[a\left(x_{1}-x_{0}\right)+b\left(x_{1}^{2}-x_{0}^{2}\right)\right] & -\left[a\left(x_{1}-x_{0}\right)+\frac{b}{2}\left(x_{1}^{2}-x_{0}^{2}\right)\right] & x_1-x_0\\
% x_{2}-x_{0} & -\left[a\left(x_{2}-x_{0}\right)+b\left(x_{2}^{2}-x_{0}^{2}\right)\right] & -\left[a\left(x_{2}-x_{0}\right)+\frac{b}{2}\left(x_{2}^{2}-x_{0}^{2}\right)\right] & x_2-x_0\\
\vdots & & & \vdots\\
x_{k}-x_{0} & -\left[a\left(x_{k}-x_{0}\right)+b\left(x_{k}^{2}-x_{0}^{2}\right)\right] & -\left[a\left(x_{k}-x_{0}\right)+\frac{b}{2}\left(x_{k}^{2}-x_{0}^{2}\right)\right] & x_k-x_0\\
\end{bmatrix}.$$
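As in the previous example, the first and last columns of the reduced matrix coincide, so full column rank requires either $a$ or $\lambda$ to be known, which establishes the claim.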

Note that in this example, \textbf{$p(X\mid Y)$ is in the exponential family} since: 
{\small 
\begin{align*}
p(x \mid y)=\frac{p(y \mid x) p(x)}{p(y)} & =\frac{N_{y}\left(a+b x, \sigma_{y}^{2}\right) \lambda e^{-\lambda x}}{p(y)} \\
& =\exp \left\{-\frac{1}{2}\left(\frac{y-(a+b x)}{\sigma_{y}}\right)^{2}+\log \frac{1}{\sqrt{2 \pi} \sigma_{y}}+\log \lambda-\lambda x-\log p(y)\right\}\\
&=\exp \left\{-\frac{1}{2} \frac{\left(x, x^{2}\right)\left(2 a b-2 b y+2 \sigma_{y}^{2} \lambda, b^{2}\right)^{T}+(a-y)^{2}}{\sigma_{y}^{2}}+\log \frac{1}{\sqrt{2 \pi} \sigma_{y}}+\log \lambda-\log p(y)\right\}\\
&=\exp \left\{\left(x, x^{2}\right)\left(-\frac{a b-b y+\sigma_{y}^{2} \lambda}{\sigma_{y}^{2}},-\frac{b^{2}}{2 \sigma_{y}^{2}}\right)^{T}-\frac{(a-y)^{2}}{2 \sigma_{y}^{2}}+\log \frac{1}{\sqrt{2 \pi} \sigma_{y}}+\log \lambda-\log p(y)\right\}. 
\end{align*} 
}
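
\noindent As a sketch of what this conditional law looks like (assuming $b \neq 0$; the symbols $\mu_y$ and $v$ are introduced here only for illustration), completing the square in $x$ in $\log p(y \mid x)+\log p(x)$ gives
\begin{align*}
-\frac{\{y-(a+b x)\}^{2}}{2 \sigma_{y}^{2}}-\lambda x
&= -\frac{(x-\mu_y)^{2}}{2 v} + \text{terms free of } x, \\
v &= \frac{\sigma_{y}^{2}}{b^{2}}, \qquad \mu_y=\frac{y-a}{b}-\frac{\lambda \sigma_{y}^{2}}{b^{2}}.
\end{align*}
Hence $p(x \mid y)$ is proportional to a $\N(\mu_y, v)$ density restricted to $x>0$, i.e., a truncated normal, which is an exponential family with sufficient statistic $(x, x^{2})$.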


%%%%%%%%%%%% Example 7: X, Y exponential under canonical link

\subsection{$X$ is exponential and $Y \mid X$ is exponential under canonical link}\label{app:parID-ex-exponential}

Suppose 
\begin{align*}
    X &\sim \operatorname{exponential} (\lambda_x), \\
    Y\mid X &\sim \operatorname{exponential} (\lambda), \quad \text{with } p(y \mid x)=\exp \{y(-\lambda)+\log \lambda\}=\exp \{y(a+bx)+\log [-(a+bx)]\}, \text{ where } -\lambda=a+bx.
\end{align*}
The unknown parameter vector is $\theta=\left(a, b, \lambda_x\right)$. According to Theorem~\ref{thm:id-par} and without any further assumptions on $\theta$, $p(X,Y)$ is identifiable. 

In order to prove the above claim, we can construct the following equations: 
\begin{align*}
&\phi_{i}(\theta)=b\left(x_{i}-x_{0}\right) \\ 
&\zeta_{i}(\theta)=\log \left[-(a+b x_{i})\right]-\log \left[-(a+b x_{0})\right]-\lambda_{x}\left(x_{i}-x_{0}\right), \quad  i\in \{1, \ldots,k\}.
\end{align*}
The Jacobian matrix is
$$J=\begin{bmatrix}
0 & x_1-x_0 & 0\\
% 0 & x_2-x_0 & 0\\
\vdots & & \vdots\\
0 & x_k-x_0 & 0\\
\frac{1}{a+b x_{1}}-\frac{1}{a+b x_{0}} & \frac{x_{1}}{a+b x_{1}}-\frac{x_{0}}{a+b x_{0}} & -\left(x_{1}-x_{0}\right)\\
% \frac{1}{a+b x_{2}}-\frac{1}{a+b x_{0}} & \frac{x_{2}}{a+b x_{2}}-\frac{x_{0}}{a+b x_{0}} & -\left(x_{2}-x_{0}\right)\\
\vdots & & \vdots\\
\frac{1}{a+b x_{k}}-\frac{1}{a+b x_{0}} & \frac{x_{k}}{a+b x_{k}}-\frac{x_{0}}{a+b x_{0}} & -\left(x_{k}-x_{0}\right)
\end{bmatrix}.$$
After some rank-preserving operations, we get: 
$$\begin{bmatrix}
0 & x_1-x_0 & 0\\
0 & x_2-x_0 & 0\\
\vdots & & \vdots\\
0 & x_k-x_0 & 0\\
\frac{1}{\left(a+b x_{1}\right)\left(a+b x_{0}\right)} & \frac{1}{\left(a+b x_{1}\right)\left(a+b x_{0}\right)} & 1\\
\frac{1}{\left(a+b x_{2}\right)\left(a+b x_{0}\right)} & \frac{1}{\left(a+b x_{2}\right)\left(a+b x_{0}\right)} & 1\\
\vdots & & \vdots\\
\frac{1}{\left(a+b x_{k}\right)\left(a+b x_{0}\right)} & \frac{1}{\left(a+b x_{k}\right)\left(a+b x_{0}\right)} & 1
\end{bmatrix}.$$
This matrix has full column rank, which establishes the claim. 
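
\noindent The rank-preserving operations can be made explicit as follows (a sketch, assuming $a \neq 0$, $b \neq 0$, and $x_i \neq x_0$ for all $i$). For each $i$,
\begin{align*}
\frac{1}{a+b x_{i}}-\frac{1}{a+b x_{0}} &= \frac{-b\left(x_{i}-x_{0}\right)}{\left(a+b x_{i}\right)\left(a+b x_{0}\right)}, \\
\frac{x_{i}}{a+b x_{i}}-\frac{x_{0}}{a+b x_{0}} &= \frac{a\left(x_{i}-x_{0}\right)}{\left(a+b x_{i}\right)\left(a+b x_{0}\right)}.
\end{align*}
Dividing each of the last $k$ rows of $J$ by $x_i-x_0$, rescaling the three columns by $-1/b$, $1/a$, and $-1$, respectively, and finally multiplying each of the first $k$ rows by $a$ yields the reduced matrix.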

Note that in this example, \textbf{$p(X\mid Y)$ is not in the exponential family (unless $a$ and $b$ are known)}, since: 
\begin{align*}
    p(x \mid y)=\frac{p(y \mid x) p(x)}{p(y)}=\frac{\exp \left\{y(a+b x)+\log [-(a+b x)]+x(-\lambda_{x})+\log \lambda_{x}\right\}}{p(y)}. 
\end{align*}
The main difficulty is the term $\log [-(a+b x)]$, which cannot be written as a finite sum of products of a function of $x$ and a function of the unknown parameters $(a, b)$. 

%##############################################
\newpage
\section{Estimation Proofs}\label{app:est-proofs} 

\subsection{Theorem~\ref{thm:est-order} \quad {\small (Conditional likelihood with order statistics)}}\label{app:sub-est-order}

\begin{proof}
  Let $l(\theta) = -\frac{2}{N(N-1)} \sum_{1\leq i<k\leq N} R_{x_i}R_{y_i}R_{x_k}R_{y_k} \log \{1+Q_{ik}(\theta)\}$. By a Taylor expansion of the score around $\theta_0$, we have
  \[
  0 = \frac{\partial l(\widetilde\theta)}{\partial\theta} = \frac{\partial l(\theta_0)}{\partial\theta} + \frac{\partial^2 l(\theta_0)}{\partial\theta^2}(\widetilde\theta-\theta_0) + o_p(N^{-1/2}).
  \]
  Therefore, 
  \[
  \sqrt{N}(\widetilde\theta-\theta_0) = - \left\{ \frac{\partial^2 l(\theta_0)}{\partial\theta^2} \right\}^{-1} \sqrt{N}\frac{\partial l(\theta_0)}{\partial\theta} + o_p(1).
  \]
  Since both $\frac{\partial^2 l(\theta_0)}{\partial\theta^2}$ and $\frac{\partial l(\theta_0)}{\partial\theta}$ are U-statistics, from the theory of U-statistics, we have
  \[
  \frac{\partial^2 l(\theta_0)}{\partial\theta^2} \xrightarrow{p} A, \mbox{ and } \sqrt{N}\frac{\partial l(\theta_0)}{\partial\theta} \xrightarrow{d} \N(0, B),
  \]
  which completes the proof.
\end{proof}
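
\noindent For completeness, the last step can be sketched as follows, under the assumption that $A$ is symmetric and invertible. Combining the two limits above with Slutsky's theorem gives
\[
\sqrt{N}(\widetilde\theta-\theta_0) = -A^{-1}\sqrt{N}\frac{\partial l(\theta_0)}{\partial\theta} + o_p(1) \xrightarrow{d} \N\left(0, A^{-1} B A^{-1}\right),
\]
since $-A^{-1}Z$ has covariance $A^{-1} B A^{-1}$ whenever $Z \sim \N(0, B)$.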

%+++++++++++++++++++++++++++++++++++++

\subsection{Theorem~\ref{thm:est-gmm} \quad {\small (Generalized estimating equations)}}\label{app:sub-est-gmm}

\begin{proof}
  The proof of (a) follows the standard argument for generalized estimating equations and is omitted here. To find the optimal choice of $f(Y)$, we compute
$$\begin{aligned} C & =\E\left\{-\Psi^{\prime}\left(X, Y, R_{x}, R_{y} ; \theta_{0}\right)\right\} \\ & =\E\left[\frac{R_{x} R_{y}}{p\left(R_{y}=1 \mid R_{x}=1, X\right)}  \frac{\partial \E(X \mid Y)}{\partial \theta}\bigg\rvert_{\theta=\theta_0}f(Y)^{T}\right] \\  & =\E\left[R_{x}  \frac{\partial \E(X \mid Y)}{\partial \theta}\bigg\rvert_{\theta=\theta_0}f(Y)^{T}\right]\\
& =\E\left\{w(Y) a(Y) f(Y)^{T}\right\},\end{aligned}$$
and
\begin{align*} D 
& =\E\left[\frac{R_{x} R_{y}}{p^{2}\left(R_{y}=1 \mid R_{x}=1, X\right)}(X-\E(X \mid Y))^{2}f(Y)f(Y)^{T}\right] \\ 
& =\E\left[R_{x} \frac{(X-\E(X \mid Y))^{2}}{\pi(X)}f(Y)f(Y)^{T}\right]\\
&=\E\left[w(Y) \frac{(X-\E(X \mid Y))^{2}}{\pi(X)}f(Y)f(Y)^{T}\right]\\
&=\E\left[w(Y) \E\left[\frac{(X-\E(X \mid Y))^{2}}{\pi(X)}\mid Y\right]f(Y)f(Y)^{T}\right]\\
&=\E\left[w(Y)b(Y)f(Y)f(Y)^{T}\right],
\end{align*}
where $a(Y)=\frac{\partial \E(X \mid Y)}{\partial \theta}\big\rvert_{\theta=\theta_0}$, $b(Y)=\E\left[\frac{(X-\E(X \mid Y))^{2}}{\pi(X)}\mid Y\right]$, and $w(Y)=p(R_x=1 \mid Y)$.
By the Cauchy--Schwarz inequality, we have
$$\E\left(u v^{T}\right)\left\{\E\left(v v^{T}\right)\right\}^{-1} \E\left(v u^{T}\right) \lesssim \E\left(u u^{T}\right),$$ with equality when $u=v$.
Here $M \lesssim N$ means that $M-N$ is negative semi-definite.

\noindent Define $v=\sqrt{w(Y)} \sqrt{b(Y)} f(Y)$ and $ u=\sqrt{\frac{w(Y)}{b(Y)}} a(Y)$, then we have
\begin{align*}
    &\E\{w(Y) f(Y) a(Y)^T\}\left[\E\{w(Y) b(Y) f(Y) f(Y)^T\}\right]^{-1} \E\{w(Y) a(Y) f(Y)^T\}\lesssim \E\left\{\frac{w(Y)}{b(Y)} a(Y) a(Y)^T\right\}, \text{ i.e.,}\\
    &\E\{w(Y) a(Y) f(Y)^T\}^{-1}\E\{w(Y) b(Y) f(Y) f(Y)^T\}\E\{w(Y) f(Y) a(Y)^T\}^{-1}\gtrsim \E\left\{\frac{w(Y)}{b(Y)} a(Y) a(Y)^T\right\}^{-1}.
\end{align*}
Note that the right-hand side does not depend on $f(Y)$. Thus, when $f(Y)=f_{opt}(Y)=\frac{a(Y)}{b(Y)}$, 
equality holds, and we obtain the optimal variance $\left[\E\left\{\frac{w(Y)}{b(Y)} a(Y) a(Y)^T\right\}\right]^{-1}$.
\end{proof}
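
\noindent As a quick check of the equality case (a sketch; the shorthand $V$ is introduced here only for illustration), substitute $f(Y)=a(Y)/b(Y)$ and write $V=\E\left\{\frac{w(Y)}{b(Y)} a(Y) a(Y)^T\right\}$. Then
\begin{align*}
\E\{w(Y) a(Y) f(Y)^T\} &= \E\{w(Y) f(Y) a(Y)^T\} = V, \\
\E\{w(Y) b(Y) f(Y) f(Y)^T\} &= \E\left\{w(Y) b(Y) \frac{a(Y) a(Y)^T}{b(Y)^{2}}\right\} = V,
\end{align*}
so the sandwich form reduces to $V^{-1} V V^{-1}=V^{-1}$, which is the lower bound. Equivalently, $u=v$ for this choice of $f$.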


%##############################################
\newpage
\section{Additional discussions on estimation} 
\label{app:est-additional}

\subsection{Nonparametric estimation under additional assumptions}
\label{app:sub-est-permutation}

In addition to the independence restrictions in display~(\ref{eq:criss_cross_assump}), we assume that $p(R_y = 1 \mid R_x, X)$ is not a function of $X$ when $R_x = 0$. This additional assumption moves us from the criss-cross MNAR model to the permutation model considered by \cite{robins97non-a}. In the permutation model, one can estimate arbitrary functions of $X$ and $Y$ as follows. 

Let our parameter of interest be $\beta_h = \E[h(X, Y)]$, which can be identified via the following function of the observed data: 
$$\begin{aligned}
    \beta_h = \E\left[ \frac{R_x \ R_y \ h(X, Y)}{p(R_x = 1\mid Y) \ p(R_y = 1 \mid R_x = 1, X^*)} \right].  
    \label{eq:mid2}
\end{aligned}$$

\noindent The core idea of deriving the efficient influence function (EIF) for $\beta_h$ is to use an intermediate variable that first takes care of the missingness of $X$, and then $Y$ in a sequential manner. Intuitively, this is due to the fact that we can rewrite $\beta_h$ via an intermediate variable $\widetilde{\beta}_h(X, R_x, Y)$ as follows:
 \begin{align*}
    \widetilde{\beta}_h(X, R_x, Y) &= \frac{R_x}{p\left(R_x=1 \mid Y \right)} \ h\left(X, Y \right), \qquad 
     \beta_h =  \E\left[\frac{R_y  }{p\left(R_y=1 \mid R_x, X^*\right)} \ \widetilde{\beta}_h(X, R_x, Y) \right]. 
 \end{align*}
The claim made by \cite{robins97non-a} is that the EIF for $\beta_h$ is equal to the EIF for $ \E\left[\displaystyle \frac{R_y  }{p\left(R_y=1 \mid R_x, X^*\right)} \ \phi(\widetilde{\beta}_h) \right]$, where $\phi(\widetilde{\beta}_h) = \text{EIF}_{\widetilde{\beta}_h} + \E[\widetilde{\beta}_h]$ and  $\text{EIF}_{\widetilde{\beta}_h}$ denotes the efficient influence function for $\E\big[\widetilde{\beta}_h(X, R_x, Y)\big]$. 
Therefore, we first need to derive the EIF for $\E\big[\widetilde{\beta}_h(X, R_x, Y)\big]$. 
%Let $\phi(\widetilde{\beta}_h) = \text{EIF}_{\widetilde{\beta}_h} + \E[\widetilde{\beta}_h] = \text{EIF}_{\widetilde{\beta}_h} + \E[h(X, Y)].$
$$ \resizebox{\textwidth}{!}{\begin{aligned}
  \left.\frac{\partial \E[\widetilde{\beta}_h\left(p_{\varepsilon}\right)]}{\partial \varepsilon}\right|_{\varepsilon=0}&= \left.\frac{\partial}{\partial \varepsilon} \int \frac{R_x h\left(X, Y\right)}{p\left(R_x=1 \mid Y\right)} d p_{\varepsilon}\left(X, Y, R_x\right)\right|_{\varepsilon=0} \\
  &= -\int \frac{R_x h\left(X, Y\right)}{p\left(R_x=1 \mid Y\right)} S\left(R_x \mid Y\right) d p\left(X, Y, R_x\right) 
 +\int \frac{R_x h\left(X, Y\right)}{p\left(R_x=1 \mid Y\right)} S\left(X, Y, R_x\right) d p\left(X, Y, R_x\right) \\
  &= -\int \frac{R_x \E\left[h\left(X, Y\right) \mid R_x=1, Y\right]}{p\left(R_x=1 \mid Y\right)} S\left(R_x \mid Y\right) d p\left(R_x, Y\right)
  +\int\left\{\frac{R_x h\left(X, Y\right)}{p\left(R_x=1 \mid Y\right)}-\E\left[h\left(X, Y\right)\right]\right\} S\left(X, Y, R_x\right) d p\left(X, Y, R_x\right)\\
  & =-\int\left\{\frac{R_x \E\left[h\left(X, Y\right) \mid R_x=1, Y\right]}{p\left(R_x=1 \mid Y\right)}-\E\left[h\left(X, Y\right) \mid R_x=1, Y\right]\right\} S\left(R_x, Y\right) d p\left(R_x, Y\right) \\
  & +\int\left\{\frac{R_x h\left(X, Y\right)}{p\left(R_x=1 \mid Y\right)}-\E\left[h\left(X, Y\right)\right]\right\} S\left(X, Y, R_x\right) d p\left(X, Y, R_x\right)\\
  & =-\int\left\{\frac{R_x \E\left[h\left(X, Y\right) \mid R_x=1, Y\right]}{p\left(R_x=1 \mid Y\right)}-\E\left[h\left(X, Y\right) \mid R_x=1, Y\right]\right\} S\left(Y, R_x, X\right) d p\left(R_x, X, Y\right) \\
  &+\int \left\{\frac{R_x h\left(X, Y\right)}{p\left(R_x=1 \mid Y\right)}-\E\left[h\left(X, Y\right)\right]\right\} S\left(X, Y, R_x\right) d p\left(X, Y, R_x\right). 
  \end{aligned}}$$

\noindent Therefore, the efficient influence function for $\E[\widetilde{\beta}_h]$, denoted by $\text{EIF}_{\widetilde{\beta}_h}$, is 
as follows
\begin{align*}
    \text{EIF}_{\widetilde{\beta}_h} &= \frac{R_x }{p\left(R_x=1 \mid Y\right)} \Big\{h\left(X, Y\right) - \E\left[h\left(X, Y\right) \mid R_x = 1, Y \right] \Big\} + \Big\{ \E[h(X, Y) \mid R_x = 1, Y] - \E[h(X, Y)] \Big\}. 
\end{align*}
Thus we get: 
\begin{align*}
    \phi(\widetilde{\beta}_h) = \frac{R_x }{p\left(R_x=1 \mid Y\right)} \Big\{h\left(X, Y\right) - \E\left[h\left(X, Y\right) \mid R_x = 1, Y \right] \Big\} + \E\big[ h(X, Y) \mid R_x = 1, Y \big]. 
\end{align*}
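
\noindent As a sanity check on $\text{EIF}_{\widetilde{\beta}_h}$ (a sketch that uses $\E[\widetilde{\beta}_h]=\E[h(X, Y)]$), note that it has mean zero, as any influence function must. Conditioning on $(R_x, Y)$ shows that the first term vanishes in expectation:
\begin{align*}
&\E\left[\frac{R_x}{p\left(R_x=1 \mid Y\right)} \Big\{h(X, Y)-\E\left[h(X, Y) \mid R_x=1, Y\right]\Big\}\right] \\
&\quad = \E\left[\frac{R_x}{p\left(R_x=1 \mid Y\right)} \Big\{\E\left[h(X, Y) \mid R_x=1, Y\right]-\E\left[h(X, Y) \mid R_x=1, Y\right]\Big\}\right]=0.
\end{align*}
For the second term, $\E\big\{\E\left[h(X, Y) \mid R_x=1, Y\right]\big\}=\E\left[\frac{R_x h(X, Y)}{p\left(R_x=1 \mid Y\right)}\right]=\E[\widetilde{\beta}_h]=\E[h(X, Y)]$, so it also has mean zero.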

Following a similar procedure, we can easily obtain the EIF for $ \E\left[\displaystyle \frac{R_y  }{p\left(R_y=1 \mid R_x, X^*\right)} \ \phi(\widetilde{\beta}_h) \right]$, which yields the EIF for $\beta_h$ as follows: 
\begin{align*}
    \text{EIF}_{\beta_h} = 
    \frac{R_y}{p\left(R_y=1 \mid R_x, X^*\right)}\Big\{ \phi(\widetilde{\beta}_h) \ - \ \E\big[ \phi(\widetilde{\beta}_h) \mid R_y = 1, R_x, X^* \big]  \Big\} 
    + \Big\{  \  \E\big[ \phi(\widetilde{\beta}_h) \mid R_y = 1, R_x, X^*\big] - \beta_h \Big\}. 
\end{align*}

% \anna{
% $$\begin{aligned}
% \text{EIF}_{\widetilde{\beta}} & =\frac{R_x h\left(X, Y\right)}{p\left(R_x=1 \mid Y\right)}-\E\left[h\left(X, Y\right)\right]+\left[\frac{R_x}{p\left(R_x=1 \mid Y\right)}-1\right] \E\left[h\left(X, Y\right) \mid R_x=1, Y\right] \\
% % & =\frac{R_x h\left(X^*, Y\right)}{p\left(R_x=1 \mid Y\right)}-\E\left[h\left(X, Y\right)\right]+\left[\frac{R_x}{p\left(R_x=1 \mid Y\right)}-1\right] \E\left[h\left(X^*, Y\right) \mid R_x=1, Y\right]\\
% &=\frac{R_x h\left(X^*, Y\right)}{p\left(R_x=1 \mid Y\right)}-\E\left[\frac{R_x h\left(X^*, Y\right)}{p\left(R_x=1 \mid Y\right)}\right]+\left[\frac{R_x}{p\left(R_x=1 \mid Y\right)}-1\right] \E\left[h\left(X^*, Y\right) \mid R_x=1, Y\right].
% \end{aligned}$$
% %
% Note that $Y$ is actually not fully observed, and $EIF_1$ is a function of counterfactual $Y$. Let's denote $EIF_1=f(Y)$. We can estimate $f(Y)$ with the following estimating equation $$f\left(Y\right)=\E\left[\frac{R_y f\left(Y\right)}{p\left(R_y=1 \mid R_x, X^*\right)}\right].$$
% Following similar procedure as above, we have the efficient influence function of $f(Y)$ as
% $$EIF_2=\frac{R_y f(Y)}{p\left(R_y=1 \mid R_x, X^*\right)}-\E\left[f(Y)\right]+\left[\frac{R_y}{p\left(R_y=1 \mid R_x, X^*\right)}-1\right] \E\left[f(y) \mid R_y=1, R_x, X^*\right].$$
% Writing out the expression for $f(Y)$ in $EIF_2$, we get the final influence function for $\beta_h$ as\\
% $$\resizebox{\textwidth}{!}{\begin{aligned}
%     EIF_2&=\frac{R_y \bigg\{\frac{R_x h\left(X^*, Y\right)}{p\left(R_x=1 \mid Y\right)}-E\left[\frac{R_x h\left(X^*, Y\right)}{p\left(R_x=1 \mid Y\right)}\right]+\left[\frac{R_x}{p\left(R_x=1 \mid Y\right)}-1\right] E\left[h\left(X^*, Y\right) \mid R_x=1, Y\right]\bigg\}}{p\left(R_y=1 \mid R_x, X^*\right)}\\
%     &-E\bigg\{\frac{R_x h\left(X^*, Y\right)}{p\left(R_x=1 \mid Y\right)}-E\left[\frac{R_x h\left(X^*, Y\right)}{p\left(R_x=1 \mid Y\right)}\right]+\left[\frac{R_x}{p\left(R_x=1 \mid Y\right)}-1\right] E\left[h\left(X^*, Y\right) \mid R_x=1, Y\right]\bigg\}\\
%     &+\left[\frac{R_y}{p\left(R_y=1 \mid R_x, X^*\right)}-1\right] E\bigg\{\frac{R_x h\left(X^*, Y\right)}{p\left(R_x=1 \mid Y\right)}-E\left[\frac{R_x h\left(X^*, Y\right)}{p\left(R_x=1 \mid Y\right)}\right]+\left[\frac{R_x}{p\left(R_x=1 \mid Y\right)}-1\right] E\left[h\left(X^*, Y\right) \mid R_x=1, Y\right] \mid R_y=1, R_x, X^*\bigg\}
% \end{aligned}}$$
% }

%+++++++++++++++
\vspace{0.5cm}
\subsection{Maximum likelihood estimation} 

In the criss-cross MNAR model, the \textit{observed full data likelihood}, denoted by $\mathcal{L}_\text{obs}(X, Y, R; \theta, \psi)$, can be written as follows: 
\begin{align*}
	\mathcal{L}_\text{obs}(X, Y, R; \theta, \psi) 
	&= \prod_{R_x=1, R_y=1} p(X, Y, R_x=1, R_y=1) \times \prod_{R_x=1, R_y=0}  \int p(X, Y, R_x=1, R_y=0) dy \\
	&\ \times \prod_{R_x=0, R_y=1} \int p(X, Y, R_x=0, R_y=1) dx \times \prod_{R_x=0, R_y=0}  \int p(X, Y, R_x=0, R_y=0) dxdy.
\end{align*}

Under the conditions of Theorem~\ref{thm:id-par} and Condition~\ref{cond:completeness}, one can estimate the entire parameter vector of the full law by maximizing this likelihood, provided that the parametric forms of the propensity scores in the missingness mechanism are known. 

%##############################################
\newpage
\section{Additional experimental results}\label{app:sims}

\subsection{Simulation results}
%%%%%%%%%%% case 2
\noindent\underline{\textbf{Varying $\rho$.}} We examine the effect of the correlation coefficient on the efficiency of the estimators by varying $\rho$ from $-0.9$ to $0.9$ in increments of $0.2$. The sample size is $N=1000$. Table~\ref{table:2} displays the standard deviation (SD) of the three proposed estimators for different values of $\rho$. To avoid distorting the SD patterns through the Delta method, we report the SD of each method's direct estimates rather than converting them to the OR scale. The results indicate that both GEE methods provide more efficient estimators when $X$ and $Y$ are highly correlated, but exhibit more estimation uncertainty when the correlation is low. In contrast, the conditional likelihood estimator has less variability when the correlation is low.

\input{table2.tex}

%%%%%%%%%%% case 3
\noindent\underline{\textbf{Model misspecification.}} To understand the behavior of the proposed estimators under model misspecification, we generate data with the missingness mechanism for $Y$ given by $p(R_y=1 \mid X,R_x)= \operatorname{expit}(2-R_x+0.7X+0.2X^2)$. When the GEE estimators are fitted, however, the relationship between $R_y$ and $\{X,R_x\}$ is assumed to be linear. Figure~\ref{fig:mis} illustrates that, under this misspecification, both GEE methods fail to provide an unbiased estimate of the OR even as the sample size increases, whereas the conditional likelihood still yields unbiased estimates, especially with large sample sizes. The same pattern appears in the estimation of $\alpha$ and $\beta$, as shown in Table~\ref{table:3}: bias and high MSE persist for both GEE methods even with large sample sizes, whereas the SD shrinks as the sample size increases.
% \begin{figure}[t]
%     \centering
%     \includegraphics[scale=0.15]{mis_OR_est.png}
%     \caption{OR estimation under model misspecification. }
%     \label{fig:mis}
% \end{figure}

\begin{figure*}[t]
    \centering
    \includegraphics[height=5.3cm, width=16cm]{mis_OR_est_plot.png}
    \caption{OR estimation under model misspecification.}
    \label{fig:mis}
\end{figure*}

\input{table3.tex}

The simulation results indicate that all three methods yield unbiased estimators when the model is correctly specified. The GEE methods are more efficient than the conditional likelihood. As expected, the optimal GEE is consistently more efficient than the non-optimal GEE regardless of the sample size. On the other hand, for OR estimation, the conditional likelihood method is more robust to model misspecification, in that it yields unbiased estimators even when $p(R_y\mid X,R_x)$ is misspecified. When $X$ and $Y$ are strongly correlated, the GEE estimators are more efficient; conversely, when the correlation is weak, the conditional likelihood estimator is more efficient.

\subsection{Real data results}

We also applied our proposed methods to data from the KLIPS dataset, which includes information on monthly income for 2511 regular wage earners in 2005 and 2006. The combined monthly income for these two years has approximately 40\% missing data. Our objective was to investigate whether past income has a lasting effect on future income. We defined $X$ as the logarithm of monthly income in 2005 and $Y$ as the logarithm of monthly income in 2006. Based on empirical data distributions, we assumed that $X$, $Y$, and $X \mid Y$ are normally distributed. Specifically, we modeled $X \mid Y$ as $\N(\alpha + \beta Y, \sigma^2)$, where $\sigma^2$ was empirically estimated.

Using our nonparametric identification results, we were able to determine $\alpha$ and $\beta$ without making any additional assumptions. For estimating these parameters, we employed generalized estimating equations (GEEs). Additionally, we used all three methods to estimate $\log(OR)$, where OR represents the odds ratio between the income of the two years. The parameter estimates obtained are summarized in Table~\ref{table:continuous}.

\input{table_realdata_continuous}

The findings presented above indicate a significant and persistent effect of income. Specifically, high income in the past is strongly predictive of high income in the future, and conversely, low income in the past is predictive of low income in the future. These results also confirm that the optimal GEE approach outperforms the non-optimal GEE in terms of efficiency when dealing with continuous variable distributions. 


%##############################################
\newpage
\bibliography{references}

\end{document}
