% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%############################################
% packages added by authors 

\usepackage{xr}
% \externaldocument{}

% packages for tables
\usepackage{array}
\usepackage{caption}
\usepackage{graphicx}
\usepackage{siunitx}
\usepackage[normalem]{ulem}
\usepackage{colortbl}
\usepackage{multirow}
\usepackage{hhline}
\usepackage{calc}
\usepackage{tabularx}
\usepackage{threeparttable}
\usepackage{wrapfig}
\usepackage{adjustbox}
\usepackage{hyperref}


\usepackage{import}
\usepackage{amsmath, amssymb, amsfonts}
\usepackage{xcolor}
\usepackage{color}
%\usepackage{tikz}
\usetikzlibrary{tikzmark}
\usepackage{multirow}
\usepackage{bm}
\usepackage{graphicx}
\usepackage{enumerate}
\usepackage[makeroom]{cancel}
\usepackage{hyperref}
\usepackage{xcolor}

\usepackage{amsthm}
\newtheorem{corollary}{Corollary}
\newtheorem{example}{Example}
\newtheorem{proposition}{Proposition}
\newtheorem{lemma}{Lemma}
\newtheorem{theorem}{Theorem}
\newtheorem{remark}{Remark}
\newtheorem{claim}{Claim}
\newtheorem{definition}{Definition}
\newtheorem{condition}{Condition}

\newcommand{\red}{\textcolor{red}}
\def\ci{\perp\!\!\!\perp}
\newcommand{\E}{\mathbb{E}}
\newcommand{\N}{\mathbb{N}}
\DeclareMathOperator{\pa}{pa} 
\newcommand{\anna}{\textcolor{olive}} 

%############################################

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Sufficient Identification Conditions and Semiparametric Estimation under Missing Not at Random Mechanisms}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<anna.guo@emory.edu>?Subject=Your UAI 2023 paper}{Anna~Guo}{}}
\author[2]{Jiwei~Zhao}
\author[1]{Razieh~Nabi}
% Add affiliations after the authors
\affil[1]{%
    Dept. of Biostatistics and Bioinformatics\\
    Emory University\\
    Atlanta, Georgia, USA
}
\affil[2]{%
    Dept. of Biostatistics \& Medical Informatics\\
    University of Wisconsin\\
    Madison, Wisconsin, USA
}
  
\begin{document}
\maketitle

\begin{abstract}
  Conducting valid statistical analyses is challenging in the presence of missing-not-at-random (MNAR) data, where the missingness mechanism depends on the missing values themselves even after conditioning on the observed data. Here, we consider a MNAR model that generalizes several popular prior MNAR models in two ways: first, it is less restrictive in terms of the statistical independence assumptions imposed on the underlying joint data distribution, and second, it allows all variables in the observed sample to have missing values. This MNAR model corresponds to a so-called \textit{criss-cross} structure considered in the literature on graphical models of missing data, a structure that prevents nonparametric identification of the entire missing data model. Nonetheless, part of the complete-data distribution remains nonparametrically identifiable. By exploiting this fact and considering a rich class of exponential family distributions, we establish sufficient conditions for identification of the complete-data distribution as well as the entire missingness mechanism. We then propose methods for testing the independence restrictions encoded in such models using the odds ratio as our parameter of interest. We adopt two semiparametric approaches for estimating the odds ratio parameter and establish the corresponding asymptotic theory: one involves maximizing a conditional likelihood with order statistics and the other uses estimating equations. The utility of our methods is illustrated via simulation studies. 
\end{abstract}

%##############################################
\section{Introduction}
\label{sec:intro}
%############################################## 

% Missingness hierarchy 
Conducting valid statistical analyses is challenging in the presence of missing data as the observed data may not be representative of the population of interest. According to the terminology of \cite{rubin1976inference}, a missingness mechanism is called missing-at-random (MAR) if it only depends on the observed data values, and it is called missing-not-at-random (MNAR) if it is dependent on the missing values themselves even conditioned on the observed data. Under a MAR model, identification of a target parameter as a function of the observed data is a relatively straightforward task, and estimation strategies are well-studied, ranging from likelihood-based methods such as expectation maximization \citep{dempster77maximum, little2002statistical}, to multiple imputation \citep{rubin87multiple}, inverse probability weighting \citep{robins1994estimation, li2013weighting}, and semiparametric methods closely related to the estimation of causal parameters \citep{robins1995analysis, tsiatis06missing}. On the other hand, MNAR mechanisms are substantially more complicated and under-studied, yet they are construed as the most prevalent form of missingness mechanisms in practice.  

 % MNAR non-identification 
In the presence of MNAR mechanisms, it is generally not possible to express the underlying \textit{complete-data} distribution as a function of the \textit{observed data} distribution without imposing additional assumptions. A lack of identification implies that there exist at least two models that differ in their respective complete-data distributions but share the same observed data distribution. A well-known example of a non-identified MNAR mechanism is the non-ignorable non-response model in survey sampling, where the response variable directly causes its own missingness, often referred to as a \textit{self-censoring} missingness mechanism. Other MNAR models include scenarios where the missingness of a variable depends on other variables that could themselves be missing. 

% Different approaches to handling missing data
Common approaches for making progress in non-identified MNAR models include imposing (semi)parametric assumptions, often untestable, that yield identification \citep{wu1988estimation, little2002statistical, zhao2015semiparametric}. For instance, in order to deal with the self-censoring mechanism involving a univariate response variable, several authors have considered the presence of a fully observed variable along with certain assumptions to identify and estimate distributional quantities involving the response variable -- e.g., \cite{wang2014instrumental} considers a \textit{shadow variable}\footnote{The authors refer to $X$ as the instrumental variable. However, following the work of \cite{miao2015identification}, we believe it is more appropriate to label $X$ as the shadow variable.} that is not a determinant of the underlying missingness, and \cite{sun2018semiparametric} considers an \textit{instrumental variable} that is associated with the missingness indicator of the response variable but independent of the response variable itself (marginally or conditioned on other fully observed variables). Other approaches include conducting sensitivity analysis \citep{rotnitzky1998semiparametric, scharfstein2003generalized, scharfstein2021semiparametric} or obtaining nonparametric bounds for parameters of interest \citep{horowitz2000nonparametric}. A recent line of work considers missing data models with a collection of independence restrictions among variables and corresponding missingness indicators that can be represented by directed acyclic graphs (DAGs); see \cite{nabi2022causal} for a detailed discussion. 

% The criss-cross MNAR model 
In this work, we consider a MNAR model that corresponds to a graphical characterization, the \textit{criss-cross} structure discussed in \cite{nabi2022testability}, where missingness of the response variable depends on the missingness of covariates and vice versa. This kind of missingness is common in cross-sectional and survey studies. Unlike most prior work, all variables in our model can be subject to missingness, i.e., our results do not rely on the presence of fully observed variables. Furthermore, the MNAR model under study generalizes several prior popular missing data models, including the permutation model \citep{robins97non-a}, the block-conditional MAR model \citep{zhou10block}, and the block-parallel model \citep{mohan13missing}, making it less restrictive in terms of statistical independence assumptions imposed on the underlying joint data distribution. 

% Our contributions 
The criss-cross MNAR structure prevents nonparametric identification of the entire missing data model. We show, however, that part of the complete-data distribution remains nonparametrically identifiable. We consider a quantitative measure, based on the rank of a \textit{Jacobian matrix}, to examine whether the information in the identifiable part is sufficient for recovering the entire complete-data law, a.k.a. the \textit{target law}, as a function of only partially observed data. We explore these sufficient conditions extensively in the rich class of exponential family distributions. We further extend these results to higher dimensional parameter spaces and explore identifiability conditions for the entire missingness selection model, studied under \textit{full law} identification. Aside from identification arguments, we explore procedures for testing independence relations among variables that are themselves missing in terms of an \textit{odds ratio} parameterization of the complete-data law, as well as other model assumptions. We propose semiparametric estimating equations and conditional likelihoods based on order statistics to compute parameters that can be used for model selection purposes. Asymptotic properties of these two approaches are studied. We show empirically that the estimating equation approach is more efficient than the conditional likelihood approach, while the latter is more robust to misspecification of the missingness selection model. 

% Organization of the paper 
The paper is organized as follows. We describe our notation and give a brief overview of missing data DAGs in Section~\ref{sec:prelim}, and formally define the MNAR model under study in Section~\ref{sec:model}. We first consider univariate settings and discuss our (non)parametric identification and semiparametric estimation results in Sections~\ref{sec:ID} and \ref{sec:estimation}, respectively, followed by generalizations to multidimensional covariate spaces in Section~\ref{sec:high-dim}. The simulation results are provided in Section~\ref{sec:sims}, followed by conclusions in Section~\ref{sec:conc}. All proofs are deferred to the supplementary materials. 

%##############################################
\section{Preliminaries}  
\label{sec:prelim}
%##############################################

Let $Z$ be a vector of random variables with finite support and probability density $p(Z).$ Given a finite sample, variables in $Z$, indexed here by $k$, may have missing instances. 
%Assume $Z = (Z_1, \ldots, Z_K)^T$ contains $K$ variables, and 
Let $R$ be the corresponding vector of binary missingness indicators where $R_{k} = 1$ if $Z_{k}$ is observed and $R_{k}=0$ if $Z_{k}$ is missing. We only observe a coarsened version of $Z$ in our sample, which we denote by $Z^*.$ Each $Z^*_k \in Z^*$ is deterministically defined as follows: $Z^*_k = Z_k$ if $R_k = 1$ and $Z^*_k = \text{``?''}$ if $R_k=0.$ The vector $Z$ has a counterfactual connotation as it corresponds to variables ``had they been fully observed'' or ``had $R$ been set to one'' (no missingness) -- see \cite{bhattacharya19mid}. We use lowercase $z$ to denote the observed realization of $Z$.

Following the literature on graphical models of missing data, it is descriptive to use directed acyclic graphs (DAGs) to encode assumptions in a given missing data model. A DAG ${\cal G}(V)$ is a set of vertices $V$ connected by directed edges such that there are no directed cycles. The statistical model of a DAG ${\cal G}(V)$ is a set of distributions that factorize as $p(V) = {\prod}_{V_i \in V}p(V_i \mid \pa_{\cal G}(V_i) )$, where $\pa_{\cal G}(V_i)$ denotes parents (direct causes) of $V_i$ in ${\cal G}(V)$; when the vertex set is clear from the context, ${\cal G}(V)$ is abbreviated as ${\cal G}$. Using the conventions in \cite{mohan13missing, bhattacharya19mid}, a missing data DAG (or mDAG for short) is defined over the set of vertices that correspond to variables in $V = \{Z, R, Z^*\}.$ In addition to acyclicity, a mDAG restricts the presence of certain edges: each $Z^*_k \in Z^*$ has only two parents ($Z_k$ and $R_k$), $Z^*_k$ does not have any outgoing edges and variables in $R$ cannot point to variables in $Z$. As an example, Fig.~\ref{fig:nonignor} illustrates the self-censoring mechanism in (a), the shadow variable setup in (b), and the instrumental variable approach in (c). Here, $Y$ is the non-response variable, and $X, W$ are fully observed variables. Deterministic edges are drawn in gray in all mDAGs. 

\begin{figure}[t] 
	\begin{center}
		\scalebox{0.7}{
			\begin{tikzpicture}[>=stealth, node distance=1.5cm]
				\tikzstyle{format} = [thick, minimum size=1.0mm, inner sep=2pt]
				\tikzstyle{square} = [draw, thick, minimum size=4.5mm, inner sep=2pt]
					
				\begin{scope}[xshift=0cm]
					\path[->, thick]
					node[] (y) {$Y$}
					node[below of=y] (r) {$R_y$}
					node[right of=r] (ys) {$Y^*$}
					
					(y) edge[blue] (r) 
					
					(y) edge[gray] (ys)
					(r) edge[gray] (ys)
					
					node[format, below of=r, xshift=0.75cm, yshift=0.75cm] (a) {(a)} ;
				\end{scope}
				
				\begin{scope}[xshift=2.75cm]
					\path[->, thick]
					node[] (x) {$X$}
					node[right of=x, xshift=0.25cm] (y) {$Y$}
					node[below of=x, xshift=0.cm] (w) {$W$}
					node[below of=y] (r) {$R_y$}
					node[right of=r] (ys) {$Y^*$}
					
					(x) edge[blue] (y) 
					(y) edge[blue] (r) 
					
					(y) edge[gray] (ys)
					(r) edge[gray] (ys)
					
					(w) edge[blue, dashed] (x)
					(w) edge[blue, dashed] (y)
					(w) edge[blue, dashed] (r)
					
					node[format, below of=r, xshift=0.0cm, yshift=0.75cm] (b) {(b)} ;
				\end{scope}
				
				\begin{scope}[xshift=7.25cm]
					\path[->, thick]
					node[] (x) {$X$}
					node[right of=x, xshift=0.25cm] (y) {$Y$}
					node[below of=x, xshift=0.cm] (w) {$W$}
					node[below of=y] (r) {$R_y$}
					node[right of=r] (ys) {$Y^*$}
					
					(x) edge[blue] (r) 
					(y) edge[blue] (r) 
					
					(y) edge[gray] (ys)
					(r) edge[gray] (ys)
					
					(w) edge[blue, dashed] (x)
					(w) edge[blue, dashed] (y)
					(w) edge[blue, dashed] (r)
					
					node[format, below of=r, xshift=0.0cm, yshift=0.75cm] (c) {(c)} ;
				\end{scope}
				
			\end{tikzpicture}
		}
		\caption{(a) Self-censoring MNAR mechanism; (b) Shadow variable setup considered in \cite{wang2014instrumental}; (c)  Instrumental variable setup considered in \cite{sun2018semiparametric}. A dashed edge implies potential dependence between the endpoint variables.}
		\label{fig:nonignor}
	\end{center}
\end{figure}

A missing data model associated with a mDAG ${\cal G}$ is the set of distributions $p(Z, R, Z^*)$ that factorize as 
\begin{align}
	\prod_{V_i \in Z} \ p(V_i \mid \pa_{\cal G} (V_i))  \times \prod_{R_k \in R} \ p(R_k \mid \pa_{\cal G}(R_k)).  %\prod_{Z^*_k \in Z^*}  \ p(Z^*_k \mid Z_k, R_k). 
    \label{eq:mDAG_fact}
\end{align}% 
We exclude the factors $p(Z^*_k \mid Z_k, R_k)$ which are deterministically defined.%, i.e., $Z^*_k = Z_k$ when $R_k = 1$, and $Z^*_k = \ ``?"$ otherwise. 
Similar to a DAG, a mDAG encodes a set of ordinary conditional independence restrictions which can be easily read via Markov properties and d-separation rules: given disjoint subsets of vertices $A, B, C,$ the DAG global Markov property states that if $A \perp_\text{d-sep} B \mid C$ in ${\cal G}(V)$, then $A \perp B \mid C$ in $p(V)$ \citep{pearl09causality}. We refer to $p(Z)$ as  the \emph{target law}, $p(R  \mid Z)$  as  the \emph{missingness mechanism}, and $p(R, Z^*)$ as the \emph{observed data law}. The product of target law and missingness mechanism, i.e., $p(Z, R)$, is  referred to as the \emph{full law}. Note that in addition to partially missing variables, we may also have variables that are fully observed. However, in this work, we allow for the possibility of having all variables be partially missing in our model. 

Aside from the mDAG factorization, an \textit{odds ratio} parameterization of the full law (or parts of it) can be useful in handling missing data models, as illustrated by our methods in later sections; for further uses of this parameterization, see \cite{nabi20completeness, malinsky2021semiparametric}. Given disjoint sets of variables $A,B,C$ and reference values $A=a_0, B=b_0,$ the odds ratio parameterization of $p(A=a, B=b \mid C)$, given by \cite{chen07semiparametric}, is as follows: 
\begin{align}
	%p(A = a, B = b \mid C) = 
    \frac{1}{Z(C)} \times p(a \mid b_0, C) \times p(b \mid a_0, C) \times \text{OR}(a, b \mid C),
	\label{eq:odds_ratio}
\end{align}	
where $\text{OR}(A = a, B = b \mid C)$ is defined as 
\begin{align*}
	&\frac{p(A = a \mid B = b, C)}{p(A = a_0 \mid B = b, C)} \times \frac{p(A = a_0 \mid B = b_0, C)}{p(A = a \mid B = b_0, C)},
\end{align*}%
and $Z(C) = \sum_{a,b} \ p(a \mid b_0, C) \times p(b \mid a_0, C) \times \text{OR}(a, b \mid C)$ is the normalizing term. 
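
As a quick check of display~(\ref{eq:odds_ratio}), the following sketch (our own illustration, assuming that all conditional probabilities involved are strictly positive) verifies that the parameterization recovers $p(a, b \mid C)$ and makes the normalizing term explicit. Substituting the definition of the odds ratio gives 
\begin{align*}
    & p(a \mid b_0, C) \times p(b \mid a_0, C) \times \text{OR}(a, b \mid C) \\
    & \quad = p(b \mid a_0, C) \times \frac{p(a \mid b, C)}{p(a_0 \mid b, C)} \times p(a_0 \mid b_0, C) \\
    & \quad = p(a, b \mid C) \times \frac{p(a_0 \mid b_0, C)}{p(a_0 \mid C)},
\end{align*}
where the last step uses $p(b \mid a_0, C) / p(a_0 \mid b, C) = p(b \mid C) / p(a_0 \mid C)$. Summing over $a$ and $b$ then yields $Z(C) = p(a_0 \mid b_0, C) / p(a_0 \mid C)$, which does not depend on $(a, b)$. 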


%##############################################
\section{The MNAR missing data model}  
\label{sec:model}
%##############################################

We partition $Z$ into two disjoint sets $X$ and $Y$, where the missingness of $X$ and that of $Y$ depend on each other as follows: 
\begin{align}
    (i) \ R_x \perp X \mid Y \qquad 
    (ii) \ R_y \perp Y \mid X, R_x
    \label{eq:criss_cross_assump}
\end{align}
The above set of assumptions can be represented via the mDAG shown in Fig.~\ref{fig:cross_model}(a), which corresponds to the so-called \textit{criss-cross} structure discussed in \cite{nabi2022testability}. This missing data model is a supermodel of several popular models in the literature. It relaxes the independence restrictions among variables imposed by models such as the \textit{permutation model} \citep{robins97non-a} shown in Fig.~\ref{fig:cross_model}(b), \textit{block-parallel model}  \citep{mohan13missing} shown in Fig.~\ref{fig:cross_model}(c), and \textit{block-conditional MAR model} \citep{zhou10block} shown in Fig.~\ref{fig:cross_model}(d). For instance, the permutation model implies the following set of independence restrictions: \textit{(i)} $R_x \perp X \mid Y$ and \textit{(ii)} $R_y \perp Y, X \mid X^*, R_x$. 
% \begin{align*}
%     (i) \ R_x \perp X \mid Y \qquad 
%     (ii) \ R_y \perp Y, X \mid X^*, R_x
% \end{align*}
The independence restriction in \textit{(ii)} implies $R_y \perp Y \mid X, R_x = 1$  and $R_y \perp Y, X \mid R_x = 0$. These assumptions are a superset of the assumptions made in the criss-cross model, as defined in (\ref{eq:criss_cross_assump}). For more detailed comparisons across the aforementioned models, see \cite{nabi2022causal}.  
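
To make the model concrete, the mDAG factorization in display~(\ref{eq:mDAG_fact}) can be written out for the criss-cross structure; this is a short illustration rather than a new result. Reading the parent sets off Fig.~\ref{fig:cross_model}(a), namely $\pa_{\cal G}(Y) = \{X\}$, $\pa_{\cal G}(R_x) = \{Y\}$, and $\pa_{\cal G}(R_y) = \{X, R_x\}$, the full law factorizes as 
\begin{align*}
    p(X, Y, R_x, R_y) = p(X) \times p(Y \mid X) \times p(R_x \mid Y) \times p(R_y \mid X, R_x). 
\end{align*}
The factor $p(R_x \mid Y)$ does not involve $X$ and the factor $p(R_y \mid X, R_x)$ does not involve $Y$, which are exactly restrictions \textit{(i)} and \textit{(ii)} in display~(\ref{eq:criss_cross_assump}). 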

\begin{figure}[t] 
	\begin{center}
		\scalebox{0.7}{
			\begin{tikzpicture}[>=stealth, node distance=1.2cm]
				\tikzstyle{format} = [thick, circle, minimum size=1.0mm, inner sep=2pt]
				\tikzstyle{square} = [draw, thick, minimum size=4.5mm, inner sep=2pt]
				
				
				\begin{scope}[xshift=0cm]
					\path[->, thick]
					node[] (x) {$X$}
					node[right of=x, xshift=0.5cm] (y) {$Y$}
					node[below of=y] (r) {$R_y$}
					node[below of=x] (rx) {$R_x$}
					node[below of=rx, yshift=0.cm] (xs) {$X^*$}
					node[below of=r, yshift=0.cm] (ys) {$Y^*$}
					
					(x) edge[blue] (r) 
					(rx) edge[blue] (r) 
					(y) edge[blue] (rx) 
					
					(x) edge[blue] (y) 
					
					(y) edge[gray, bend left] (ys)
					(r) edge[gray] (ys)
					
					(x) edge[gray, bend right] (xs)
					(rx) edge[gray] (xs)
					
					node[format, below of=xs, xshift=1.cm, yshift=0.55cm] (a) {(a)} ;
				\end{scope}
				
				\begin{scope}[xshift=3cm, yshift=0cm]
					\path[->, thick]
					node[] (x) {$X$}
					node[right of=x, xshift=0.5cm] (y) {$Y$}
					node[below of=y] (r) {$R_y$}
					node[below of=x] (rx) {$R_x$}
					node[below of=rx, yshift=0.cm] (xs) {$X^*$}
					node[below of=r, yshift=0.cm] (ys) {$Y^*$}
					
					(xs) edge[blue] (r) 
					(rx) edge[blue] (r) 
					(y) edge[blue] (rx) 
					
					(x) edge[blue] (y) 
					
					(y) edge[gray, bend left] (ys)
					(r) edge[gray] (ys)
					
					(x) edge[gray, bend right] (xs)
					(rx) edge[gray] (xs)
					
					node[format, below of=xs, xshift=1.cm, yshift=0.55cm] (b) {(b)} ;
				\end{scope}
				
				\begin{scope}[xshift=6cm, yshift=0cm]
					\path[->, thick]
					node[] (x) {$X$}
					node[right of=x, xshift=0.5cm] (y) {$Y$}
					node[below of=y] (r) {$R_y$}
					node[below of=x] (rx) {$R_x$}
					node[below of=rx, yshift=0.cm] (xs) {$X^*$}
					node[below of=r, yshift=0.cm] (ys) {$Y^*$}
					
					(x) edge[blue, -] (y) 
					(x) edge[blue] (r) 
					(y) edge[blue] (rx) 
					
					(x) edge[blue] (y) 
					
					(y) edge[gray, bend left] (ys)
					(r) edge[gray] (ys)
					
					(x) edge[gray, bend right] (xs)
					(rx) edge[gray] (xs)
					
					node[format, below of=xs, xshift=1.cm, yshift=0.55cm] (c) {(c)} ;
				\end{scope}
				
				\begin{scope}[xshift=9cm, yshift=0cm]
					\path[->, thick]
					node[] (x) {$X$}
					node[right of=x, xshift=0.5cm] (y) {$Y$}
					node[below of=y] (r) {$R_y$}
					node[below of=x] (rx) {$R_x$}
					node[below of=rx, yshift=0.cm] (xs) {$X^*$}
					node[below of=r, yshift=0.cm] (ys) {$Y^*$}
					
					(x) edge[blue] (r) 
					(rx) edge[blue] (r) 
					
					(x) edge[blue] (y) 
					
					(y) edge[gray, bend left] (ys)
					(r) edge[gray] (ys)
					
					(x) edge[gray, bend right] (xs)
					(rx) edge[gray] (xs)
					
					node[format, below of=xs, xshift=1.cm, yshift=0.55cm] (d) {(d)} ;
				\end{scope}
				
			\end{tikzpicture}
		}
		\caption{(a) Criss-cross MNAR model; (b) Permutation model \citep{robins97non-a}; (c) Block-parallel model \citep{mohan13missing}; (d) Block-conditional MAR model \citep{zhou10block}.} 
		\label{fig:cross_model}
	\end{center}
\end{figure}

The importance of the criss-cross graphical characterization is that in the presence of such structure, the target law is not nonparametrically identifiable as a function of the observed data distribution \citep{nabi2022testability}, similar to the presence of self-censoring structure shown in Fig.~\ref{fig:nonignor}(a). See \cite{bhattacharya19mid} for sufficient conditions under which the target law is nonparametrically identifiable and \cite{nabi20completeness} for necessary and sufficient conditions under which the full law is nonparametrically identifiable, in a given mDAG. 

%##############################################
\section{Identification results}  
\label{sec:ID}
%##############################################

\subsection{Nonparametric identification}
\label{subsec:np-ID}

\cite{bhattacharya19mid} proved that the conditional density $p(R_y  \mid R_x = 0, X)$ is not nonparametrically identifiable in the criss-cross model. This directly implies that the full law is not nonparametrically identified as a function of the observed data law. \cite{nabi2022testability} further proved that the target law is not identified either, by providing a counterexample using binary variables for $X$ and $Y$. We verify the lack of nonparametric identification of the target law in Appendix~A, using continuous variables following normal distributions. 

The conditional distribution $p(X \mid Y)$ is, however, nonparametrically identified. This is because using the independence assumptions in display~(\ref{eq:criss_cross_assump}) and Bayes rule, we can write: 
\begin{align*}
    p(X \mid Y) = p(X \mid Y, R_x = 1) = \frac{p(X, Y, R_x=1)}{\int p(x, Y, R_x = 1) dx}, 
\end{align*}
where the joint distribution $p(X, Y, R_x = 1)$ equals: 
\begin{align*}
    %p(X, Y, R_x = 1) = 
    \frac{p(X, Y, R_x = 1, R_y = 1)}{p(R_y =1 \mid R_x = 1, X, Y)} = \frac{p(X, Y, R_x = 1, R_y = 1)}{p(R_y =1 \mid R_x = 1, X)}, 
\end{align*}
and thus it is identified, since both the numerator and the denominator $p(R_y = 1 \mid R_x = 1, X)$ involve only cases in which the relevant variables are observed. 
The probabilistic operation of taking the full law and dividing it by the conditional density $p(R_y \mid \pa_{\cal G}(R_y))$ (evaluated at $R=1$) corresponds to an intervention on $R_y$ that sets it to one. This provides an intuitive inverse probability weighting estimation strategy for parameters involving the conditional density of $X$ given $Y$. See Section~\ref{subsec:est_gmm} for a discussion on estimation and \cite{nabi2022causal} for more details on the interventional view of identification in graphical models of missing data.  
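
The inverse probability weighting idea can be illustrated with a short calculation. This is a sketch under the assumption that $\pi(X) \coloneqq p(R_y = 1 \mid R_x = 1, X)$ is bounded away from zero; the symbols $\pi$ and $h$ are introduced here only for illustration. For any integrable function $h(X, Y)$, 
\begin{align*}
    \E\left[ \frac{R_x R_y \, h(X, Y)}{\pi(X)} \right] 
    &= \E\left[ \frac{R_x \, h(X, Y)}{\pi(X)} \, \E[R_y \mid R_x = 1, X, Y] \right] \\
    &= \E\left[ R_x \, h(X, Y) \right],
\end{align*}
where the second equality uses restriction \textit{(ii)} in display~(\ref{eq:criss_cross_assump}), which gives $\E[R_y \mid R_x = 1, X, Y] = \pi(X)$. The left-hand side depends only on complete cases, whereas the right-hand side equals $\int\!\!\int h(x, y) \, p(x, y, R_x = 1) \, dx \, dy$. 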

We take advantage of the nonparametric identification of $p(X \mid Y)$ in two ways: first, by combining this knowledge with a class of exponential family distributions to provide sufficient conditions for the identification of the target and full laws (Section~\ref{subsec:par-ID}), and second, by exploiting the information in $p(X \mid Y)$ to estimate the odds ratio between $X$ and $Y$ as a means of testing independence, using either a conditional likelihood approach (Section~\ref{subsec:est_order}) or a generalized estimating equation (GEE) approach (Section~\ref{subsec:est_gmm}). 


\subsection{Parametric identification}
\label{subsec:par-ID}

We first consider identification of the target law $p(X, Y)$ when $X$ is assumed to be univariate. We generalize our identification results to multivariate $X$ in Section~\ref{sec:high-dim}. 

\subsubsection{Target law identification}
Assume $p(X)$ and $p(Y \mid X)$ belong to the exponential family. That is, 
% {\small 
\begin{align}
    &p(x) \sim \exp\left\{\frac{x\eta_x-b_x(\eta_x)}{\Phi_x}+c_x(x;\; \Phi_x)\right\}
     \label{eq:par_model} \\
    &p(y \mid x) \sim \exp\left\{\frac{y\eta-b(\eta)}{\Phi} \!+\! c(y;\; \Phi)\right\}, g(\mu(\eta)) \!=\! \alpha \!+\! \beta x, \nonumber 
\end{align}
% }%
where $b, c, b_x, c_x$ are known functions, $\Phi, \Phi_x > 0$ are dispersion parameters that may be known or unknown, and $g$ is a known one-to-one, three times continuously differentiable link function. Let $\mu(\eta) \coloneqq \E[Y \mid X]$ and $\mu_x(\eta_x) \coloneqq \E[X]$. From exponential family theory, we know that $b^{\prime}(\eta)=\mu(\eta)$ and $b^\prime_x(\eta_x) = \mu_x(\eta_x)$. If $\mu = g^{-1}$, then $g$ is called the canonical link function and is denoted by $g_c$. We outline sufficient conditions for identifying the parameter vector $\theta = (\alpha, \beta, \Phi, \eta_x, \Phi_x)$ in the following theorem. 
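
For concreteness, consider how a normal outcome model fits the form in display~(\ref{eq:par_model}); this worked example is meant only as an illustration of the notation. If $Y \mid X = x$ is normally distributed with mean $\mu$ and variance $\sigma^2$, then 
\begin{align*}
    \log p(y \mid x) = \frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2),
\end{align*}
so that $\eta = \mu$, $b(\eta) = \eta^2/2$, $\Phi = \sigma^2$, and $c(y;\; \Phi) = -y^2/(2\Phi) - \tfrac{1}{2}\log(2\pi\Phi)$. Here $b^{\prime}(\eta) = \eta = \mu(\eta)$, and the canonical link is the identity, which gives the linear model $\mu = \alpha + \beta x$. 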

%theorem 1
\begin{theorem}\label{thm:id-par}
Assume the model in display (\ref{eq:par_model}) and that $X$ takes $k+1$ distinct values $x_0, x_1,\ldots,x_k$. Let $\varphi=[g\circ \mu]^{-1}$ and $\zeta=b\circ\varphi$. For $i = 1, \ldots, k$, define the following functions: 
\begin{align*}
    \phi_i(\theta) &= \{\varphi(\alpha+x_i\beta)-\varphi(\alpha+x_0\beta)\}/{\Phi} \\ 
    \zeta_i(\theta) &= \frac{-\zeta(\alpha+x_i\beta)+\zeta(\alpha+x_0\beta)}{\Phi}+\frac{\eta_x(x_i-x_0)}{\Phi_x} \\ 
    & \quad +c_x(x_i;\;\Phi_x)-c_x(x_0;\;\Phi_x). 
\end{align*}
Define the Jacobian matrix $J={\partial(\Phi,\,Z)}/{\partial \theta }$, where, in this expression only, $\Phi=\{\phi_1,\dots, \phi_k\}$ and $Z=\{\zeta_1,\dots,\zeta_k\}$ denote the collections of functions defined above. Under regularity conditions (detailed in Appendix~B.1), the target law $p(X,Y)$ is identifiable if 
\begin{align*}
    (i) \ k\geq \dim(\theta), \quad
    (ii) \ \text{the Jacobian matrix } J \text{ has full rank.}
\end{align*}
\end{theorem}%
See Appendix~B.1 for a proof. To provide insight into Theorem~\ref{thm:id-par}, we emphasize the following observation: for any two distinct values of $X$, say $x_1$ and $x_0$, we have
\begin{align}\label{eq:key}
 \frac{p(x_1\mid y)}{p(x_0\mid y)}=\frac{p(y\mid x_1)}{p(y\mid x_0)}\times\frac{p(x_1)}{p(x_0)}.  
\end{align}
The left-hand side of equation (\ref{eq:key}) is identified. Therefore, as we vary the choice of distinct points of $X$, we obtain a series of equations that connect the identified conditional distribution $p(X\mid Y)$ to the target law. The rank of the Jacobian matrix $J$ provides a quantitative measure of the amount of information about the target law that is reflected in the conditional distribution $p(X\mid Y)$. When $J$ has full rank, we are able to obtain a unique solution for the target law, as a function of the observed data law, by solving a system of equations. When $J$ is rank deficient, removing some of its columns can restore full rank. Removing columns from $J$ amounts to assuming that the corresponding parameters are known, which yields sufficient conditions for identification. A similar argument is made by \cite{zhao2015semiparametric} in the non-ignorable non-response model (a.k.a. self-censoring) where $X$ is assumed to be fully observed and the parametric marginal density of $X$ is known. 
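
The connection between equation (\ref{eq:key}) and the functions in Theorem~\ref{thm:id-par} can be made explicit with a short calculation, sketched here under the model in display~(\ref{eq:par_model}). Write $\eta_i = \varphi(\alpha + x_i\beta)$ for the natural parameter of $p(y \mid x_i)$; the shorthand $\eta_i$ is introduced only for this illustration. Taking logarithms in (\ref{eq:key}) with $x_1$ replaced by a generic $x_i$, the terms $c(y;\; \Phi)$ and $b_x(\eta_x)$ cancel, and 
\begin{align*}
    \log \frac{p(x_i \mid y)}{p(x_0 \mid y)} 
    &= y \, \frac{\eta_i - \eta_0}{\Phi} - \frac{b(\eta_i) - b(\eta_0)}{\Phi} + \frac{\eta_x (x_i - x_0)}{\Phi_x} \\
    & \quad + c_x(x_i;\; \Phi_x) - c_x(x_0;\; \Phi_x). 
\end{align*}
The log ratio is therefore linear in $y$. Its slope and intercept play the roles of $\phi_i(\theta)$ and $\zeta_i(\theta)$ in Theorem~\ref{thm:id-par}, and they summarize what $p(X \mid Y)$ reveals about $\theta$ at the pair $(x_i, x_0)$. 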

We highlight that our identification framework is highly generalizable. As the dimensionality of the distribution increases, the core of the theorem remains unchanged. We delve into the generalization of Theorem \ref{thm:id-par} thoroughly in Section \ref{sec:high-dim}. In addition, while the proposed method is not limited to the exponential family distributions, our emphasis on this particular family allows for clear and concise identification characterizations. We will further demonstrate in Section \ref{subsec:full-law} that the full law identification is easier to establish within the exponential family. 

In Appendix~C, we use Theorem \ref{thm:id-par} to establish sufficient conditions for target law identification in widely used exponential family distributions, including normal, Bernoulli, exponential, and Poisson distributions with either canonical or inverse links. The second condition in Theorem~\ref{thm:id-par}, namely that the Jacobian matrix must be of full rank, has different implications for what specific knowledge of $\theta$ is required in advance. For instance, under normal distributions with an inverse link discussed in Appendix~C.2 or exponential distributions discussed in Appendix~C.7, the target law is identified without any further restrictions on the parameter vector $\theta$. In certain other distributions, however, the full-rank requirement of the Jacobian matrix implies that part of $\theta$ must be known a priori. For instance, in bivariate normal distributions with a canonical link discussed in Appendix~C.1, it is essential for identification arguments that at least the marginal mean of either $X$ or $Y$ is known. We emphasize that Theorem \ref{thm:id-par} only provides sufficient, not necessary, identification conditions. This means that the resulting conditions may be stronger than needed. 

\subsubsection{Full law identification}\label{subsec:full-law}

Under the conditions of Theorem~\ref{thm:id-par}, we can use the joint factorization of the full law in the criss-cross model to show that the conditional density of $R_x$ given $Y$, a.k.a. the propensity score of $R_x$, is identified: $p(X, Y, R_x=1, R_y=r_y) = p(X, Y) \times p(R_x=1 \mid Y) \times p(R_y=r_y \mid X, R_x=1)$, for $r_y = 0, 1$. To identify the full law, we need to determine whether the full law evaluated at $R_x = 0$, i.e., $p(X, Y, R_x=0, R_y=r_y)$, is identified, or equivalently whether the propensity score of $R_y$ evaluated at $R_x = 0$, i.e., $p(R_y = 1 \mid R_x=0, X)$, is identified. The question of full law identification translates into the nonexistence of any two distinct propensity scores for $R_y$, e.g.,  $p_1(R_y\mid X, R_x)\neq p_2(R_y\mid X, R_x)$, such that $\int  \left[ \ p_1(R_y=1 \mid R_x=0, x) \ - \ p_2(R_y=1 \mid R_x=0, x) \ \right]$ $p(x \mid Y) \ dx=0$. Let $h(X) = p_1(R_y=1 \mid R_x=0, X)-p_2(R_y=1 \mid R_x=0, X)$. That is, for the full law to be identified, $\E[h(X) \mid Y] = 0$ must imply $h(X) = 0$. This relates to the \textit{completeness} condition described below. 
\begin{condition}\label{cond:completeness}
For any function $h(X)$ with finite mean, $\E\{h(X)\mid Y\}=0$ implies $h(X)=0$ almost surely.
\end{condition}
\noindent With the completeness condition introduced, we can establish identification of the full law as follows. 
\begin{lemma}\label{lemma:full_law}
    Given the conditions in Theorem~\ref{thm:id-par} and  Condition~\ref{cond:completeness}, the full law $p(X,Y,R_x,R_y)$ is identified. 
\end{lemma}
See Appendix~B.3 for a proof. Identification under a completeness condition is common in prior work \citep{newey2003instrumental,miao2015identification,zhao2022versatile}. As a special case, full law identification can be established from the completeness property of exponential family distributions. More specifically, Condition \ref{cond:completeness} is guaranteed to hold if $p(X \mid Y)$ takes the following form:
\begin{align*}
    p(X \mid Y)=s\left(X\right) t(Y)\exp\left[\mu(Y)^{T}\tau\left(X\right)\right], 
\end{align*}
where $s\left(X\right)>0$, $\tau\left(X\right)$ is one-to-one in $X$, and the support of $\mu(Y)$ is an open set.
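
As a concrete illustration (a sketch under the assumed working model $X \mid Y \sim \N(\alpha+\beta Y, \sigma^2)$ with $\beta \neq 0$, chosen here for exposition), the conditional density can be arranged into the displayed form as
\begin{align*}
p(x \mid y) &= \underbrace{\exp\left(-\frac{x^2}{2\sigma^2}\right)}_{s(x)} \ \underbrace{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(\alpha+\beta y)^2}{2\sigma^2}\right\}}_{t(y)} \\
&\hspace{0.75cm} \times \exp\Big\{\underbrace{\frac{\alpha+\beta y}{\sigma^2}}_{\mu(y)} \ \underbrace{x}_{\tau(x)}\Big\}.
\end{align*}
Here $s(x)>0$ and $\tau(x)=x$ is one-to-one. Provided the support of $Y$ contains an open interval, so does the range of $\mu(Y)$ when $\beta \neq 0$.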

We show that in the specific examples discussed in Appendices~C.1, C.3, C.4, C.5, and C.6, $p(X\mid Y)$ lies in the exponential family, and therefore the full law is guaranteed to be identified (under the conditions outlined in Theorem~\ref{thm:id-par}). In the examples discussed in Appendices~C.2 and C.7, $p(X\mid Y)$ falls outside the exponential family, and therefore the full law may or may not be identified. 

%##############################################
\section{Estimation and inference}  
\label{sec:estimation}
%##############################################

Our primary target of inference is the odds ratio between $X$ and $Y$, denoted by $\text{OR}(X, Y)$ and defined in (\ref{eq:odds_ratio}). Since the conditional density $p(X \mid Y)$ is nonparametrically identified, this odds ratio is also nonparametrically identified. In order to estimate this parameter, we establish two semiparametric methods outlined below.
Hereafter, we use $n$ to denote the size of the completely observed samples and $N$ the size of all samples.

\subsection{Conditional likelihood with order statistics}\label{subsec:est_order}

We assume access to $n$ i.i.d. copies of observed random variables $(X, Y),$ making up the data set $(x_1, y_1), \ldots, (x_n, y_n)$. As our first approach to estimating $\text{OR}(X,Y)$, we adopt the conditional likelihood approach based on order statistics $\widetilde{x}=(x_{(1)}, \ldots, x_{(n)})$. Let ${\cal P}$ collect all $n!$ permutations of $\{1, \ldots, n\}$. For a given permutation $P$ in ${\cal P}$, let $P(i)$ denote the $i$-th element of $P$. Consider the conditional likelihood $\prod_{i=1}^n p(x_i \mid y_i, r_{x_i}=1, r_{y_i}=1, \widetilde{x})$, which equals 
\begin{align}
    &\frac{\prod_{i=1}^{n} \  p\left(x_i \mid  y_i,r_{x_i}=1, r_{y_i}=1\right)}{\sum\limits_{P \in {\cal P}} \ \prod_{i=1}^{n} \ p\left(x_{P(i)}\mid y_i, r_{x_i}=1, r_{y_i}=1\right)} \notag \\ 
    &\hspace{0.75cm} =\frac{\prod_{i=1}^{n} p\left(x_{i} \mid y_{i}\right)}{\sum\limits_{P \in {\cal P}} \ \prod_{i=1}^{n} \ p\left(x_{P(i)} \mid y_{i}\right)}.\label{eq:pseudo-lik}
\end{align}
% We are able to drop the $\widetilde{x}$ in the numerator given that $(x_1,\cdots,x_n)$ exactly implies $\widetilde{x}$.
The last equality holds since, by Bayes' rule, we have: 
\begin{align*}
    &p(x_i \mid y_i, r_{x_i}=1, r_{y_i}=1) \\
    &\hspace{0.75cm} = \frac{p(x_i \mid y_i) \ p(r_{x_i}=1, r_{y_i}=1 \mid x_i, y_i)}{p(r_{x_i}=1, r_{y_i}=1 \mid y_i)}, 
\end{align*}
and given the mDAG factorization we can write $p(r_{x_i}=1, r_{y_i}=1 \mid x_i, y_i)$ as $p(r_{x_i}=1 \mid y_i) \ p(r_{y_i}=1 \mid r_{x_i}=1, x_i).$ We can rewrite $p\left(x_{P(i)} \mid y_{i},r_{x_i}=1,r_{y_i}=1\right)$ in a similar way. The terms related to the missingness mechanism remain invariant under permutations and cancel out from the numerator and denominator.   
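
To spell out this cancellation (the shorthand $c(y)$ below is introduced only for this sketch), write $\pi(x) = p(r_y=1\mid r_x=1, x)$ and $c(y) = p(r_x=1\mid y)/p(r_x=1, r_y=1\mid y)$. Then, for any permutation $P \in {\cal P}$,
\begin{align*}
&\prod_{i=1}^{n} p\left(x_{P(i)} \mid y_i, r_{x_i}=1, r_{y_i}=1\right) \\
&\hspace{0.4cm} = \Big\{\prod_{i=1}^{n} p\left(x_{P(i)} \mid y_i\right)\Big\} \Big\{\prod_{i=1}^{n} \pi\left(x_{P(i)}\right)\Big\} \Big\{\prod_{i=1}^{n} c(y_i)\Big\},
\end{align*}
and $\prod_{i=1}^{n} \pi(x_{P(i)}) = \prod_{i=1}^{n} \pi(x_i)$ because a permutation only reorders the factors. The last two products are therefore common to the numerator and to every term in the denominator of (\ref{eq:pseudo-lik}).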

% As our first approach to estimating $\text{OR}(X,Y)$, we adopt the conditional likelihood approach based on order statistics $\widetilde{x}=(x_{(1)}, \ldots, x_{(n)})$. Let $x=(x_1,\cdots,x_n)$, $x_P$ be the permutation of $x$ and $x_{P_{(i)}}$ be the $i$-th element of $x_P$.
% Consider the following conditional likelihood $p(x\mid y_1\cdots,y_n, r_{x_1}=r_{y_1}=1,\cdots,r_{x_n}=r_{y_n}=1)$ which equals
% \begin{align}
% &\frac{p(x,\widetilde{x}\mid y_1\cdots,y_n, r_{x_1}=r_{y_1}=1,\cdots,r_{x_n}=r_{y_n}=1)}{p(\widetilde{x}\mid y_1\cdots,y_n, r_{x_1}=r_{y_1}=1,\cdots,r_{x_n}=r_{y_n}=1)}\notag\\
% &\overset{i.i.d}{=}\frac{\prod_{i=1}^{n} p\left(x_i\mid  y_i,r_{x_i}=r_{y_i}=1\right)}{\sum\limits_{x_P:\text{permutation of }x} \prod_{i=1}^{n} p\left(x_{P_{(i)}}\mid y_i, r_{x_i}=r_{y_i}=1\right)}\notag\\
% &=\frac{\prod_{i=1}^{n} p\left(x_{i} \mid y_{i}\right)}{\sum\limits_{x_P:\text{permutation of }x} \prod_{i=1}^{n} p\left(x_{P_{(i)}} \mid y_{i}\right)},\label{eq:pseudo-lik}
% \end{align}
% where the summation is over all possible permutations of $x$. We are able to remove the $\widetilde{x}$ in the numerator given that $x$ exactly implies $\widetilde{x}$. Equality~\ref{eq:pseudo-lik} is motivated by the fact that for any realizations of random variables $X$ and $Y$, denoted by $x$ and $y$, $p(x\mid y,r_x=1,r_y=1)$ equals
% \begin{align}
% \frac{p(r_x=1, r_y=1\mid y,x)}{p(r_x=1, r_y=1\mid y)} \ p(x\mid y),\notag
% \end{align}
% where $p(r_x=1, r_y=1\mid y,x)=p(r_y=1\mid r_x=1,x)p(r_x=1\mid y)$ is a product of a function of $y$-only and a function of $x$-only, and $p(r_x=1, r_y=1\mid y)$ is a function of $y$-only. Hence, these nuisance functions remain invariant under permutations and are subsequently cancelled out between the numerator and denominator.

By exploiting the information available in this conditional likelihood, it is possible to estimate some parameters, such as the odds ratio, in the model of $p(X\mid Y)$.
An appealing feature of this conditional likelihood is that, for each subject $i$, the corresponding terms $p(R_x=1, R_y=1\mid Y,X)$ and $p(R_x=1, R_y=1\mid Y)$ cancel in the above derivation. The approach is therefore robust to misspecification of the propensity scores, i.e., neither $p(R_y=1\mid R_x=1,X)$ nor $p(R_x=1\mid Y)$ needs to be correctly specified in order to estimate the odds ratio consistently.

Since the above conditional likelihood has computational complexity of order $n!$, in practice we approximate it with the following pairwise pseudo-likelihood
\begin{align*}
&\prod_{i<k} \frac{p\left(x_{i} \mid y_{i}\right) p\left(x_{k} \mid y_{k}\right)}{p\left(x_{i} \mid y_{i}\right) p\left(x_{k} \mid y_{k}\right)+p\left(x_{i} \mid y_{k}\right) p\left(x_{k} \mid y_{i}\right)} \\
&\hspace{0.75cm} =\prod\limits_{i<k} \frac{1}{1+Q\left(x_{i}, y_{i} ; x_{k}, y_{k}\right)},
\end{align*}%
where $Q\left(x_{i}, y_{i} ; x_{k}, y_{k}\right)$ is the inverse of the odds ratio (OR) and equals
$$\{p\left(x_{i} \mid y_{k}\right) p\left(x_{k} \mid y_{i}\right)\}/\{p\left(x_{i} \mid y_{i}\right) p\left(x_{k} \mid y_{k}\right)\}.$$
Therefore, by analyzing the completely observed subjects from the biased sample $p(X\mid Y,R_x=1,R_y=1)$, 
we are able to estimate the odds ratio $\text{OR}$ between $X$ and $Y$.
This conditional likelihood approach was first proposed by \cite{kalbfleisch1978likelihood} for hypothesis testing and has since been used in a variety of statistical problems, including parameter estimation \citep{liang2000regression} and variable selection \citep{zhao2018penalized}; see \cite{chen2021semiparametric} for a more comprehensive exposition.

To illustrate the above pairwise pseudo-likelihood, we first consider the special case where $X\mid Y\sim \N(\alpha+\beta Y,\sigma^2)$. Then 
$$\text{OR}=\exp\left(\frac{\beta}{\sigma^2}(x_i-x_k)(y_i-y_k)\right)=\exp \left[-\frac{\beta}{\sigma^2}\left(w_j v_j\right)\right],$$
where $w_j=-\operatorname{sign}\left(y_i-y_k\right)$ and $v_j=(x_i-x_k)\left|y_i-y_k\right|$, and $j=1,\ldots,n(n-1)/2$ indexes the pairs $(i,k)$ with $i<k$. Since $Q$ is the inverse of OR, the logarithm of the above pairwise pseudo-likelihood can be written as
\[
-\sum_{j} \log \left\{1+\exp \left[\frac{\beta}{\sigma^2}\left(w_j v_j\right)\right]\right\}.
\]
Thus, one can obtain the estimate of the parameter $\frac{\beta}{\sigma^2}$, denoted as $\theta$ hereafter, by fitting a logistic regression with response $u_j$ and covariate $v_j$, without an intercept term, where
\[
u_j= \begin{cases}1 & \text { if }y_i-y_k>0 \\ 0 & \text { if } y_i-y_k<0.\end{cases}
\]
It is worth noting that the only unknown parameter in OR is the ratio $\beta/\sigma^2$. As a result, this estimation approach accommodates potential misspecification of $\alpha$, the intercept in the regression $\E(X\mid Y)$, and thus relies on a weaker, semiparametric assumption about the relationship between $X$ and $Y$.
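
The reduction to logistic regression can be checked directly; the following is a sketch of the algebra under the normal working model above. Expanding the square in $\log p(x\mid y) = -(x-\alpha-\beta y)^2/(2\sigma^2) - \log\sqrt{2\pi\sigma^2}$, all terms that involve only $x$ or only $y$ cancel in the ratio defining $Q$, leaving
\begin{align*}
\log Q\left(x_{i}, y_{i} ; x_{k}, y_{k}\right) &= \frac{\beta}{\sigma^2}\left(x_i y_k + x_k y_i - x_i y_i - x_k y_k\right) \\
&= -\frac{\beta}{\sigma^2}(x_i-x_k)(y_i-y_k),
\end{align*}
which is free of $\alpha$. Writing $v = (x_i-x_k)|y_i-y_k|$, a pair with $y_i>y_k$ contributes $-\log\{1+\exp(-\theta v)\} = \log \operatorname{expit}(\theta v)$ to the log pseudo-likelihood, and a pair with $y_i<y_k$ contributes $-\log\{1+\exp(\theta v)\} = \log\{1-\operatorname{expit}(\theta v)\}$. These are the Bernoulli log-likelihood contributions of a logistic model with success probability $\operatorname{expit}(\theta v)$ and no intercept.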

Let $\widetilde{\theta}$ denote the parameter estimate. Our result below demonstrates the asymptotic normality of $\widetilde{\theta}$.
\begin{theorem}
\label{thm:est-order}
  Denote $Q(x_i,y_i;x_k,y_k;\theta) \!=\! Q_{ik}(\theta)$ and $\zeta_{ik}(\theta) \!=\! \partial\log\{1+Q_{ik}(\theta)\}/\partial\theta$. Assume that $\E\|\zeta_{12}(\theta)\|^2<\infty$ for any $\theta$ in the parameter space. Then, 
  \begin{align*}
  \sqrt{N}(\widetilde\theta-\theta_0) \xrightarrow{d} \N(0, A^{-1}BA^{-1}),
  \end{align*}
  where $A \!=\! \E\left\{R_{x_1}R_{y_1}R_{x_2}R_{y_2}\partial\zeta_{12}(\theta_0)/\partial\theta\right\}$ and $B \!=\! 4\E\left\{R_{x_1}R_{y_1}R_{x_2}R_{y_2}R_{x_3}R_{y_3}\zeta_{12}(\theta_0)\zeta_{13}(\theta_0)\right\}$.
\end{theorem}
See Appendix~D.1 for a proof. The aforementioned pairwise pseudo-likelihood is favorable under a large sample size given its computational efficiency. However, the pairwise pseudo-likelihood estimator is generally inefficient. To improve efficiency, groupwise pseudo-likelihood can be adopted. Instead of picking two observations at a time, groupwise pseudo-likelihood uses more than two observations as a group. For example, with a group size of three, we will have
\begin{align*}
    L \propto \!\! \prod_{i<j<k} \frac{p(x_i\mid y_i) \ p(x_j\mid y_j) \ p(x_k\mid y_k)}{\sum_P p(x_{P(i)}\mid y_i) \ p(x_{P(j)}\mid y_j) \ p(x_{P(k)}\mid y_k)}
\end{align*}
where $P$ is the permutation of $(i,j,k)$.
Increasing the group size improves efficiency at the cost of computational time. The final choice of group size should be based on a trade-off between computational time and statistical efficiency. Computational techniques based on adaptive Monte Carlo approximation and the Metropolis algorithm for directly maximizing the conditional likelihood are also well established and can be found in Chapter 4 of \cite{chen2021semiparametric}.
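
As a rough guide to this trade-off (a back-of-the-envelope count, with $m$ denoting the group size, a symbol introduced here), a groupwise pseudo-likelihood built from all groups of size $m$ has $\binom{n}{m}$ factors, each with $m!$ terms in its denominator:
\begin{align*}
m=2:&\quad \binom{n}{2}=\frac{n(n-1)}{2} \text{ factors, } 2 \text{ terms each}, \\
m=3:&\quad \binom{n}{3}=\frac{n(n-1)(n-2)}{6} \text{ factors, } 6 \text{ terms each}.
\end{align*}
For instance, with $n=1000$ complete cases there are $499{,}500$ pairs but $166{,}167{,}000$ triples, whereas $m=n$ recovers the single factor with $n!$ terms in (\ref{eq:pseudo-lik}).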

%+++++++++++++++++++++++++

\subsection{Generalized estimating equations}\label{subsec:est_gmm} 

In the estimation approach presented in Section~\ref{subsec:est_order}, we need to specify the conditional density function $p(X\mid Y)$ either fully parametrically or semiparametrically. Alternatively, only a conditional moment of $p(X\mid Y)$ needs to be specified. For instance, assuming $\E(X\mid Y)=h(Y;\theta)$ with $h(\cdot)$ a known function and $\theta$ the unknown parameter of interest, we have the following estimating equation
\begin{align*}
	&\E \Big[  \frac{R_x \times R_y}{\pi(X)} \times f(Y) \times  (X - \E(X \mid Y))  \Big] = 0,
\end{align*}
for any arbitrary function $f(Y)$.
Hereafter, we denote $\pi(X)=p(R_y = 1 \mid R_x = 1, X)$.
Note that the model $\pi(X)$ does not involve any missing data, so any off-the-shelf statistical method can be applied to model $\pi(X)$.
To keep the focus on the proposed method, we do not discuss the estimation of $\pi(X)$ in detail here.
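
To see why this moment condition holds (a sketch using the factorization $p(R_x=1, R_y=1\mid X, Y) = p(R_x=1\mid Y)\,\pi(X)$ from Section~\ref{subsec:est_order}), apply iterated expectations twice:
\begin{align*}
&\E \Big[ \frac{R_x R_y}{\pi(X)} f(Y) \{X - \E(X \mid Y)\} \Big] \\
&\hspace{0.4cm} = \E \Big[ \frac{\E(R_x R_y \mid X, Y)}{\pi(X)} f(Y) \{X - \E(X \mid Y)\} \Big] \\
&\hspace{0.4cm} = \E \Big[ p(R_x=1\mid Y) f(Y) \, \E\{X - \E(X \mid Y) \mid Y\} \Big] = 0.
\end{align*}
The weight $1/\pi(X)$ removes the only part of the selection probability that depends on $X$. The remaining factor $p(R_x=1\mid Y)$ is a function of $Y$ alone and can be absorbed into $f(Y)$, so it does not enter the estimating equation.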

Thus, the estimator of the parameter $\theta$, denoted as $\widehat\theta$, can be obtained by solving the following empirical version of the estimating equation
\[
\frac1N\sum_{i=1}^N \frac{R_{x_i} \times R_{y_i}}{\pi(x_i)} \times f(y_i) \times  (x_i - h(y_i;\theta)) = 0.
\]
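
As an illustration of solving this equation (a sketch assuming the linear specification $h(y;\theta)=\alpha+\beta y$ with $\theta=(\alpha,\beta)^T$ and the choice $f(y)=(1,y)^T$; the weights $\omega_i$ are shorthand introduced here), let $\omega_i = R_{x_i}R_{y_i}/\pi(x_i)$. The estimating equation is then linear in $\theta$ and has the weighted least-squares solution
\begin{align*}
\widehat\theta = \Big\{\sum_{i=1}^N \omega_i f(y_i) f(y_i)^T\Big\}^{-1} \sum_{i=1}^N \omega_i f(y_i) x_i,
\end{align*}
provided the matrix in braces is invertible. Only complete cases contribute, since $\omega_i=0$ whenever $x_i$ or $y_i$ is missing.
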
In the following, we develop the asymptotic normality of the estimator $\widehat\theta$.
In particular, we also identify the optimal choice of $f(y)$, denoted by $f_{opt}(y)$, such that it achieves the best possible estimation efficiency among all choices of arbitrary function $f(y)$.
For simplicity, we denote $\Psi(X, Y, R_{x}, R_y ; \theta)=\frac{R_x \times R_y}{\pi(X)}\times f(Y) \times (X - h(Y;\theta))$.

\begin{theorem}
    \label{thm:est-gmm}
  Assume that $\E\|\Psi(X, Y, R_{x}, R_y ; \theta)\|^2<\infty$ for any $\theta$ in the parameter space. Then,
  \begin{itemize}
    \item[(a)] For any function $f(Y)$, we have
    \begin{align*}
        \sqrt{N}(\widehat\theta-\theta_0) \xrightarrow{d} \N(0, C^{-1}D(C^{-1})^T),
    \end{align*}
    where 
    \begin{align*}
        D &= \E\left\{\frac{R_x R_y}{\pi(X)^2}(X-h(Y;\theta))^2f(Y)f(Y)^T\right\}, \\
        C &= \E\left\{\! \frac{R_x R_y}{\pi(X)}a(Y)f(Y)^T\! \right\}, \text{ and } \\
        a(Y) &= \frac{\partial h(Y;\theta)}{\partial \theta} \! \bigg\rvert_{\theta=\theta_0}. 
    \end{align*}
    \item[(b)] The optimal choice of $f(Y)$ is
    \begin{align*}
        f_{opt}(Y)=\left[\E\left\{\frac{(X-h(Y;\theta))^{2}}{\pi(X)}\mid Y\right\}\right]^{-1} a(Y).
    \end{align*}
  \end{itemize}
\end{theorem}

See Appendix~D.2 for a proof.
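
To give some intuition for part (b) (a special case chosen here for illustration, not a setting used elsewhere in the paper), suppose $\pi(X) \equiv \pi_0$ is constant and $\text{Var}(X \mid Y) = \sigma^2$ does not depend on $Y$. Then, evaluated at $\theta_0$,
\begin{align*}
\E\left\{\frac{(X-h(Y;\theta_0))^{2}}{\pi(X)}\mid Y\right\} = \frac{\sigma^2}{\pi_0}, \quad f_{opt}(Y) = \frac{\pi_0}{\sigma^2}\, a(Y).
\end{align*}
Since rescaling $f$ by a nonzero constant leaves the solution of the estimating equation unchanged, $f_{opt}(Y) \propto a(Y)$; for $h(Y;\theta)=\alpha+\beta Y$ this gives $a(Y)=(1,Y)^T$. When $\pi(X)$ varies with $X$, the conditional expectation generally depends on $Y$, and the optimal choice reweights $a(Y)$ accordingly.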

%+++++++++++++++++++++++++

\subsection{Alternative estimation targets} 

In addition to the associational relation between $X$ and $Y$, one might be interested in testing additional model assumptions, e.g., whether the missingness of $X$ is indeed influenced by $Y$ or not. This can be easily set up by rewriting the propensity score of $R_x$ using a parameterization that encodes the odds ratio between $R_x$ and $Y$ as $p(R_x = 1 \mid y) = \{1 + \exp(\lambda + \eta(y))\}^{-1}$ where $\eta(y) \coloneqq \log(\text{OR}(R_x = 0, y))$ and $\lambda = \log[{p(R_x = 0 \mid y_0)}/{p(R_x = 1 \mid y_0)} ]$. Under the conditions of Theorem~\ref{thm:id-par}, $\eta(y)$ would be identified. Exploring detailed estimation strategies is left to future work. 
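
For concreteness (a worked instance using the logistic mechanism $p(R_x=1\mid Y)=\operatorname{expit}(-0.5+Y)$ from Section~\ref{sec:sims}, with the reference value $y_0=0$ chosen here for illustration), the parameterization gives
\begin{align*}
\frac{p(R_x=0\mid y)}{p(R_x=1\mid y)} = \exp\{\lambda+\eta(y)\} = \exp(0.5-y),
\end{align*}
so that $\lambda=0.5$ and $\eta(y)=-y$, with $\eta(y_0)=0$ by construction. Testing whether the missingness of $X$ depends on $Y$ then amounts to testing $\eta(y)=0$ for all $y$.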
%See Appendix~E.1 for comments on estimation. 

It is worth pointing out that under the conditions of Theorem~\ref{thm:id-par} and Condition~\ref{cond:completeness}, one can simply estimate the entire parameter vector of the full law, assuming the parametric forms of the propensity scores in the missingness mechanism are known. 
More flexible estimation approaches are possible if one is willing to make additional modeling assumptions. For instance, in addition to independence restrictions in display~(\ref{eq:criss_cross_assump}), we may assume $p(R_y = 1 \mid R_x, X)$ is not a function of $X$ when $R_x = 0$. This reduces the criss-cross model to the permutation MNAR model proposed by  \cite{robins97non-a}, where the full law is nonparametrically identified and the model is nonparametrically saturated, i.e., it imposes no restriction on the observed data law. In this case, we can proceed with nonparametric influence function based estimation, as discussed in Appendix~E. 


%##############################################
\section{Multidimensional \ $\mathbf{X}$}
\label{sec:high-dim}
%##############################################

We now discuss how our identification arguments can be easily generalized to higher dimensional vector spaces. For a reasonable representation of sampling distributions, we extend Theorem~\ref{thm:id-par} to instances where $X$ follows either a multivariate normal or a multinomial distribution. The corresponding identification theories under these two scenarios are provided in Appendix~B.2; generalization to other sampling distributions can be carried out in a similar fashion. 

As two special cases, we consider $X$ to follow a multivariate normal or a multinomial distribution while $Y\mid X$ follows a normal distribution under the canonical link. We assume that the first condition in Theorem~\ref{thm:id-par} is satisfied by having sufficient observations. 
\begin{example} ($X$ is multivariate normal and $Y\mid X$ is normal under canonical link) \ Suppose 
\begin{align*}
    X \sim \N_d(\mu, \Sigma), 
    \qquad 
    Y \mid X \sim \N(\alpha + X^{T} \beta,\Phi). 
\end{align*}%
Assume the nuisance parameter $\Sigma$ is known. The unknown vector of parameters is $\theta=(\alpha, \beta, \Phi, \mu)$. A sufficient condition for identification of the target law $p(X,Y)$ is for the intercept $\alpha$ to be known. According to Lemma~\ref{lemma:full_law}, the full law is also identified.
\end{example}

\begin{example} ($X$ is multinomial and $Y\mid X$ is normal under canonical link) \ Suppose 
\begin{align*}
&X \sim \operatorname{Multinomial}_d(n, p), \quad Y \mid X \sim \N(\alpha + X^T\beta,\Phi),
\end{align*}%
where $p=(p_1,\ldots, p_d)$ is the vector of event probabilities, and $n$ is the number of trials. We can write $p(x)=\exp [x^T\eta+c(x)]$ where $\eta = \left(\log p_1, \ldots, \log p_d\right), c(x)=\log \frac{n !}{x_{1} ! \ \ldots \ x_{d} !}.$ Assume $n$ is known. The unknown vector of parameters is $\theta=(\alpha, \beta, \Phi, \eta)$. A sufficient condition for identification of the target law $p(X,Y)$ is for the intercept $\alpha$ to be known, or knowing at least one element of $\eta$. According to Lemma~\ref{lemma:full_law}, the full law is also identified.
\end{example}


%##############################################
\section{Experiments}
\label{sec:exp}
%##############################################

\subsection{Simulations}
\label{sec:sims}

We now compare the finite sample behavior of the three proposed estimation strategies, namely (i) non-optimal GEE, (ii) optimal GEE, and (iii) conditional likelihood with order statistics.\footnote{R code can be found at \url{https://github.com/annaguo-bios/criss-cross-model-code}. } We conduct simulation studies with $(X,Y)$ following a bivariate normal distribution 
% {\small 
\begin{align*}
    \left(\begin{array}{l}Y \\ X\end{array}\right) \sim \quad \N\left[\left(\begin{array}{l}\mu_{1} \\ \mu_{2}\end{array}\right),\left(\begin{array}{cc}\sigma_{1}^{2} & \rho \sigma_{1} \sigma_{2} \\ \rho \sigma_{1} \sigma_{2} & \sigma_{2}^{2}\end{array}\right)\right],
\end{align*}
% }%
with $\mu_1=2,\,\mu_2=0.4,\,\sigma_1=1,\,\sigma_2=3,\,\rho=0.3$. The missingness mechanism is set as follows: 
\begin{align*}
    &p(R_x=1\mid Y)= \operatorname{expit}(-0.5+Y), \\
    &p(R_y=1 \mid X,R_x)= \operatorname{expit}(2-R_x+0.7X). 
\end{align*}

\begin{figure*}[!t]
    \centering
    % \includegraphics[scale=0.3]{OR_est_plot.png}
    \includegraphics[height=5.2cm, width=16cm]{OR_est_plot.png}
    \caption{Odds ratio estimation with varying sample size.}
    \label{fig:sample_size}
\end{figure*}

\input{table1.tex}

Under this setup, approximately $5\%$ of observations have both $X$ and $Y$ missing, $16\%$ of observations have $X$ missing and $Y$ observed, $25\%$ of observations have $X$ observed and $Y$ missing and $54\%$ of observations have both $X$ and $Y$ observed. Under the above setup, we have 
\begin{align*}
    &X\mid Y\sim \N(\alpha+\beta Y,\sigma^2)=\N(-1.4+0.9Y,8.19)\\
    &\text{OR}=\exp\big\{\frac{\beta}{\sigma^2}(x_i-x_k)(y_i-y_k)\big\}. 
\end{align*}
Assuming the nuisance parameters $\sigma_1,\sigma_2$ are known, we aim at estimating $\alpha$ and $\beta$ with non-optimal and  optimal GEE approaches. We further estimate the odds ratio when $(x_i-x_k)(y_i-y_k)=1$ using all three aforementioned methods. For non-optimal GEE, we choose $f(Y)=(1,Y)$. Note that for the optimal GEE, $f_{opt}(Y)$ might be a function of $\alpha,\beta$. In such scenarios, to construct $\widehat{f}_{opt}(Y)$, we utilize the estimated values $\widehat{\alpha}$ and $\widehat{\beta}$, obtained as medians over 100 simulation runs from the non-optimal GEE. 
All code necessary to reproduce our simulations is included with this submission. 
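
The values of $\alpha$, $\beta$, and $\sigma^2$ above follow from the standard conditional distribution of a bivariate normal; the arithmetic is reproduced here as a check:
\begin{align*}
\beta &= \rho\,\sigma_2/\sigma_1 = 0.3 \times 3/1 = 0.9, \\
\alpha &= \mu_2-\beta\mu_1 = 0.4-0.9\times 2 = -1.4, \\
\sigma^2 &= \sigma_2^2(1-\rho^2) = 9\times 0.91 = 8.19.
\end{align*}
Hence $\beta/\sigma^2 = 0.9/8.19 \approx 0.110$, and the true odds ratio at $(x_i-x_k)(y_i-y_k)=1$ is approximately $\exp(0.110)\approx 1.116$.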
%All simulations are conducted with R version 4.2.2.

We evaluate the performance of our three proposed estimators based on three main criteria: (i) finite sample behavior as sample size increases, (ii) bias behavior as a result of model misspecification for $p(R_y=1\mid X,R_x=1)$, and (iii) efficiency behavior as a result of varying the correlation between $X$ and $Y.$ For each case, we conduct 100 simulation runs. The empirical comparisons for the second and third criteria are deferred to Appendix~F due to page limits. 

%%%%%%%%%%% case 1
%\noindent\underline{\textbf{Sample size.}} 
Figure \ref{fig:sample_size} illustrates how the odds ratio estimation varies across a range of sample sizes from $500$ to $4000$. In order to ensure a fair comparison across the three methods, we assume that the intercept $\alpha$ of $\E(X\mid Y)$ is known for both non-optimal and optimal GEEs. The results demonstrate that all three methods yield unbiased estimates with reduced estimation uncertainty as the sample size increases. The conditional likelihood estimator is the least efficient, followed by the non-optimal GEE, especially when the sample size is small. Overall, all three methods provide comparable OR estimates with small bias, mean-squared error (MSE), and standard deviation (SD) when the sample size is large.

Apart from OR estimation, the GEE approach is also capable of estimating the intercept $\alpha$. Table \ref{table:1} compares the performance of the two GEEs for estimating $\alpha$ and $\beta$, in terms of bias, MSE, and SD. As expected, the results show that the optimal GEE method outperforms the non-optimal GEE method in terms of smaller SD, regardless of the sample size. Additionally, for small sample sizes, the optimal GEE exhibits smaller bias and MSE than the non-optimal GEE. For additional simulations, see Appendix~F. 

\subsection{Real data application}
\label{sec:real_data}

We implemented the proposed methods in a real-world scenario involving an obesity study, where the outcome variable is binary indicating obesity status. Specifically, we analyzed the Muscatine Coronary Risk Factor Study (MCRF) \citep{woolson1984analysis}, which collected data on obesity from 4856 school children in 1977, 1979, and 1981. Our objective was to estimate the obesity rates stratified by sex. For our analysis, we focused on the data from 1977 and 1981, where only $40\%$ of the records were complete for both years.

In our study, we defined $X$ as the indicator of obesity in 1977 and $Y$ as the indicator of obesity in 1981, with values 1 representing non-obesity and 2 representing obesity. We denoted the obesity rates as $\theta_{ij} \coloneqq p(X=i, Y=j)$, where $i$ and $j$ take values from the set $\{1, 2\}$. We accounted for the possibility of both $X$ and $Y$ having missing-not-at-random (MNAR) patterns. This considers the potential impact of extrapolative projections, such as how the likelihood of recording obesity indications at the current follow-up may be influenced by anticipated obesity (or its absence) in the future, or how inquiries about obesity history or forecasts at one time point can lead to additional inquiries at another time point. By accommodating MNAR mechanisms for both $X$ and $Y$, our model becomes more practical and applicable to real-world scenarios.

Based on our identification results, determining the complete joint distribution requires knowledge of one parameter from the set $\{\theta_{11}, \theta_{12}, \theta_{21}, \theta_{22}\}$. In our analysis, we assumed that $\theta_{11}$ is known and obtained this value from the complete-case records. To estimate the obesity rates and the log odds ratio between $X$ and $Y$ (equivalent to $\log(\text{OR}) = \log\{\theta_{11}\theta_{22}/(\theta_{12}\theta_{21})\}$), we employed generalized estimating equations (GEE). Both optimal GEE and non-optimal GEE approaches were utilized.

For the non-optimal GEE, we set $f(Y)$ as $(1, Y)$ based on Theorem~\ref{thm:est-gmm}. Additionally, we employed pseudo-likelihood estimation for estimating the $\log(OR)$ parameter. To assess the precision of the estimates, we employed bootstrap resampling with 1000 replicates. The estimation results are presented in Table~\ref{table:binary}.

\input{table_realdata_binary}

The estimates obtained from the optimal GEE approach closely align with those from the non-optimal GEE and are therefore not reported. The key findings reveal a substantial temporal correlation in obesity rates. Specifically, the non-obesity status exhibits a persistence rate of 0.723 for girls and 0.71 for boys between the two years. Both girls and boys have an equal probability of 0.118 of being obese in both years. Additionally, an intriguing observation for policy intervention purposes is that non-obese girls demonstrate a higher susceptibility to obesity compared to non-obese boys. This observation calls for further careful examination to facilitate effective strategies for obesity prevention. 

Furthermore, we also analyzed a real-world dataset related to income, where the outcome variable is continuous. The detailed findings are in Appendix~F. 

%##############################################
\section{Conclusions}  
\label{sec:conc}
%##############################################

In this paper, we considered an MNAR model which, like the self-censoring missingness mechanism, is an impediment to nonparametric identification of the complete-data distribution. We provided sufficient identification assumptions for both target and full laws by examining the rich class of exponential family distributions. We provided different semiparametric estimation strategies for computing parameters of the underlying joint distribution that can be used for pairwise independence tests and model selection purposes. An interesting avenue for future work is the exploration of a doubly-robust estimation theory that would enable the use of more flexible machine learning and statistical models in computing various model parameters.  

%##############################################
% \begin{contributions}

% \end{contributions}

\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    This work is partly supported by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UL1TR002378. 
\end{acknowledgements}


%##############################################
% References
% \clearpage
\bibliography{references}

\end{document}
