\documentclass[
  journal=proceedings,
  manuscript=article-type,
  year=2024
]{PMET_proc}

\usepackage{amsmath,bm}
\usepackage[nopatch]{microtype}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{subcaption}
\usepackage{orcidlink}
\newcommand{\orcidauthor}[2]{#1~\orcidlink{#2}}

%\usepackage{amsmath, amssymb}
%\usepackage[nopatch]{microtype}
%\usepackage{booktabs}
%\usepackage{orcidlink}
%\newcommand{\orcidauthor}[2]{#1~\orcidlink{#2}}


\title{Imputation Models for Special Subpopulations in Large-scale Survey Assessments}
%\author{Usama S. Ali}
\author{\orcidauthor{Usama S. Ali}{0000-0002-2660-6049}}
\affiliation{ETS Research Institute, ETS, Princeton, New Jersey, 08541, USA}
\email[Usama S. Ali]{uali@ets.org} %{usama.ali@edu.svu.edu.eg}
\alsoaffiliation{Department of Educational Psychology, South Valley University, Qena, 83253, Egypt}

\author{Frederic Robin}
\affiliation{ETS Research Institute, ETS, Princeton, New Jersey, 08541, USA}
% \alsoaffiliation{Joint first authors}

\addbibresource{references.bib}

\keywords{missing data, latent regression models, literacy-related non-response} %% First letter not capped

\begin{document}

\begin{abstract}
Nonresponse in large-scale survey assessments can arise from factors such as language barriers, reading difficulties, or disabilities. Excluding these subpopulations may introduce bias into survey results. This study develops an imputation method for literacy-related nonresponse cases in the international adult survey (PIAAC). These cases completed a special background questionnaire—the doorstep interview—but did not proceed to the main cognitive assessment. Using such limited data from respondents across selected countries with varying proportions of such cases, we compared and evaluated multiple imputation models to improve proficiency estimation. The proposed approach provides a practical solution for enhancing inclusivity in educational measurement.
\end{abstract}

\noindent In this research, we explored the enhancement of reporting on special subpopulations in large-scale survey assessments with a case study from an international adult survey.

\section{Introduction}
Large-scale survey assessments, such as the International Association for the Evaluation of Educational Achievement's (IEA) Trends in International Mathematics and Science Study (TIMSS) and Progress in International Reading Literacy Study (PIRLS), the Organisation for Economic Co-operation and Development's (OECD) Programme for the International Assessment of Adult Competencies (PIAAC) and Programme for International Student Assessment (PISA), as well as national assessments like the U.S. National Assessment of Educational Progress (NAEP), are critical tools for evaluating skills, knowledge, and competencies across diverse populations \autocite{martin2020, oecd2019pisa, nces2022}. As participation in these assessments expands globally, the growing linguistic and cultural heterogeneity of test-takers presents unprecedented measurement challenges. A particularly pressing concern emerges when linguistic minorities, immigrant populations, and examinees with limited assessment language proficiency face test items that are linguistically or culturally inaccessible—resulting either in non-response patterns that produce missing data or in attempted responses that yield invalid measurement \autocite{rubin1996, vonhippel2020}. 
%A key issue arises when —such as linguistic minorities, immigrants, or individuals with limited proficiency in the assessment language—not only cannot fully engage with but cannot respond at all to cognitive test items, resulting in missing data or if they do may provide invalid or misleading data. Traditional approaches to handling these missing responses often exclude or misrepresent these groups, undermining the validity and equity of assessment outcomes \autocite{rubin1996, vonhippel2020}.
The challenge of accurately assessing the proficiency of such special subpopulations potentially compromises both the validity of cross-population comparisons and the equity of assessment outcomes. This paper addresses this critical issue in national and international assessment contexts. In the following sections, we introduce the methodology for estimating and reporting proficiencies in large-scale survey assessments, followed by a case study of a special subpopulation in an international survey of adult skills, in which different models, referred to in this manuscript as ``imputation models,'' were proposed and compared to estimate the proficiencies of individuals with a language barrier. We then discuss the findings and present our conclusions.

\subsection{Plausible Values Methodology}

Most modern large-scale survey assessments employ plausible values (PVs) methodology to estimate respondent proficiency while accounting for measurement error and missing data \autocite{mislevy1991, vondavier2009}. This methodology can be summarized as a three-step process that combines:
\begin{enumerate}
    \item Item response theory (IRT) calibration of cognitive responses to estimate item parameters. Item parameters are estimated for each cognitive domain separately through unidimensional IRT models. Among these IRT models is the two-parameter logistic (2PL) model, where the probability of a correct response ($X_j=1$) to item $j$, given proficiency $\theta$, discrimination $\alpha_j$, and difficulty $\beta_j$, is given by
    \begin{equation}
    p(X_j=1|\theta) = \frac{\exp(\alpha_j(\theta - \beta_j))}{1+\exp(\alpha_j(\theta - \beta_j))}.
    \end{equation}
    
    \item A latent regression model that incorporates both responses from cognitive instruments and contextual variables from background questionnaires. This population-specific multivariate latent regression gives an expression for respondents' proficiency distributions on the multidimensional scales conditional on covariates (i.e., contextual information, $\boldsymbol{y}$) in addition to the cognitive item responses ($\boldsymbol{x}$). Based on Bayes' theorem, the posterior distribution of skills for respondent $\nu$, given the observed item responses and covariates, is constructed as follows:
    \begin{equation}
        P(\boldsymbol{\theta}_\nu|\boldsymbol{x}_\nu,\boldsymbol{y}_\nu,\boldsymbol{\Gamma},\boldsymbol{\Sigma}) \propto P(\boldsymbol{x}_\nu|\boldsymbol{\theta}_\nu) P(\boldsymbol{\theta}_\nu|\boldsymbol{y}_\nu,\boldsymbol{\Gamma},\boldsymbol{\Sigma}).
    \end{equation}
    
    This model estimates the regression coefficients ($\boldsymbol{\Gamma}$) and the residual variance-covariance matrix ($\boldsymbol{\Sigma}$), treating the item parameters estimated in Step 1 as known \autocite{thomas1993}.
    \item Multiple imputation, where a specific number of PVs (e.g., 5 to 20) are generated for each respondent on each cognitive domain from the estimated posterior distributions of proficiency using the estimated $\boldsymbol{\Gamma}$ and $\boldsymbol{\Sigma}$ from Step 2 \autocite{mislevysheehan1987, vondavier2009}.
    
\end{enumerate}
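
\noindent As a sketch of how the three steps fit together, the two factors of the posterior distribution can be written out under two common assumptions that are adopted here only for illustration and are not stated in the description above: local independence of the item responses given proficiency, and a multivariate normal distribution of proficiency given the covariates. Let $J_\nu$ denote the set of items administered to respondent $\nu$, let $p_j(\theta_\nu)$ denote the 2PL probability for item $j$ evaluated at the component of $\boldsymbol{\theta}_\nu$ for the domain to which item $j$ belongs, and let $M$ denote the number of PVs (all three symbols are introduced for this sketch). Then
\begin{align*}
P(\boldsymbol{x}_\nu|\boldsymbol{\theta}_\nu) &= \prod_{j \in J_\nu} p_j(\theta_\nu)^{x_{\nu j}}\,\bigl[1-p_j(\theta_\nu)\bigr]^{1-x_{\nu j}}, \\
\boldsymbol{\theta}_\nu|\boldsymbol{y}_\nu &\sim N\bigl(\boldsymbol{\Gamma}^{\top}\boldsymbol{y}_\nu,\ \boldsymbol{\Sigma}\bigr), \\
\tilde{\boldsymbol{\theta}}_{\nu m} &\sim P(\boldsymbol{\theta}_\nu|\boldsymbol{x}_\nu,\boldsymbol{y}_\nu,\hat{\boldsymbol{\Gamma}},\hat{\boldsymbol{\Sigma}}), \qquad m = 1,\dots,M.
\end{align*}
For example, with the hypothetical values $\alpha_j = 1.2$, $\beta_j = 0$, and $\theta = 0.5$, the 2PL model gives $p = \exp(0.6)/\bigl(1+\exp(0.6)\bigr) \approx 0.65$. Under this sketch, a respondent with no cognitive responses has an empty item set $J_\nu$, so the likelihood term equals one and the PVs are drawn from the covariate-based distribution alone.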

\subsection{The Challenge of Special Subpopulations in Reporting}
While effective for general populations, this methodology faces limitations when applied to special subpopulations with systematic non-response patterns. The latent regression model assumes that missingness can be explained by observed covariates (i.e., the missing at random assumption; \citeauthor{rubin1987}, \citeyear{rubin1987}). This assumption may not hold for groups with language barriers, who provide no cognitive data and for whom almost all contextual variables (or predictors) from the standard background questionnaires are unavailable.
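
\noindent As a simplified formal statement, let $R_\nu = 1$ if respondent $\nu$ provides cognitive responses and $R_\nu = 0$ otherwise (the indicator $R_\nu$ is notation assumed here for illustration and is not used elsewhere in this paper). One way to express the assumption is
\begin{equation*}
P(R_\nu|\boldsymbol{\theta}_\nu,\boldsymbol{y}_\nu) = P(R_\nu|\boldsymbol{y}_\nu),
\end{equation*}
that is, the probability of responding does not depend on proficiency once the observed covariates $\boldsymbol{y}_\nu$ are taken into account. For respondents with a language barrier, $P(R_\nu = 0|\boldsymbol{\theta}_\nu,\boldsymbol{y}_\nu)$ plausibly increases as proficiency in the assessment language decreases, so the equality can fail unless $\boldsymbol{y}_\nu$ includes covariates that capture the language barrier.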

The consequences of this limitation become evident when examining specific vulnerable groups:
\begin{enumerate}
    \item English Language Learners in NAEP: Despite accommodations, ELL students' scores often reflect language barriers rather than content knowledge \autocite{abedi2004}. Standard PV generation may underestimate their true abilities without proper linguistic covariates.
    \item Migrant Adults in PIAAC: First-generation immigrants frequently show non-response in literacy tasks, yet their occupational and educational backgrounds contain valuable information about latent proficiency \autocite{oecd2019}.
    \item Indigenous Students in TIMSS: When assessments are not available in native languages, students may leave items blank, creating non-random missing patterns that standard PV approaches fail to address adequately \autocite{wu2009}.
\end{enumerate}

\section{The PIAAC Doorstep Interview Case Study}
%PIAAC's Measurement Challenge with Literacy-Related Nonresponse

%The Programme for the International Assessment of Adult Competencies (PIAAC), OECD’s flagship survey of adult skills, assesses literacy, numeracy, and adaptive problem-solving (APS) proficiency among non-institutionalized adults aged 16–65. A persistent challenge arises from nonresponse due to language barriers, reading/writing difficulties, or disabilities—with literacy-related nonresponse (LRNR) cases representing those specifically hindered by language limitations.

%In PIAAC's first cycle (2011–2018), LRNR respondents remained part of the target population and retained sampling weights but were excluded from proficiency reporting due to insufficient data. These cases typically provided only basic demographic information (e.g., estimated age and gender) without any cognitive assessment data. Analysis confirmed that the available background variables lacked the necessary predictive power to generate plausible values (PVs), forcing their exclusion from country-level proficiency estimates.

%This approach created two critical issues:

%Representation Gap: A substantively important population segment—adults unable to function in the workforce's primary languages—was omitted from national skill profiles.

%Upward Bias: Since excluded adults would presumably demonstrate low literacy in the tested languages, their absence artificially inflated reported national averages.

%The second cycle (2023) continues to face this measurement-equity tradeoff, highlighting the need for improved methods to account for linguistically excluded populations in international assessments.

PIAAC is the OECD's international survey of adult skills. In its second cycle (2023), PIAAC measures adults' proficiency in literacy, numeracy, and adaptive problem solving (APS). The PIAAC sample represents the non-institutionalized population aged 16 to 65. Nonresponse in this survey can occur because of a language barrier, a reading or writing difficulty, or a disability. Literacy-related nonresponse (LRNR) cases are the subset of these cases with a language barrier. In the first cycle of PIAAC, LRNR cases were part of the target population, and the sampled cases were given sampling weights, but no plausible values were reported for them. These cases had very little background data (often only estimated age and gender) and no cognitive data. Analyses determined that there was insufficient information to estimate their PVs. Accordingly, these cases were not included in the country-level proficiency estimates. This created a reporting issue: a sector of the population was left out. Moreover, because that sector of the population could not function in any of the major languages needed to exercise their skills as part of the country's workforce, their skills in that context were expected to be low. Excluding them from the country's population estimates therefore leads to some overestimation.

Recognizing these limitations, PIAAC in its second cycle pioneered an enhanced data collection protocol for literacy-related non-respondents. To improve population estimates in the second cycle of PIAAC, an ``abbreviated background questionnaire'' known as the doorstep interview was created. The strategy behind the doorstep interview was to provide a specific instrument delivered by an interviewer to collect targeted background information with high predictive power from respondents unable to speak the assessment language. The doorstep background variables were: 
\begin{enumerate}
    \item Gender
    \item Age
    \item Educational level
    \item Employment status
    \item Country of birth
    \item Number of years in the country if immigrant (i.e., not native-born) 
\end{enumerate}
These six variables embedded in the doorstep interview were also present in the full background questionnaire. The doorstep interview was available in 28 PIAAC background questionnaire languages and 15 additional minority languages, from which participating countries selected the language(s) that fit their minority groups. For these respondents, the language of the full background questionnaire was known not to be their native language. Because collecting a full background questionnaire was always preferable, the doorstep interview was administered only if a translator was not available or if no one in the home could act as an interpreter. As an abbreviated background questionnaire, the doorstep interview was designed to provide enough information to estimate PVs for these respondents.

Research demonstrated these variables significantly improved proficiency estimation for language-barrier populations \autocite{paccagnella2021}. By enriching the latent regression model with these carefully selected covariates, PIAAC achieved more accurate population estimates while maintaining the integrity of the PV framework. 

\subsection{Focus and Contribution of This Study}
In this study, we target different imputation models and compare their performance under the current features of the PIAAC assessment design to generate PVs across all cognitive domains for LRNR cases. The goal is to enhance the quality and inclusiveness of reporting by ensuring that no subpopulations are excluded. The study focuses on evaluating which imputation model best accounts for the unique characteristics of LRNR respondents while preserving the integrity and comparability of the assessment results.

This study makes three key contributions to the measurement literature:
\begin{enumerate}
    \item Methodological Extension: We developed and compared alternative IRT and latent regression approaches to model and generate plausible values for a small subpopulation with only very limited data (i.e., a few key background variables and no cognitive information) but for which a priori expectations about performance exist.
    %We compared latent regression IRT models that deals differently with language-barrier subpopulations, implemented through variable selection the contextual variables used along with or without the imputed cognitive data for these cases.
    \item Empirical Validation: Focusing on PIAAC's language-barrier subgroup (i.e., those who were administered the doorstep interview), we demonstrated that it was possible to generate plausible values in a way that:
    %Using PIAAC's doorstep interview data, we demonstrate how targeted covariate collection:
   \begin{itemize}
        \item Reduces bias in population parameters
        \item Improves the accuracy of proficiency estimates for non-respondents
        \item Maintains reliability compared to external benchmarks
    \end{itemize}
    \item Practical Framework: We provide guidelines for assessment programs to:
    \begin{itemize}
        \item Identify high-impact contextual variables during instrument development
        \item Implement adaptive data collection protocols for special subpopulations
        \item Integrate subgroup-specific modeling into standard PV methodology
    \end{itemize}    
\end{enumerate}

\section{Method}

\subsection{Data}
%Initial PIAAC main study data from four countries with variable percentages of DI cases was used in this research. Given that DI cases across all countries is less 2\%, two countries (C1 and C2) with relatively high DI percentage (above 2 and 5\%) were selected and two countries (C3 and C4) with relatively low DI percentage (below 1\%). See Table~\ref{table_data} for the selected countries with more information the number of cases as associated percentage of DI, those who failed both literacy and numeracy (Path 1 or P1), and those who failed the locator at least in one cognitive domain either literacy or numeracy (F1) cases. 

This study used initial PIAAC main study data from four countries, varying in their percentages of doorstep cases. Since doorstep interview cases accounted for less than 2\% of respondents across all countries, we selected two countries (C1 and C2) with higher doorstep interview rates (above 2\% and 5\%, respectively) and two (C3 and C4) with lower rates (below 1\%). Table~\ref{table_data} provides further details, including:
\begin{itemize}
    \item The number of doorstep interview cases and their percentages.
    \item Path 1 respondents who failed the Locator in both literacy and numeracy.
    \item Respondents who failed the locator in at least one domain (literacy or numeracy).
\end{itemize}

Note that, because doorstep interview cases have a language barrier, they are expected to perform in the assessment language only at a level equivalent to that of someone with very low skills. Consequently, their path through the assessment (see Figure~\ref{fig_design}) would be equivalent to failing the Locator (Path 1).

\begin{table}[hbt!]
\begin{threeparttable}
\caption{Selected participating countries for study}
\label{table_data}
\begin{tabular}{lrrc}
\toprule
\headrow Country ID & \multicolumn{3}{c}{Unweighted sample N (\%) of $...$ cases} \\
\cmidrule(lr){2-4}
 & DI\tnote{a} & P1\tnote{b} & F1\tnote{c} \\
\midrule
C1 & 234 (3.5) & 80 (1.2) & 228 (3.4)\\ 
\midrule
C2 & 897 (14.3) & 25 (0.5) & 300 (4.8)\\
\midrule
C3 & 35 (0.6) & 100 (1.6) & 619 (9.8)\\ 
\midrule
C4 & 36 (0.6) & 98 (1.5) & 633 (10.0)\\
\bottomrule
\end{tabular}
\begin{tablenotes}[hang]
\item[a] DI = Doorstep Interview
\item[b] P1 = Path 1 
\item[c] F1 = Failed in at least one domain
\end{tablenotes}
\end{threeparttable}
\end{table}


\begin{figure*}
\centering
\includegraphics[width=0.75\linewidth]{Figures/PIAAC_design.png}
\caption{PIAAC general assessment design  ( {\it Note. The horizontal dashed line indicates the cut score for Proficiency Level 1, set at 176.} )}
\label{fig_design}
\end{figure*}

\subsection{Targeted sample: Doorstep-like cases}

Sampled persons with language-barrier nonresponse were presumed to have proficiency distributions in the cognitive domains (in the assessment language) distinct from those of regular individuals in the target population. The doorstep interview cases are operationally comparable to Path 1 cases (i.e., those failing both the literacy and numeracy sections of the Locator), with most expected to perform at the lowest proficiency levels (i.e., Proficiency Level 1 or below). We used Path 1 cases as a benchmark for doorstep interview case performance across models. By design, Path 1 cases are routed to basic items of the reading and numeracy components, resulting in cognitive data that include literacy and numeracy performance data without any APS data (see Figure~\ref{fig_design}). This Path 1 data limitation would prevent APS proficiency estimation if Path 1 cases were used exclusively in the imputation models. To include APS data, we extended the sample to create a doorstep-like sample (denoted F1 in Table~\ref{table_data}), comprising:
\begin{itemize}
    \item Path 1 cases (failing both domains)
    \item Cases failing exactly one domain (literacy or numeracy Locator)
\end{itemize}

\subsection{Imputation models in comparison}

%We examined three latent regression IRT models for generating PVs for doorstep interviewcases by varying these conditions:
%\begin{itemize}
%    \item Condition 1 (Respondent sample): All cases versus targeted cases
%    \item Condition 2 (Conditioning variables): Full background questionnaire (BQ) variables versus abbreviated doorstep interviewvariables
%    \item Condition 3 (Cognitive Data): No cognitive data versus imputed cognitive data for doorstep interviewcases
%\end{itemize}

We examined three alternative latent regression IRT models (i.e., imputation models), each using different dataset conditions, for estimating the model and generating PVs for the doorstep cases. The studied models were the following:
\begin{itemize}
    \item Model 1 (Base Model): Uses the full sample and full background questionnaire variables (current reporting methodology). Note that for doorstep interview cases, data for background variables other than the six doorstep-specific variables are missing by design. In the latent regression model, all non-doorstep interview background variables were coded with a ``missing'' category (e.g., gender includes three response categories: male, female, and missing), while the six available doorstep interview variables retained their actual values. This means the model conditioned estimates on both the known doorstep interview variables and the missing responses of other background variables.
    \item Model 2: Uses only doorstep-like cases (i.e., the target cases as defined and justified in the previous section) with the abbreviated background questionnaire (i.e., doorstep interview) variables.
    \item Model 3: Uses the full sample but only the doorstep interview variables, excluding all other contextual variables available in the full background questionnaire; that is, for non-doorstep cases, all of these additional variables are turned off.
\end{itemize}

Accordingly, the dataset conditions were defined by:
\begin{itemize}
    \item Respondent sample: All cases, or only the targeted cases (e.g., P1 or F1 cases)
    \item Conditioning variables: Full background questionnaire variables, or only the  doorstep interview variables
    \item Cognitive data for the doorstep interview sample: No cognitive data, or imputed cognitive data for doorstep interview cases
\end{itemize}


In the cognitive data imputation process, item scores were imputed (using single imputation) for all 16 literacy and numeracy locator items to mimic the responses provided by Path 1 respondents within each country. For each doorstep respondent, this was done by:
(a) drawing a proficiency value from the Path 1 posterior theta distribution (averaged across all Path 1 respondents' posteriors); and
(b) drawing correct or incorrect responses based on the drawn theta and the international IRT model for each item.
Therefore, both the background information from the doorstep interview and the imputed cognitive items were used in estimating the proficiencies for those doorstep interview cases.
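
\noindent In symbols, steps (a) and (b) can be sketched as follows. The notation is introduced here for illustration and is not taken from the operational documentation: $\mathcal{P}_c$ is the set of Path 1 respondents in country $c$, $N_c$ is its size, $d$ indexes a doorstep interview case in that country, and $p_j(\theta)$ is the response probability for Locator item $j$ under the international item parameters, assumed here to be dichotomous and of the 2PL form. The text does not specify whether the proficiency draw is made jointly or separately by domain, so the sketch is written for a single domain.
\begin{align*}
\bar{g}_c(\theta) &= \frac{1}{N_c}\sum_{\nu \in \mathcal{P}_c} P(\theta|\boldsymbol{x}_\nu,\boldsymbol{y}_\nu,\hat{\boldsymbol{\Gamma}},\hat{\boldsymbol{\Sigma}}), \\
\theta_d^{*} &\sim \bar{g}_c(\theta), \\
x_{dj}^{*}\,|\,\theta_d^{*} &\sim \text{Bernoulli}\bigl(p_j(\theta_d^{*})\bigr) \quad \text{for each Locator item } j.
\end{align*}
Because single imputation is used, each doorstep interview case receives one draw $\theta_d^{*}$ and one imputed response vector, which then enters the latent regression model together with the six doorstep interview variables.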

\begin{table}[hbt!]
\begin{threeparttable}
\caption{Sample and conditioning variables used in the studied models}
\label{table_models}
\begin{tabular}{lcc}
\toprule
\headrow  & Full BQ\tnote{a} & Abbreviated DI\tnote{b} \\
\midrule
All Cases & Model 1 & Model 3\\ 
\midrule
Targeted Cases & N/A & Model 2\\
\bottomrule
\end{tabular}
\begin{tablenotes}[hang]
\item[a] BQ = Background Questionnaire 
\item[b] DI = Doorstep Interview
%\item[c] F1 = Failed at least in one domain
\end{tablenotes}
\end{threeparttable}
\end{table}

%Model 1 is the base imputation model that is currently used for reporting scores. This model used the full sample and full background variables. On the contrary of Model 1 is Model 2. Model 2 used a targeted sample of those doorstep-like cases and the doorstep interview variables. Model 3 used the full sample but only conditioning on doorstep interview variables. These three models can be used with or without imputing cognitive data to doorstep interview cases. 

The studied models are illustrated in Table~\ref{table_models}. Each model can be implemented with or without cognitive data imputation for doorstep interview cases. For cognitive data imputation, we used Path 1 respondents (failing in both literacy and numeracy sections of the Locator). Country-specific Path 1 posterior proficiency distributions were applied to impute responses for all Locator items (eight literacy and eight numeracy) for each doorstep interview case. The PIAAC assessment design is shown in Figure~\ref{fig_design}. 

\section{Results}

Figure~\ref{fig_models123} shows the country-level proficiency mean plus and minus one standard deviation ($\pm 1$ SD) for doorstep interview cases with (right panels) and without (left panels) imputed cognitive data under Models 1, 2, and 3 for the three cognitive domains: literacy, numeracy, and APS. The results of comparing the three imputation models with and without imputation of cognitive data are summarized as follows:
\begin{itemize}
    \item Model 1: As the base model for regular (non-doorstep) respondents, it produced extremely low scores for doorstep interview cases regardless of imputation. This was expected because Model 1 is unsuitable for doorstep interview cases due to extensive missing covariates (six available variables versus 240+ in the full background questionnaire). The severe missingness prevents reliable PV estimation for doorstep interview cases.  
    \item Model 2 with imputed cognitive data and Model 3 without imputed cognitive data yielded unsatisfactory results (e.g., higher performance for the doorstep cases exceeding proficiency level 1) because: 
        \begin{itemize}
        \item Model 2 with imputed cognitive data involves ``double dipping'' (i.e., using two features that would limit the performance of the doorstep interview cases: using a sample of doorstep-like cases in addition to imputing cognitive data based on Path 1 cases) and 
        \item Model 3 without imputed cognitive data fails to distinguish doorstep cases from other respondents  
        \end{itemize}
    \item Model 2 without imputed cognitive data: Produced reasonable results but required inclusion of higher-performing non-Path 1 cases to obtain APS data, which biased doorstep case estimates (based solely on demographics).  
    \item Model 3 with imputed cognitive data: Emerged as the recommended approach, providing:  
        \begin{itemize}
        \item More consistent cross-country/domain results  
        \item  Proper utilization of Path 1 proficiency distributions for imputation  
        \end{itemize}
\end{itemize}
%\begin{figure}[hbt!]
%\centering
%\includegraphics[width=0.6\linewidth]{Figures/LIT_no_imputation.png} \\
%\includegraphics[width=0.6\linewidth]{Figures/NUM_no_imputation.png} \\
%\includegraphics[width=0.6\linewidth]{Figures/APS_no_imputation.png}
%\caption{Proficiency Mean (+/- SD) for doorstep interview cases without imputation in Models 1, 2 and 3}
%\label{fig_no_imputation}
%\end{figure}

\begin{figure}[hbt!]
\centering
\begin{tabular}{cc}
\includegraphics[width=0.5\linewidth]{Figures/LIT_no_imputation.png} & \includegraphics[width=0.5\linewidth]{Figures/LIT_imputation.png} \\
\includegraphics[width=0.5\linewidth]{Figures/NUM_no_imputation.png} & \includegraphics[width=0.5\linewidth]{Figures/NUM_imputation.png} \\
\includegraphics[width=0.5\linewidth]{Figures/APS_no_imputation.png} &\includegraphics[width=0.5\linewidth]{Figures/APS_imputation.png} \\
\end{tabular}
\caption{Proficiency Mean (+/- SD) for Doorstep Interview Cases with and without Imputed Cognitive Data in Models 1, 2 and 3 ({\it Note. The horizontal dashed line indicates the cut score for Proficiency Level 1, set at 176.})}
\label{fig_models123}
\end{figure}

Both imputation models—Model 2 without imputed cognitive data and Model 3 with imputed cognitive data—yielded substantively reasonable results. Figure \ref{fig_DIpath1_Mod23} compares their performance in estimating proficiency distributions for doorstep interview cases relative to Path 1 cases, revealing two key insights:
    \begin{itemize}
        \item  Model 3 with imputed cognitive data produced proficiency distributions for doorstep interview cases that closely aligned with Path 1 cases, suggesting successful recovery of latent ability patterns through cognitive data imputation.  
        \item  Model 2 without imputed cognitive data showed divergent distributions, indicating that having no cognitive data leads to meaningfully different proficiency estimates.  
    \end{itemize}
These results demonstrate the value of incorporating cognitive data through imputation when analyzing incomplete assessment records.

%Two imputation models (Model 2 without imputation and Model 3 with cognitive data) provided very reasonable results. Figure \ref{fig_DIpath1_Mod23} shows how these two models performed comparing the performance of doorstep interview cases to that of Path 1 cases. These comparisons illustrated in Figure \ref{fig_DIpath1_Mod23} provided more insight about these two models where we found that the proficiency distribution of doorstep interview cases is similar to that of Path 1 cases under Model 3 with imputed cognitive data. This was not the case under Model 2 without imputation.

\begin{figure}[hbt!]
\centering
\includegraphics[width=0.6\linewidth]{Figures/recommended_LIT.png} \\
\includegraphics[width=0.6\linewidth]{Figures/recommended_NUM.png} \\
\includegraphics[width=0.6\linewidth]{Figures/recommended_APS.png}
\caption{Proficiency Mean (+/- SD) for Doorstep Interview and Path 1 cases under Recommended Settings of Models 2 and 3 ({\it Note. The horizontal dashed line indicates the cut score for Proficiency Level 1, set at 176.})}
\label{fig_DIpath1_Mod23}
\end{figure}

Based on these findings, the PIAAC Technical Advisory Group recommended evaluating a modified approach that maintains Model 3's core structure while addressing concerns about imputation. The proposed alternative, referred to as Model 3 with DI-like variable, eliminates cognitive data imputation for doorstep interview cases but introduces a new binary conditioning variable (coded 1 for Doorstep Interview/Path 1 cases and 0 otherwise) alongside the original six demographic variables. This modified approach preserves the seven-variable framework while offering a distinct methodological solution. 
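
\noindent One way to write this modification, using notation assumed here for illustration, is to add an indicator $d_\nu$ to the conditional mean of the latent regression. With $\boldsymbol{y}_\nu$ denoting the coded six doorstep interview variables, and under a multivariate normal specification (an assumption of this sketch),
\begin{equation*}
\boldsymbol{\theta}_\nu|\boldsymbol{y}_\nu, d_\nu \sim N\bigl(\boldsymbol{\Gamma}^{\top}\boldsymbol{y}_\nu + \boldsymbol{\delta}\, d_\nu,\ \boldsymbol{\Sigma}\bigr),
\qquad
d_\nu =
\begin{cases}
1 & \text{if respondent } \nu \text{ is a Doorstep Interview or Path 1 case}, \\
0 & \text{otherwise},
\end{cases}
\end{equation*}
where the vector $\boldsymbol{\delta}$ (a hypothetical symbol) holds the domain-specific shifts for doorstep-like cases. In this sketch, $\boldsymbol{\delta}$ is informed by the cognitive responses of cases with $d_\nu = 1$, so a domain in which those cases have no responses contributes no direct information about the corresponding component of $\boldsymbol{\delta}$.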

Figure~\ref{fig_final} compares country-level means ($\pm$ SD) for both versions of Model 3, demonstrating that the original imputation-based approach yields superior results.
Specifically, Model 3 with cognitive data imputation provides more consistent cross-country and cross-domain estimates by leveraging country-specific Path 1 proficiency distributions to inform the imputation process. As expected, both models showed consistent performance for literacy and numeracy. However, the modified version of Model 3 failed to limit the performance of the doorstep interview cases as intended because, by design, the doorstep-interview-like cases have cognitive data only for literacy and numeracy but not APS. Consequently, the doorstep-interview-like variable did not effectively constrain the performance of doorstep cases, as evidenced by the low variability in outcomes (i.e., the performance estimates with and without doorstep interview cases were very close). Figure \ref{fig_DIpath1_Mod3} reveals the overestimation (reaching Proficiency Level 3) of doorstep interview cases and Path 1 cases under the modified version of Model 3, compared to the expected performance of Path 1 cases under more robust model specifications like Model 3 with imputed cognitive responses (see Figure \ref{fig_DIpath1_Mod23}). These results confirm the value of carefully implemented cognitive data imputation for maintaining estimation accuracy in large-scale assessments.

%Based on these results, PIAAC technical advisory group suggested to check the results of a modified version of the recommended model (Model 3 with imputed cognitive data). The suggested model will not use imputation but adds a conditioning variable for doorstep-like cases; where “Doorstep Interview” and “Path 1” cases are coded as “1” and “0” otherwise. The new version of Model 3 (referred as Model 3 w/ DI-like variable) has two main characteristics: (a) using Model 3 without imputing cognitive data for doorstep interview cases, and (b) inclusion of an additional conditional variable: Doorstep-like variable to have a total of seven conditioning variables. Note that there is still no APS data in Path 1 cases. Figure \ref{fig_final} the country mean (+/- SD) with and without doorstep interview cases under two versions of Model 3. In summary, Model 3 with cognitive data imputed for doorstep interview cases provides better results: (a) Provides more consistent results across countries and domains and (b) Use country-specific Path 1 proficiency distribution to impute cognitive data. 

\begin{figure}[hbt!]
\centering
\includegraphics[width=0.6\linewidth]{Figures/LIT_final.png} \\
\includegraphics[width=0.6\linewidth]{Figures/NUM_final.png} \\
\includegraphics[width=0.6\linewidth]{Figures/APS_final.png}
\caption{Country Mean (+/- SD) with and without Doorstep Interview Cases under Two Versions of Model 3 (\textit{Note. The horizontal dashed line indicates the cut score for Proficiency Level 1, set at 176.})}
\label{fig_final}
\end{figure}

\begin{figure}[hbt!]
\centering
\includegraphics[width=0.6\linewidth]{Figures/APS_final_P1vsDI.png}
\caption{Proficiency Mean (+/- SD) for Doorstep Interview and Path 1 Cases under Two Versions of Model 3 (\textit{Note. The horizontal dashed line indicates the cut score for Proficiency Level 1, set at 176.})}
\label{fig_DIpath1_Mod3}
\end{figure}


\section{Conclusion}
This study examined methodological approaches for addressing literacy-related nonresponse (LRNR) in large-scale survey assessments, with three key findings:

First, standard imputation procedures that rely on the missing-at-random assumption prove inadequate for LRNR cases, because language barriers create missing-not-at-random patterns that correlate with the assessed competencies. Our analysis demonstrates that conventional models such as Model 1 (which uses the full background questionnaire) produce unreliable estimates for these special subpopulations because of extensive missing covariates.

Second, among alternative approaches, Model 3 with cognitive data imputation emerged as the most effective solution, providing:
\begin{itemize}
\item Consistent proficiency estimates across countries and domains
\item Proper utilization of Path 1 respondent distributions
\item Reduced bias compared to non-imputation approaches
\end{itemize}

Third, the study highlights the critical trade-off between methodological limitations and representation: although no current approach is ideal, excluding LRNR cases introduces greater bias than model-based inclusion. This work advances assessment practice by:
\begin{itemize}
\item Validating an imputation framework for language-barrier cases
\item Demonstrating how demographic data can support more inclusive scoring
\item Establishing principles for handling non-ignorable nonresponse
\end{itemize}

Future research should explore hybrid designs combining doorstep interviews with refined imputation techniques. Nevertheless, this study provides actionable solutions for maintaining both validity and inclusivity in international assessments facing growing linguistic diversity.

%Large-scale survey assessments are group-score assessments, where results are reported on a target population (e.g., 15-year-old students; 4th grade students). Nonresponse can be a source of bias in survey estimates if the characteristics of the survey respondents differ from those of the nonrespondents (Van de Kerckhove et al., 2013).  Nonresponse can happen at different levels: e.g., school level (Menick et al., 2017; 7-33\% of countries failed to meet the minimum participation rates), person level, or item level (Köhler et al., 2017). Weighting and imputation procedures are often used to adjust for nonresponse and reduce potential nonresponse bias. Standard procedures are based on a missing at random assumption. 

%Non-ignorable nonresponse occurs when the reason for nonresponse is directly related to the survey outcome. Nonignorable nonresponse is also known as missing not at random. An example would be persons who cannot complete a health survey because they are too ill or nonresponse to a cognitive assessment due to literacy-related reasons. Work is going on to develop procedures to address this type of nonresponse: The concept of nonresponse questionnaires: these are shortened instruments applied to nonrespondents and aim to capture information that correlates with survey’s main outcomes (Menick et al., 2017). 

%The estimation of proficiency in special subpopulations, particularly those with language barriers, presents significant challenges that require innovative methodologies, ethical commitment, and best practices for inclusivity. The implications of excluding these individuals from large-scale assessments are far-reaching, influencing policy and resource allocation in educational and social services. Addressing these issues through research and practice will be crucial in ensuring that proficiency estimates provide accurate reflections of the entire population's capabilities. Enhanced methodologies will ultimately lead to more accurate and equitable assessments.

%In large-scale survey assessments, the inclusion of imputation models specifically tailored for linguistically diverse and incomplete-responding subpopulations not only enhances data inclusivity but also reinforces the validity of inferences drawn from assessment data. Implementing such models is particularly relevant as large-scale assessments expand to include countries with increasingly diverse linguistic demographics. Moreover, by integrating imputation techniques that respect the structure of partially observed data, international surveys like PIAAC and PISA can make better use of available demographic data, maintaining representation and reducing systematic bias introduced by language-related nonresponse.

%"You may have the freedom to choose, but not true freedom of choice (I. Kirsch, personal communication, 2024).`` While the model-based approach outlined above may not be ideal, the alternative—excluding a segment of the target population—risks introducing bias into the results. Further research is needed to develop improved designs that capture sufficient information from special subpopulations. Examples include the use of nonresponse questionnaires or doorstep interviews, as implemented in PIAAC and discussed by Menick et al. (2017).

%This study contributes to the literature by examining and evaluating an imputation framework designed for respondents facing language barriers in cognitive assessments. By leveraging available demographic information, this approach aligns with recent advancements in imputation modeling and aims to advance best practices for including underrepresented groups in global survey assessments.

\begin{acknowledgement}
The authors acknowledge the support of their psychometric and data analysis team colleagues at ETS Research Institute: Lokesh Kapur, Wei Zhao, and Mathew Kandathil.
\end{acknowledgement}

\paragraph{Funding Statement}

This work was conducted as part of a contract between ETS and the OECD (Organisation for Economic Co-operation and Development) for the implementation of the second cycle of the Programme for the International Assessment of Adult Competencies. Any views expressed in this paper are solely those of the authors and do not necessarily reflect those of the OECD or its member countries.

\paragraph{Competing Interests}

None.


%\endnote in some journals will behave like \footnote; and \printendnotes will not output anything. 
%\printendnotes

\printbibliography


\end{document}







 
