Offline Policy Learning under Compliance Uncertainty: Adoption-Aware Decision-Making with Observational-to-RCT Calibration Drift

Published: 25 May 2026, Last Modified: 27 May 2026
Venue: DEMO 2026 Poster
Readers: Everyone
License: CC BY 4.0
Keywords: offline policy learning, contextual bandits, off-policy evaluation, observational-to-RCT calibration, conformal prediction, compliance-aware decision-making, recommender systems, smallholder agricultural ML
Abstract: Recommender systems are routinely trained on offline observational data and deployed to make per-user decisions, where downstream success depends jointly on the offline-learned outcome model and on whether the user actually adopts the recommendation, a quantity logged for observed actions but not for counterfactual ones. We study this offline-to-online gap concretely in smallholder agricultural recommendation, a setting that combines offline contextual-bandit decision-making, structured action complexity, and externally identified counterfactuals from agricultural-economics RCTs that permit direct calibration testing of any offline policy. We formalize Expected Realized Benefit (ERB), an offline objective that internalizes per-action adoption likelihood: $\mathrm{ERB}(a \mid x) = P(\text{adopt } a \mid x) \cdot \left(E[Y \mid a] - E[Y \mid a_0]\right)$. Counterfactual queries on actions never observed for a given context are addressed with a hybrid architecture (ML-learned baseline + literature-anchored linear response); we explicitly position the resulting estimator as a direct-method off-policy estimator and defer IPS/DR alternatives that require identified logging-policy propensities (unavailable in our setting). On a real LSMS-ISA Ethiopia panel (19,339 plot-level observations from 6,770 households), our offline policy delivers a statistically significant aggregate ERB lift over an accuracy-only baseline of +50 kg/ha (bootstrap 95% CI [+24, +74], n = 800 EA-disjoint test plots), robust to four orthogonal sensitivity sweeps: penalty-magnitude perturbation, a yield-minus-λ·complexity heuristic baseline, an adoption-on-yield feedback-loop sweep, and a hybrid-vs-ML-only ablation.
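As a minimal sketch of how the ERB objective differs from an accuracy-only rule, the snippet below scores each action by its adoption probability times its predicted gain over the baseline action a0, then picks the maximizer. The function names, action labels, and all numbers are hypothetical placeholders chosen for illustration; they are not taken from the paper's models or data.

```python
from typing import Dict, Tuple


def expected_realized_benefit(p_adopt: float, y_action: float, y_baseline: float) -> float:
    """ERB(a | x) = P(adopt a | x) * (E[Y | a] - E[Y | a0])."""
    return p_adopt * (y_action - y_baseline)


def erb_policy(
    p_adopt: Dict[str, float],
    y_hat: Dict[str, float],
    baseline: str,
) -> Tuple[str, Dict[str, float]]:
    """Score every action by ERB for one context and return the best one."""
    scores = {
        a: expected_realized_benefit(p_adopt[a], y_hat[a], y_hat[baseline])
        for a in y_hat
    }
    return max(scores, key=scores.get), scores


# Hypothetical predicted yields (kg/ha) and adoption probabilities for one context.
y_hat = {"a0": 1000.0, "simple": 1200.0, "complex": 1500.0}
p_adopt = {"a0": 1.0, "simple": 0.8, "complex": 0.2}

erb_choice, erb_scores = erb_policy(p_adopt, y_hat, baseline="a0")
accuracy_only_choice = max(y_hat, key=y_hat.get)  # ignores adoption entirely

print(erb_choice, accuracy_only_choice)
```

In this toy context the accuracy-only rule recommends the highest-yield action even though it is rarely adopted, while the ERB rule prefers the action with the larger adoption-weighted gain.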
We then probe the observational-to-RCT calibration drift: external validation against the Duflo–Kremer–Robinson (2011) Kenya SAFI fertilizer experiment shows the offline-trained adoption head reproduces the experimentally measured directional ordering of treatment effects but with substantial absolute miscalibration (∼50 percentage points on out-of-distribution actions). This characterizes a concrete and reproducible offline-to-online drift in compliance probability and motivates OS→RCT calibration extensions for any offline policy. Code, harmonization pipelines, and the RCT-validation harness are released.
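The two calibration findings above (correct directional ordering, large absolute gap) can be separated with two simple checks. The sketch below is one possible way to do this, not the released validation harness: it compares the rank order of predicted and experimentally measured adoption rates, and reports the per-arm absolute gap in percentage points. The arm names and rates are hypothetical placeholders, not values from the RCT or the paper.

```python
from typing import Dict, List


def rank_order(rates: Dict[str, float]) -> List[str]:
    """Arms sorted from lowest to highest adoption rate."""
    return sorted(rates, key=rates.get)


def same_directional_ordering(predicted: Dict[str, float], experimental: Dict[str, float]) -> bool:
    """True when the model ranks the arms in the same order as the experiment."""
    return rank_order(predicted) == rank_order(experimental)


def abs_gap_pp(predicted: Dict[str, float], experimental: Dict[str, float]) -> Dict[str, float]:
    """Absolute miscalibration per arm, in percentage points."""
    return {arm: abs(predicted[arm] - experimental[arm]) * 100.0 for arm in experimental}


# Hypothetical adoption rates for two treatment arms (illustration only).
predicted = {"arm_a": 0.70, "arm_b": 0.90}
experimental = {"arm_a": 0.20, "arm_b": 0.50}

ordering_ok = same_directional_ordering(predicted, experimental)
gaps = abs_gap_pp(predicted, experimental)

print(ordering_ok, gaps)
```

A model can pass the ordering check while failing the gap check, which is the pattern the abstract describes.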
Submission Number: 163