Offline Policy Learning under Compliance Uncertainty: Adoption-Aware Decision-Making with Observational-to-RCT Calibration Drift
Keywords: offline policy learning, contextual bandits, off-policy evaluation, observational-to-RCT calibration, conformal prediction, compliance-aware decision-making, recommender systems, smallholder agricultural ML
Abstract: Recommender systems are routinely trained on offline observational data and deployed to make per-user decisions, where downstream success depends jointly on the offline-learned outcome model and on whether the user actually adopts the recommendation, a quantity logged for observed actions but not counterfactual ones. We study this offline-to-online gap concretely in smallholder agricultural recommendation, a setting that combines offline contextual-bandit decision-making, structured action complexity, and externally identified counterfactuals from agricultural-economics RCTs that permit direct calibration testing of any offline policy. We formalize Expected Realized Benefit (ERB), an offline objective that internalizes per-action adoption likelihood: ERB(a | x) = P(adopt a | x) · (E[Y | a] − E[Y | a_0]). Counterfactual queries on actions never observed for a given context are addressed with a hybrid architecture (ML-learned baseline + literature-anchored linear response); we explicitly position the resulting estimator as a direct-method off-policy estimator and defer IPS/DR alternatives that require identified logging-policy propensities (unavailable in our setting). On a real LSMS-ISA Ethiopia panel (19,339 plot-level observations from 6,770 households), our offline policy delivers a statistically significant aggregate ERB lift of +50 kg/ha over an accuracy-only baseline (bootstrap 95% CI [+24, +74], n = 800 EA-disjoint test plots), robust to four orthogonal sensitivity sweeps: penalty-magnitude perturbation, a yield-minus-λ·complexity heuristic baseline, an adoption-on-yield feedback-loop sweep, and a hybrid-vs-ML-only ablation. We then probe the observational-to-RCT calibration drift: external validation against the Duflo–Kremer–Robinson (2011) Kenya SAFI fertilizer experiment shows that the offline-trained adoption head reproduces the experimentally measured directional ordering of treatment effects but with substantial absolute miscalibration (∼50 percentage points on out-of-distribution actions). This characterizes a concrete and reproducible offline-to-online drift in compliance probability and motivates OS→RCT calibration extensions for any offline policy. Code, harmonization pipelines, and the RCT-validation harness are released.
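As a rough illustration of the ERB objective defined in the abstract, the following minimal sketch scores each candidate action by adoption probability times predicted yield uplift over the baseline action, then compares the ERB-maximizing choice with a choice based on predicted yield alone. The function name, the three-action setup, and every number are hypothetical and not taken from the paper. Reading "accuracy-only baseline" as an argmax over predicted yield is also an assumption made here for illustration.

```python
import numpy as np


def expected_realized_benefit(p_adopt, y_hat, baseline_idx=0):
    """ERB(a | x) = P(adopt a | x) * (E[Y | a] - E[Y | a_0]) for each action.

    p_adopt: adoption probabilities per action for one context x.
    y_hat:   predicted outcomes per action (e.g. yield in kg/ha).
    baseline_idx: position of the baseline action a_0 (assumed to be 0 here).
    """
    p_adopt = np.asarray(p_adopt, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    uplift = y_hat - y_hat[baseline_idx]  # outcome gain over the baseline action
    return p_adopt * uplift


# Hypothetical values for a single plot: a_0 (status quo), a_1, a_2.
y_hat = np.array([1500.0, 1800.0, 2100.0])  # illustrative predicted yields, kg/ha
p_adopt = np.array([1.0, 0.6, 0.2])         # illustrative adoption probabilities

erb = expected_realized_benefit(p_adopt, y_hat)
erb_action = int(np.argmax(erb))              # adoption-aware choice
accuracy_only_action = int(np.argmax(y_hat))  # assumed yield-only choice

print("ERB per action:", erb)
print("ERB-maximizing action:", erb_action)
print("Yield-maximizing action:", accuracy_only_action)
```

In this toy case the highest-yield action is rarely adopted, so the adoption-aware rule prefers the intermediate action. This is the kind of divergence between the two policies that the ERB objective is meant to capture.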
Submission Number: 163