Keywords: agent memory, retrieval-augmented generation, evaluation protocols, distribution shift, memory robustness, long-context reasoning, retrieval evaluation
TL;DR: We introduce ShiftBench, a lightweight protocol for measuring agent memory recovery under distribution shift, showing that controlled session-boundary interruptions expose post-shift failures hidden by aggregate retrieval accuracy.
Abstract: Selecting memory policies by long-horizon accuracy can be misleading under shift, because rankings may reverse when evaluated by post-shift recovery. We introduce ShiftBench, a lightweight protocol defining shift segments and Recovery@T on LoCoMo and HaluMem-Long. On LoCoMo, lexical baselines (TF--IDF methods) show reversal under interruption (Spearman $\rho=-0.30$, inversion $0.60$), and alignment drops from $0.94$ to $0.70$ ($\Delta \rho=0.24$, 95\% CI $[0.12, 0.37]$). On HaluMem-Long, reversal is smaller but still present ($\rho=0.02$, inversion $0.50$). Overall, ShiftBench shows that post-shift recovery is a distinct evaluation axis that can change memory-policy selection.
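To make the two ranking statistics in the abstract concrete, the following is a minimal sketch of how Spearman's $\rho$ and a pairwise inversion rate could be computed between a ranking by long-horizon accuracy and a ranking by post-shift recovery. It rests on several assumptions that the abstract does not confirm: that "inversion" means the fraction of policy pairs whose relative order flips between the two metrics, that scores have no ties, and that higher is better for both metrics. The policy names and scores are hypothetical toy values, not results from ShiftBench.

```python
from itertools import combinations


def ranks(scores):
    """Map each policy to its rank (1 = best). Assumes no tied scores."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(order)}


def spearman_rho(a, b):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))


def inversion_rate(a, b):
    """Fraction of policy pairs ordered differently by the two metrics (assumed definition)."""
    pairs = list(combinations(a, 2))
    flipped = sum((a[x] - a[y]) * (b[x] - b[y]) < 0 for x, y in pairs)
    return flipped / len(pairs)


# Hypothetical toy scores for five memory policies (illustration only).
accuracy = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6, "E": 0.5}
recovery = {"A": 0.6, "B": 0.8, "C": 0.5, "D": 0.9, "E": 0.7}

rho = spearman_rho(accuracy, recovery)
inv = inversion_rate(accuracy, recovery)
print(f"rho={rho:.2f}, inversion={inv:.2f}")
```

A negative $\rho$ together with an inversion rate above $0.5$ means that more policy pairs swap order than keep it, which is the sense in which selecting a policy by aggregate accuracy can pick a different winner than selecting by post-shift recovery.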
Submission Number: 84