Abstract: Humans are remarkably capable of zero-shot generalization when performing tasks in new settings, even when the task was learned entirely by observing others. In this work, we show that current imitation-based policy learning methods do not share this capability and lack robustness to even minor shifts from the training environment. To demonstrate these limitations, we propose a testing protocol that new methods may use as a benchmark. We implement and evaluate KitchenShift, an instance of our testing protocol that applies domain shifts to a realistic kitchen environment. We train policies from RGB image observations using a set of demonstrations for a multi-stage robotic manipulation task in the kitchen environment. Using KitchenShift, we evaluate imitation and representation learning methods used in current policy learning approaches and find that they are not robust to visual changes in the scene (e.g., lighting, camera view) or changes in the environment state (e.g., orientation of an object). With our benchmark, we hope to encourage the development of algorithms that can generalize under such domain shifts and overcome the challenges preventing robots from completing tasks in diverse everyday settings.