Abstract: Highlights
• A benchmark for cross-dataset/modality evaluation of abdominal segmentation.
• Systematic evaluation of generalization across CT/MRI using 7 datasets.
• Studies unlabeled data, multi-modality and joint training for generalization.
• Studies different model backbones and scales for generalization.