Explore the Potential Performance of Vision-and-Language Navigation Model: a Snapshot Ensemble Method
Abstract: Given an instruction in natural language, the vision-and-language navigation (VLN) task requires a navigation model to match the instruction to its visual surroundings and then move to the correct destination. It has been difficult to build VLN models that generalize as well as humans. In this paper, we provide a new perspective that accommodates the potential variety of interpretations of verbal instructions. We find that snapshots of a VLN model, i.e., versions of the model whose parameters were saved at different points during training, behave significantly differently even when their navigation success rates are almost the same. We therefore propose a snapshot-based ensemble solution that leverages the predictions of multiple snapshots. Our approach is effective and generalizable, and it can also ensemble snapshots taken from different models. Built on mixed snapshots of the existing state-of-the-art (SOTA) RecBERT and HAMT models, our proposed ensemble achieves new SOTA performance in the R2R Dataset Challenge in the single-run setting.
Paper Type: long
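To make the idea in the abstract concrete, the following is a minimal sketch of how predictions from several snapshots might be combined at a single navigation step. The abstract does not state how the snapshots' predictions are fused, so the rule used here (averaging each snapshot's softmax distribution over candidate actions and taking the argmax) is an assumption, and the names `softmax` and `ensemble_action` are hypothetical rather than taken from the paper.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw action scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_action(snapshot_logits: Sequence[Sequence[float]]) -> int:
    """Pick one action from the scores of several snapshots.

    Assumption (not from the paper): each snapshot scores the same list of
    candidate actions, and the distributions are fused by a plain average.
    """
    if not snapshot_logits:
        raise ValueError("need at least one snapshot")
    n_actions = len(snapshot_logits[0])
    if any(len(row) != n_actions for row in snapshot_logits):
        raise ValueError("all snapshots must score the same actions")

    probs = [softmax(row) for row in snapshot_logits]
    averaged = [
        sum(p[a] for p in probs) / len(probs) for a in range(n_actions)
    ]
    # Index of the action with the highest averaged probability.
    return max(range(n_actions), key=averaged.__getitem__)


if __name__ == "__main__":
    # Three hypothetical snapshots scoring two candidate actions.
    scores = [[1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
    print(ensemble_action(scores))
```

In this sketch a snapshot that disagrees with the others still contributes its probability mass, so a confident minority can outweigh a hesitant majority; a hard majority vote would be an equally plausible alternative under the same assumptions.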