Keywords: reinforce, multiple samples, baseline, sequence generation, structured prediction, travelling salesman problem
TL;DR: We show that by drawing multiple samples (predictions) per input (datapoint), we can learn with less data as we freely obtain a REINFORCE baseline.
Abstract: REINFORCE can be used to train models in structured prediction settings to directly optimize the test-time objective. However, the common case of sampling one prediction per datapoint (input) is data-inefficient. We show that by drawing multiple samples (predictions) per datapoint, we can learn with significantly less data, as we freely obtain a REINFORCE baseline to reduce variance. Additionally we derive a REINFORCE estimator with baseline, based on sampling without replacement. Combined with a recent technique to sample sequences without replacement using Stochastic Beam Search, this improves the training procedure for a sequence model that predicts the solution to the Travelling Salesman Problem.