Buy 4 REINFORCE Samples, Get a Baseline for Free!

Wouter Kool, Herke van Hoof, Max Welling

Mar 16, 2019 ICLR 2019 Workshop drlStructPred Blind Submission
  • Keywords: reinforce, multiple samples, baseline, sequence generation, structured prediction, travelling salesman problem
  • TL;DR: We show that by drawing multiple samples (predictions) per input (datapoint), we can learn with less data as we freely obtain a REINFORCE baseline.
  • Abstract: REINFORCE can be used to train models in structured prediction settings to directly optimize the test-time objective. However, the common case of sampling one prediction per datapoint (input) is data-inefficient. We show that by drawing multiple samples (predictions) per datapoint, we can learn with significantly less data, as we freely obtain a REINFORCE baseline to reduce variance. Additionally, we derive a REINFORCE estimator with baseline, based on sampling without replacement. Combined with a recent technique to sample sequences without replacement using Stochastic Beam Search, this improves the training procedure for a sequence model that predicts the solution to the Travelling Salesman Problem.
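
To illustrate the idea of a "free" baseline from multiple samples, here is a minimal sketch in plain Python. It assumes the k samples for one datapoint are drawn independently (with replacement) and uses a leave-one-out baseline: each sample's baseline is the mean reward of the other k-1 samples. The function names and the list-based gradient representation are hypothetical, chosen for illustration only, and this sketch does not implement the without-replacement estimator mentioned in the abstract.

```python
def leave_one_out_advantages(rewards):
    """Return r_i minus the mean reward of the other k-1 samples.

    Assumption: rewards come from k >= 2 independent samples
    (with replacement) for the same datapoint.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two samples per datapoint")
    total = sum(rewards)
    # Baseline for sample i is (total - r_i) / (k - 1), so no extra
    # model evaluations are needed beyond the k samples themselves.
    return [r - (total - r) / (k - 1) for r in rewards]


def reinforce_gradient(rewards, grad_log_probs):
    """Average of advantage_i * grad log p(sample_i) over the k samples.

    grad_log_probs is a hypothetical list of k gradient vectors
    (lists of floats), one per sample.
    """
    advantages = leave_one_out_advantages(rewards)
    k = len(rewards)
    dim = len(grad_log_probs[0])
    grad = [0.0] * dim
    for a, g in zip(advantages, grad_log_probs):
        for d in range(dim):
            grad[d] += a * g[d] / k
    return grad
```

For example, with four samples and rewards `[1.0, 2.0, 3.0, 6.0]`, the advantages sum to zero, and a datapoint whose samples all receive the same reward contributes no gradient. In this sketch a single sample per datapoint gives no baseline at all, which is why the function rejects k < 2.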