Abstract: Decoding methods for large language models often involve a trade-off between diversity of outputs and parallelism of computation. Methods such
as beam search and Gumbel top-k sampling
can guarantee a different output for each
element of the beam, but are not easy to
parallelize. Alternatively, methods such as temperature sampling and its modifications (top-k sampling, nucleus sampling, typical decoding, and others) are embarrassingly parallel, but provide no guarantees against duplicate samples. We
present a framework for sampling according to an arithmetic code book implicitly defined by a large language model. The framework is compatible with common sampling variations, offers provable beam diversity under certain conditions, is embarrassingly parallel, and yields unbiased and consistent estimates of expectations under the original model. We demonstrate the effectiveness of
our approach on WMT machine translation,
more than halving the standard deviation when
estimating expected BLEU score reward, and
closing the BLEU score gap between independent
sampling and beam search by up to 63%.
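Below is a minimal sketch of the codebook idea described in the abstract: each sample is identified with a code point in [0, 1), and a sequence is decoded by repeatedly locating that point inside the model's next-token CDF, arithmetic-coding style. The names `sample_from_code_point`, `step_probs`, `eos_id`, and `max_len` are illustrative placeholders rather than anything from the paper, and the evenly spaced code points with a shared random offset are shown as one natural choice for obtaining a diverse, parallel-decodable beam.

```python
import numpy as np

def sample_from_code_point(code, step_probs, eos_id, max_len=128):
    """Decode one code point u in [0, 1) into a token sequence by locating u
    inside the next-token CDF at every step (arithmetic-coding style).
    `step_probs(prefix)` is a stand-in for the language model: it is assumed
    to return the next-token probability vector given the current prefix."""
    prefix, u = [], code
    for _ in range(max_len):
        probs = step_probs(prefix)
        cdf = np.cumsum(probs)
        token = int(np.searchsorted(cdf, u, side="right"))
        token = min(token, len(probs) - 1)  # guard against rounding at the top edge
        left = cdf[token - 1] if token > 0 else 0.0
        # Rescale the code point into the chosen token's sub-interval so the
        # next step again works on [0, 1); this avoids interval underflow.
        u = (u - left) / max(probs[token], 1e-12)
        prefix.append(token)
        if token == eos_id:
            break
    return prefix

# Evenly spaced code points with a shared random offset give a "beam" of
# samples that can be decoded fully in parallel and tend not to collide.
beam_size = 4
offset = np.random.uniform(0.0, 1.0 / beam_size)
codes = offset + np.arange(beam_size) / beam_size
```

Because each code point is decoded independently of the others, the loop above can be run in parallel across the beam, while the spacing of the code points is what discourages duplicate samples.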