<! $Id: nbest-optimize.1,v 1.31 2019/09/09 22:35:37 stolcke Exp $>
<HTML>
<HEADER>
<TITLE>nbest-optimize</TITLE>
<BODY>
<H1>nbest-optimize</H1>
<H2> NAME </H2>
nbest-optimize - optimize score combination for N-best word error minimization
<H2> SYNOPSIS </H2>
<PRE>
<B>nbest-optimize</B> [ <B>-help</B> ] <I>option</I> ... [ <I>scoredir</I> ... ]
</PRE>
<H2> DESCRIPTION </H2>
<B> nbest-optimize </B>
reads a set of N-best lists, additional score files, and corresponding 
reference transcripts and optimizes the score combination weights
so as to minimize the word error of a classifier that performs
word-level posterior probability maximization.
The optimized weights are meant to be used with
<A HREF="nbest-lattice.1.html">nbest-lattice(1)</A>
and the
<B> -use-mesh </B>
option,
or the 
<B> nbest-rover </B>
script (see
<A HREF="nbest-scripts.1.html">nbest-scripts(1)</A>).
<B> nbest-optimize </B>
determines both the best relative weighting of knowledge source scores
and the optimal 
<B> -posterior-scale </B>
parameter that controls the peakedness of the posterior distribution.
<P>
Alternatively,
<B> nbest-optimize </B>
can also optimize weights for a standard, 1-best hypothesis rescoring that
selects entire (sentence) hypotheses
(<B>-1best</B><B></B><B></B>
option).
In this mode sentence-level error counts may be read from external files,
or computed on the fly from the reference strings.
The weights obtained are meant to be used for N-best list rescoring with
<B> rescore-reweight </B>
(see 
<A HREF="nbest-scripts.1.html">nbest-scripts(1)</A>).
<P>
A third optimization criterion is the BLEU score used in machine translation.
This also requires the associated scores to be read from external files.
<P>
One of three optimization algorithms are available:
<DL>
<DT>1.
<DD>
The default optimization method is gradient descent on a smoothed (sigmoidal)
approximation of the true 0/1 word error function (Katagiri et al. 1990).
Therefore, the result can only be expected to be a
<I> local </I>
minimum of the error surface.
(A more global search can be attempted by specifying different starting
points.)
Another approximation is that the error function is computed assuming a fixed
multiple alignment of all N-best hypotheses and the reference string,
which tends to slightly overestimate the true pairwise error between any 
single hypothesis and the reference.
<DT>2.
<DD>
An alternative search strategy uses a simplex-based "Amoeba" search on
the (non-smoothed) error function (Press et al. 1988).
The search is restarted multiple times to avoid local minima.
<DT>3.
<DD>
A third algorithm uses Powell search (Press et al. 1988)
on the (non-smoothed) error function.
</DD>
</DL>
<H2> OPTIONS </H2>
<P>
Each filename argument can be an ASCII file, or a 
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
<DL>
<DT><B> -help </B>
<DD>
Print option summary.
<DT><B> -version </B>
<DD>
Print version information.
<DT><B>-debug</B><I> level</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Controls the amount of output (the higher the
<I>level</I>,<I></I><I></I><I></I>
the more).
At level 1, error statistics at each iteration are printed.
At level 2, word alignments are printed.
At level 3, full score matrix is printed.
At level 4, detailed information about word hypothesis ranking is printed
for each training iteration and sample.
<DT><B>-nbest-files</B><I> file-list</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Specifies the set of N-best files as a list of filenames.
Three sets of standard scores are extracted from the N-best files:
the acoustic model score, the language model score, and the number of 
words (for insertion penalty computation).
See 
<A HREF="nbest-format.5.html">nbest-format(5)</A>
for details.
<BR>
In BLEU optimization mode, since there is no acoustic score, the
position
of the first score is taken by the "ac-replacement" score, which can be
any score used by the machine translation system.
A typical example is a score measuring
word order distortion between the source and target languages.
<DT><B>-xval-files</B><I> file-list</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Use the N-best files listed in 
<I> file-list </I>
for cross-validation.
Optimization stops once no improvement is found on the cross-validation set
(while the direction of the search is obtained from the training set).
This is only supported with the gradient-descent and Amoeba optimization
methods.
<DT><B>-srinterp-format</B><I></I><B></B><I></I><B></B><I></I><B></B>
<DD>
Parse n-best list in SRInterp format, which has 
features and text in the same line. n-best optimize will also generate
rover-control file in SRInterp format, where each line is in the form of:
<PRE>
	<I>F1=V1</I> <I>F2=V2</I> ... <I>Fm=Vm</I> <I>W1</I> <I>W2</I> ...<I>Wn</I>
</PRE>
where
<I> F1 </I>
through
<I> Fm </I>
are feature names,
<I> V1 </I>
through
<I> Vm </I>
are feature values,
<I> W1 </I>
through
<I> Wn </I>
are words.
Also generate SRInterp control file, in the format of:
<PRE>
	<I>F1:S1</I> <I>F2:S2</I> ... <I>Fm:Sm</I>
</PRE>
where 
<I> S1 </I>
through
<I> Sm </I>
are scaling factors (weights) for feature 
<I> F1 </I>
through
<I> Fm </I>
.
<DT><B>-refs</B><I> references</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Specifies the reference transcripts.
Each line in 
<I> references </I>
must contain the sentence ID (the last component in the N-best filename
path, minus any suffixes) followed by zero or more reference words.
<DT><B>-insertion-weight</B><I> W</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Weight insertion errors by a factor 
<I>W</I>.<I></I><I></I><I></I>
This may be useful to optimize for keyword spotting tasks where
insertions have a cost different from deletion and substitution errors.
<DT><B>-word-weights</B><I> file</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Read a table of words and weights from
<I>file</I>.<I></I><I></I><I></I>
Each word error is weighted according to the word-specific weight.
The default weight is 1, and used if a word has no specified weight.
Also, when this option is used, substitution errors are counted 
as the sum of a deletion and an insertion error, as opposed to counting
as 1 error as in traditional word error computation.
<DT><B>-anti-refs</B><I> file</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Read a file of "anti-references" for use with the 
<B> -anti-ref-weight </B>
option (see below).
<DT><B>-anti-ref-weight</B><I> W</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Compute the hypothesis errors for 1-best optimization by adding the
edit distance with respect to the "anti-references" times the weight
<I> W </I>
to the regular error count.
If 
<I> W </I>
is negative this will tend to generate hypotheses that are different from
the anti-references (hence the name).
<DT><B> -1best </B>
<DD>
Select optimization for standard sentence-level hypothesis selection.
<DT><B> -1best-first </B>
<DD>
Optimized first using 
<B> -1best </B>
mode, then switch to full optimization.
This is an effective way to quickly bring the score weights near an
optimal point, and then fine-tune them jointly with the posterior scale
parameter.
<DT><B>-errors</B><I> dir</I><B></B><I></I><B></B><I></I><B></B>
<DD>
In 1-best mode, optimize for error counts that are stored in separate files
in directory
<I>dir</I>.<I></I><I></I><I></I>
Each N-best list must have a matching error counts file of the same 
basename in 
<I>dir</I>.<I></I><I></I><I></I>
Each file contains 7 columns of numbers in the format
<PRE>
	<I>wcr</I> <I>wer</I> <I>nsub</I> <I>ndel</I> <I>nins</I> <I>nerr</I> <I>nw</I>
</PRE>
Only the last two columns (number of errors and words, respectively) are used.
<BR>
If this option is omitted, errors will be computed from the N-best hypotheses
and the reference transcripts.
<DT><B>-bleu-counts</B><I> dir</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Perform BLEU optimization, reading BLEU reference counts from directory 
<I>dir</I>.<I></I><I></I><I></I>
Each N-best list must have a matching counts file of the same 
basename in 
<I>dir</I>,<I></I><I></I><I></I>
containing the following information:
<PRE>
	<I>N</I> <I>M</I> <I>L1</I> ... <I>LM</I>
</PRE>
where
<I> N </I>
is the number of hypotheses in the N-best list,
<I> M </I>
is the number of references for the utterance,
and 
<I> L1 </I>
through
<I> LM </I>
are the reference lengths (word counts) for each reference.
Following this line, there are 
<I> N </I>
lines of the form
<PRE>
	<I>K</I> <I>C1</I> <I>C2</I> ... <I>Cm</I>
</PRE>
where 
<I> K </I>
is the number of words in the hypothesis and
<I> C1 </I>
through
<I> Cm </I>
are the N-gram counts occurring in the references for each N-gram order
1, ...,
<I>m</I>.<I></I><I></I><I></I>
Currently,
<I> m </I>
is limited to 4.
<DT><B> -minimum-bleu-reference </B>
<DD>
Use shortest reference length to compute the BLEU brevity penalty.
<DT><B> -closest-bleu-reference </B>
<DD>
Use closest reference length for each translation hypothesis to compute
the BLEU brevity penalty.
<DT><B> -average-bleu-reference </B>
<DD>
Use average reference length to compute the BLEU brevity penalty.
<DT><B> -error-bleu-ratio  R </B>
<DD>
Specifies the weight of error rate when combined with BLEU as optimization
objective: <I>(1-BLEU) + ERR x R</I>. 
<I> ERR </I>
is error rate computed by #errors/#references.
<DT><B>-max-nbest</B><I> n</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Limits the number of hypotheses read from each N-best list to the first
<I>n</I>.<I></I><I></I><I></I>
<DT><B>-rescore-lmw</B><I> lmw</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Sets the language model weight used in combining the language model log
probabilities with acoustic log probabilities.
This is used to compute initial aggregate hypotheses scores.
<DT><B>-rescore-wtw</B><I> wtw</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Sets the word transition weight used to weight the number of words relative to
the acoustic log probabilities.
This is used to compute initial aggregate hypotheses scores.
<DT><B>-posterior-scale</B><I> scale</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Initial value for scaling log posteriors.
The total weighted log score is divided by 
<I> scale </I>
when computing normalized posterior probabilities.
This controls the peakedness of the posterior distribution. 
The default value is whatever was chosen for 
<B>-rescore-lmw</B>,<B></B><B></B><B></B>
so that language model scores are scaled to have weight 1,
and acoustic scores have weight 1/<I>lmw</I>.
<DT><B> -combine-linear </B>
<DD>
Compute aggregate scores by linear combination, rather than log-linear
combination.
(This is appropriate if the input scores represent log-posterior probabilities.)
<DT><B> -non-negative </B>
<DD>
Constrain search to non-negative weight values.
<DT><B>-vocab</B><I> file</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Read the N-best list vocabulary from 
<I>file</I>.<I></I><I></I><I></I>
This option is mostly redundant since words found in the N-best input
are implicitly added to the vocabulary.
<DT><B> -tolower </B>
<DD>
Map vocabulary to lowercase, eliminating case distinctions.
<DT><B> -multiwords </B>
<DD>
Split multiwords (words joined by '_') into their components when reading
N-best lists.
<DT><B>-multi-char</B><I> C</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Character used to delimit component words in multiwords
(an underscore character by default).
<DT><B> -no-reorder </B>
<DD>
Do not reorder the hypotheses for alignment, and start the alignment with
the reference words.
The default is to first align hypotheses by order of decreasing scores
(according to the initial score weighting) and then the reference,
which is more compatible with how 
<A HREF="nbest-lattice.1.html">nbest-lattice(1)</A>
operates.
<DT><B>-noise</B><I> noise-tag</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Designate
<I> noise-tag </I>
as a vocabulary item that is to be ignored in aligning hypotheses with
each other (the same as the -pau- word).
This is typically used to identify a noise marker.
<DT><B>-noise-vocab</B><I> file</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Read several noise tags from
<I>file</I>,<I></I><I></I><I></I>
instead of, or in addition to, the single noise tag specified by
<B>-noise</B>.<B></B><B></B><B></B>
<DT><B>-hidden-vocab</B> file<B></B><B></B><B></B>
<DD>
Read a subvocabulary from
<I> file </I>
and constrain word alignments to only group those words that are either all
in or outside the subvocabulary.
This may be used to keep ``hidden event'' tags from aligning with
regular words.
<DT><B>-dictionary</B> file<B></B><B></B><B></B>
<DD>
Use word pronunciations listed in 
<I> file </I>
to construct word alignments when building word meshes.
This will use an alignment cost function that reflects the number of
inserted/deleted/substituted phones, rather than words.
The dictionary 
<I> file </I>
should contain one pronunciation per line, each naming a word in the first
field, followed by a string of phone symbols.
<DT><B>-distances</B> file<B></B><B></B><B></B>
<DD>
Use the word distance matrix in 
<I> file </I>
as a cost function for word alignments.
Each line in 
<I> file </I>
defines a row of the distance matrix.
The first field contains the word that is the row index,
followed by one or more word/number pairs, where the word represents the 
column index and the number the distance value.
<DT><B>-init-lambdas</B><I> 'w1 w2 ...'</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Initialize the score weights to the values specified
(zeros are filled in for missing values).
The default is to set the initial acoustic model weight to 1,
the language model weight from
<B>-rescore-lmw</B>,<B></B><B></B><B></B>
the word transition weight from
<B>-rescore-wtw</B>,<B></B><B></B><B></B>
and all remaining weights to zero initially.
Prefixing a value with an equal sign (`=')
holds the value constant during optimization.
(All values should be enclosed in quotes to form a single command-line
argument.)
<BR>
Hypotheses are aligned using the initial weights; thus, it makes sense
to reoptimize with initial weights from a previous optimization in order
to obtain alignments closer to the optimimum.
<DT><B>-alpha</B><I> a</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Controls the error function smoothness; 
the sigmoid slope parameter is set to
<I>a</I>.<I></I><I></I><I></I>
<DT><B>-epsilon</B><I> e</I><B></B><I></I><B></B><I></I><B></B>
<DD>
The step-size used in gradient descent (the multiple of the gradient vector).
<DT><B>-min-loss</B><I> x</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Sets the loss function for a sample effectively to zero when its value falls
below 
<I>x</I>.<I></I><I></I><I></I>
<DT><B>-max-delta</B><I> d</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Ignores the contribution of a sample to the gradient if the derivative
exceeds
<I>d</I>.<I></I><I></I><I></I>
This helps avoid numerical problems.
<DT><B>-maxiters</B><I> m</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Stops optimization after 
<I> m </I>
iterations.
In Amoeba search, this limits the total number of points in the parameter space
that are evaluated.
<DT><B>-max-bad-iters</B> n<B></B><B></B><B></B>
<DD>
Stops optimization after 
<I> n </I>
iterations during which the actual (non-smoothed) error has not decreased.
<DT><B>-max-amoeba-restarts</B> r<B></B><B></B><B></B>
<DD>
Perform only up to
<I> r </I>
repeated Amoeba searches.
The default is to search until 
<I> D </I>
searches give the same results, where
<I> D </I>
is the dimensionality of the problem.
<DT><B>-max-time</B><I> T</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Abort search if new lower-error point isn't found in 
<I> T </I>
seconds.
<DT><B>-epsilon-stepdown</B><I> s</I><B></B><I></I><B></B><I></I><B></B>
<DD>
<DT><B>-min-epsilon</B><I> m</I><B></B><I></I><B></B><I></I><B></B>
<DD>
If 
<I> s </I>
is a value greater than zero, the learning rate will be multiplied by 
<I> s </I>
every time the error does not decrease after a number of iterations
specified by
<B>-max-bad-iters</B>.<B></B><B></B><B></B>
Training stops when the learning rate falls below
<I> m </I>
in this manner.
<DT><B>-converge</B><I> x</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Stops optimization when the (smoothed) loss function changes relatively by less 
than 
<I> x </I>
from one iteration to the next.
<DT><B> -quickprop </B>
<DD>
Use the approximate second-order method known as "QuickProp" (Fahlman 1989).
<DT><B>-init-amoeba-simplex</B><I> 's1 s2 ...'</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Perform Amoeba simplex search.
The argument defines the step size for the initial Amoeba simplex.
One value for each non-fixed search dimension should be specified,
plus optionally a value for the posterior scaling parameter
(which is searched as an added dimension).
<DT><B>-init-powell-range</B><I> 'a1</I><B>,</B><I>b1 a2</I><B>,</B><I>b2 ...'</I><B></B>
<DD>
Perform Powell search.
The argment initializes the weight ranges for Powell search.
One comma-separated pair of values for each search dimension should
be specified. For each dimension, if the upper bound equals lower bound
and initial lambda, that dimension will be fixed, even if not so specified by 
<B> -init-lambda . </B>
<DT><B>-num-powell-runs</B><I> N</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Sets the number of random runs for quick Powell grid search
(default value is 20).
<DT><B> -dynamic-random-series </B>
<DD>
Use time and process ID to initialize seed for pseudo random series used
in Powell search.
This will make results unrepeatable but may yield better results through
multiple trials.
<DT><B>-print-hyps</B><I> file</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Write the best word hypotheses to 
<I> file </I>
after optimization.
<DT><B>-print-top-n</B><I> N</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Write out the top 
<I> N </I>
rescored hypotheses.
In this case
<B> -print-hyps </B>
specifies a directory (not a file)
and one file per N-best list is generated.
<DT><B> -print-unique-hyps </B>
<DD>
Eliminate duplicate hypotheses when writing out N-best hypotheses.
<DT><B> -print-old-ranks </B>
<DD>
Output the original hypothesis ranks when writing out N-best hypotheses.
<DT><B> -compute-oracle </B>
<DD>
Find the lowest error rate or the highest BLEU score achievable by choosing
among all N-best hypotheses.
<DT><B>-print-oracle-hyps</B><I> file</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Print output oracle hyps to
<I>file</I>.<I></I><I></I><I></I>
<DT><B>-write-rover-control</B><I> file</I><B></B><I></I><B></B><I></I><B></B>
<DD>
Writes a control file for 
<B> nbest-rover </B>
to 
<I>file</I>,<I></I><I></I><I></I>
reflecting the names of the input directories and the optimized parameter
values.
The format of
<I> file </I>
is described in
<A HREF="nbest-scripts.1.html">nbest-scripts(1)</A>.
The file is rewritten for each new minimal error weight combination found.
<BR>
In BLEU optimization, the weight for the ac-replacement score will be written
in the place of the posterior scale,
since posterior scaling is not used in BLEU optimization.
<DT><B> -skipopt </B>
<DD>
Skip optimization altogether, such as when only the 
<B> -print-hyps </B>
function is to be exercised.
<DT><B> -- </B>
<DD>
Signals the end of options, such that following command-line arguments are 
interpreted as additional scorefiles even if they start with `-'.
<DT><I>scoredir</I> ...<I></I><I></I><I></I>
<DD>
Any additional arguments name directories containing further score files.
In each directory, there must exist one file named after the sentence 
ID it corresponds to (the file may also end in ``.gz'' and contain compressed
data).
The total number of score dimensions is thus 3 (for the standard scores from
the N-best list) plus the number of additional score directories specified.
</DD>
</DL>
<H2> SEE ALSO </H2>
<A HREF="nbest-lattice.1.html">nbest-lattice(1)</A>, <A HREF="nbest-scripts.1.html">nbest-scripts(1)</A>, <A HREF="nbest-format.5.html">nbest-format(5)</A>.
<BR>
S. Katagiri, C.H. Lee, &amp; B.-H. Juang, "A Generalized Probabilistic Descent
Method", in
<I>Proceedings of the Acoustical Society of Japan, Fall Meeting</I>,
pp. 141-142, 1990.
<BR>
S. E. Fahlman, "Faster-Learning Variations on Back-Propagation: An
Empirical Study", in D. Touretzky, G. Hinton, &amp; T. Sejnowski (eds.), 
<I>Proceedings of the 1988 Connectionist Models Summer School</I>, pp. 38-51,
Morgan Kaufmann, 1989.
<BR>
W. H. Press, B. P. Flannery, S. A. Teukolsky, &amp; W. T. Vetterling,
<I>Numerical Recipes in C: The Art of Scientific Computing</I>,
Cambridge University Press, 1988.
<BR>
<H2> BUGS </H2>
Gradient-based optimization is not supported (yet) in 1-best or BLEU mode
or in conjunction with the 
<B> -combine-linear </B>
or 
<B> -non-negative </B>
options;
use simplex or Powell search instead.
<BR>
The N-best directory in the control file output by
<B> -write-rover-control </B>
is inferred from the
first N-best filename specified with
<B>-nbest-files</B>,<B></B><B></B><B></B>
and will therefore only work if all N-best lists are placed in the same
directory.
<P>
The
<B> -insertion-weight </B>
and 
<B> -word-weights </B>
options only affect the word error computation, not the construction 
of hypothesis alignments. 
Also, they only apply to sausage-based, not 1-best error optimization.
(1-best errors may be explicitly specified using the 
<B> -errors </B>
option).
<P>
The 
<B> -anti-refs </B>
and
<B> -anti-ref-weight </B>
options do not work for sausage-based or BLEU optimization.
<H2> AUTHORS </H2>
Andreas Stolcke &lt;andreas.stolcke@microsoft.com&gt;,
Dimitra Vergyri &lt;dverg@speech.sri.com&gt;,
Jing Zheng &lt;zj@speech.sri.com&gt;
<BR>
Copyright (c) 2000-2012 SRI International
<BR>
Copyright (c) 2012-2016 Andreas Stolcke
<BR>
Copyright (c) 2012-2016 Microsoft Corp.
</BODY>
</HTML>
