Abstract: We introduce an algorithm to estimate the evolution of accuracy in part-of-speech tagging over the whole of a training corpus, based on the results obtained from a portion of it. The technique iteratively approximates the value that we seek at the desired position, independently of the statistical model and dataset used. The process proves to be formally correct with respect to our working specifications and includes a stable stopping criterion. This allows the user to fix a reliable convergence threshold with respect to the accuracy finally achievable.
Our aim is to evaluate the training effort, supporting decision making in order to reduce the need for both human and computational resources during tagger generation. The proposal is of interest in at least three operational procedures. The first is the anticipation of accuracy gain, with the purpose of measuring how much work is needed to achieve a certain level of performance. The second is the comparison between taggers at training time, so that training is completed only for the tool predicted to best suit our requirements. The third is the customization of the tagger, for example the selection of the tag-set, where the prediction of accuracy is a valuable item of information because it lets us estimate in advance the impact on both performance and development costs. The experiments corroborate our initial expectations.