Concentration Bounds for Unigrams Language Model

Evgeny Drukh, Yishay Mansour

COLT 2004 (modified: 14 Jan 2021)

Abstract: We show several PAC-style concentration bounds for learning a unigrams language model. One quantity of interest is the total probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis of its error shows a PAC bound of approximately \(O(\frac{k}{\sqrt{m}})\). We improve its dependency on k to \(O(\frac{\sqrt[4]{k}}{\sqrt{m}}+\frac{k}{m})\). We also analyze the empirical frequencies estimator, showing that its PAC error bound is approximately \(O(\frac{1}{k}+\frac{\sqrt{k}}{m})\). We derive a combined estimator, which has an error of approximately \(O(m^{-\frac{2}{5}})\), for any k.
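
As a rough illustration of the two estimators named in the abstract, the sketch below computes the textbook Good-Turing estimate \((k+1)\,n_{k+1}/m\) and the empirical frequencies estimate \(k\,n_k/m\), where \(n_k\) is the number of distinct words seen exactly k times. The function names and the toy sample are hypothetical, and the normalization used in the paper itself may differ slightly from this textbook form; the combined estimator is not reproduced here.

```python
from collections import Counter


def count_of_counts(sample):
    """Map each count k to n_k, the number of distinct words seen exactly k times."""
    word_counts = Counter(sample)
    return Counter(word_counts.values())


def good_turing_mass(sample, k):
    """Textbook Good-Turing estimate of the total probability of words seen k times.

    Assumption: uses (k + 1) * n_{k+1} / m; the paper may normalize differently.
    """
    m = len(sample)
    if m == 0:
        raise ValueError("sample must be non-empty")
    n = count_of_counts(sample)
    return (k + 1) * n[k + 1] / m


def empirical_mass(sample, k):
    """Empirical frequencies estimate: each word seen k times gets probability k / m."""
    m = len(sample)
    if m == 0:
        raise ValueError("sample must be non-empty")
    n = count_of_counts(sample)
    return k * n[k] / m


# Hypothetical toy sample with m = 10 tokens.
sample = "a a a b b c c d e f".split()
for k in range(0, 4):
    print(k, good_turing_mass(sample, k), empirical_mass(sample, k))
```

Note that for k = 0 the empirical estimate is always zero, while the Good-Turing estimate assigns mass to unseen words; this matches the abstract's \(\frac{1}{k}\) term, which makes the empirical estimator attractive only for larger k.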
