Abstract: Use of synthetic data is rapidly emerging as
a realistic alternative to manually annotating
real data for industry-scale model building.
Manual data annotation is slow, expensive, and
poorly suited to meeting customer privacy expectations. Further, commercial natural language applications must support continuously evolving features as well as newly
added experiences. To address these requirements, we propose a targeted synthetic data
generation technique that inserts tokens into a
given semantic signature. The generated data
are used as additional training samples in the
tasks of intent classification and named entity
recognition. We evaluate on a real-world voice
assistant dataset: using only 33% of the
available training set, we achieve the same accuracy as training with all available data. We also analyze the effects of data generation
across varied real-world applications and propose heuristics that further improve task performance.