Kern: A Labeling Environment for Large-Scale, High-Quality Training Data

Johannes Hötter, Henrik Wenck, Moritz Feuerpfeil, Simon Witzke

Published: 2022, Last Modified: 21 May 2024, Venue: NLDB 2022, License: CC BY-SA 4.0

Abstract: The lack of large-scale, high-quality training data is a significant bottleneck in supervised learning. We introduce kern, a labeling environment in which machine learning experts and subject matter experts create training data and find manual labeling errors, powered by weak supervision, active transfer learning, and confident learning. We describe the current workflow, give a system overview, and showcase the benefits of our system in an intent classification experiment, where we reduce the labeling error rate of a given dataset by 4.9 percentage points (absolute) while improving the \(F_1\) score of a baseline classifier by a total of 9.7%.
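To illustrate how confident learning can surface manual labeling errors, the following is a minimal sketch of the general idea, not kern's implementation. It assumes out-of-sample predicted class probabilities are already available for each record; the function name `find_label_issues` and the toy data are hypothetical. A record is flagged when the class it confidently belongs to (its probability meets that class's average self-confidence) differs from the given label.

```python
def find_label_issues(labels, pred_probs):
    """Return indices of records whose given label looks wrong.

    labels:     list of integer class ids, one per record
    pred_probs: list of per-record probability lists (one entry per class)
    Assumes every class has at least one record labeled with it.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class k
    # over the records that were labeled k.
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(own) / len(own))

    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model is confident about for this record.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # no confident class, so no evidence of an error
        best = max(confident, key=lambda k: p[k])
        if best != y:
            issues.append(i)
    return issues


# Toy example: the third record is labeled 0 but the model strongly predicts 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
print(find_label_issues(labels, pred_probs))
```

Flagged records are candidates for manual review rather than confirmed errors; in practice a subject matter expert would inspect them before any label is changed.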