A dataset for evaluating identifier splitters

David W. Binkley, Dawn J. Lawrie, Lori L. Pollock, Emily Hill, K. Vijay-Shanker

Published: 2013, Last Modified: 12 Jan 2026; MSR 2013; CC BY-SA 4.0

Abstract: Software engineering and evolution techniques have recently started to exploit the natural language information in source code. A key step in doing so is splitting identifiers into their constituent words. While simple in concept, identifier splitting raises several challenging issues, leading to a range of splitting techniques. Consequently, the research community would benefit from a dataset (i.e., a gold set) that facilitates comparative studies of identifier splitting techniques. A gold set of 2,663 split identifiers was constructed from 8,522 individual human splitting judgements and can be obtained from www.cs.loyola.edu/~binkley/ludiso. This set's construction and observations aimed at its effective use are described.
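To make the idea of a comparative study concrete, the sketch below scores a naive splitter against a tiny gold set. It is illustrative only: the splitting rule (underscores, digits, and case changes), the names `split_identifier`, `accuracy`, and `toy_gold`, and the sample entries are all assumptions and are not taken from the dataset, whose file format is not described here. Exact-match accuracy is one plausible measure, not necessarily the one the authors use.

```python
import re

# Hypothetical baseline: split on underscores, digits, and case changes.
# An uppercase run followed by a capitalized word (e.g. "XMLFile") is
# treated as an acronym plus a word.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier):
    """Return the lowercased tokens found in an identifier."""
    return [token.lower() for token in _TOKEN.findall(identifier)]


def accuracy(splitter, gold):
    """Fraction of identifiers whose predicted split exactly matches the gold split."""
    if not gold:
        return 0.0
    hits = sum(1 for ident, expected in gold.items() if splitter(ident) == expected)
    return hits / len(gold)


# Made-up entries for illustration; not drawn from the actual gold set.
toy_gold = {
    "getMaxValue": ["get", "max", "value"],
    "parse_XMLFile": ["parse", "xml", "file"],
    "usrname": ["usr", "name"],  # same-case split: no case or separator cue
}

if __name__ == "__main__":
    for ident in toy_gold:
        print(ident, "->", split_identifier(ident))
    print("exact-match accuracy:", accuracy(split_identifier, toy_gold))
```

The last toy entry shows why simple rules fall short: an identifier with no case change or separator gives a convention-based splitter nothing to work with, which is the kind of case where human judgements are needed as a reference.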

External IDs: dblp:conf/msr/BinkleyLPHV13