Abstract: Maintaining modern software requires significant tool
support. Effective tools exploit a variety of information and
techniques to aid a software maintainer. One area of recent
interest in tool development exploits the natural language
information found in source code. Such Information Retrieval (IR) based tools complement traditional static analysis tools and have tackled problems, such as feature location, that otherwise require considerable human effort. To
reap the full benefit of IR-based techniques, the language
used across all software artifacts (e.g., requirements, design, change requests, tests, and source code) must be consistent. Unfortunately, there is a significant proportion of
invented vocabulary in source code. Vocabulary normalization aligns the vocabulary found in the source code with that
found in other software artifacts. Most existing work related
to normalization has focused on splitting an identifier into
its constituent parts. The next step is to expand each part
into a (dictionary) word that matches the vocabulary used
in other software artifacts.
Building on a successful approach to splitting identifiers,
an implementation of an expansion algorithm is presented.
Experiments on two systems find that up to 66% of identifiers are correctly expanded, which is within about 20%
of the current system’s best-case performance. Not only
is this performance comparable to previous techniques, but
the result is achieved without special-purpose rules
and is not limited to restricted syntactic contexts. Results from
these experiments also show the impact that varying levels
of documentation (including both internal documentation
such as the requirements and design, and external, or user-level, documentation) have on the algorithm’s performance.
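As a rough illustration of the two normalization steps described above (splitting an identifier into parts, then expanding each part into a dictionary word), the following minimal sketch may help. It is not the algorithm presented in the paper: the abbreviation table EXPANSIONS and the function names are hypothetical, and the split is a simple underscore and camelCase heuristic chosen only for illustration.

```python
import re

# Hypothetical abbreviation table (an assumption for illustration only);
# a real tool would draw candidate words from other software artifacts.
EXPANSIONS = {"str": "string", "len": "length", "cnt": "count", "idx": "index"}


def split_identifier(identifier):
    """Split an identifier on underscores and camelCase boundaries."""
    parts = []
    for chunk in identifier.split("_"):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [part.lower() for part in parts]


def expand_parts(parts):
    """Replace each known abbreviation with its expansion; keep other parts as-is."""
    return [EXPANSIONS.get(part, part) for part in parts]


def normalize(identifier):
    """Split, then expand: the two steps of vocabulary normalization."""
    return expand_parts(split_identifier(identifier))


if __name__ == "__main__":
    for name in ("strLen", "max_cnt", "HTMLParserIdx"):
        print(name, "->", normalize(name))
```

A fixed lookup table cannot choose between competing expansions of the same abbreviation, which is the kind of ambiguity that makes expansion harder than splitting.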