Subtle Topic Models and Discovering Subtly Manifested Software Concerns Automatically
Abstract: In a recent pioneering approach, latent Dirichlet allocation (LDA) was
used to discover cross-cutting concerns (CCCs)
automatically from software codebases. LDA,
though successful in detecting prominent
concerns, fails to detect many useful CCCs,
including ones that may be heavily executed
but elude discovery because they are not
strongly prevalent in the source code. We pose
this problem as that of discovering topics
that rarely occur in individual documents,
which we refer to as subtle topics. Recently,
an interesting approach, namely focused
topic models (FTM), was proposed by
Williamson et al. (2010) for detecting rare
topics. FTM, though successful in detecting
topics which occur prominently in very few
documents, is unable to detect subtle topics.
Discovering subtle topics thus remains
an important open problem. To address this
issue, we propose subtle topic models (STM).
STM uses a generalized stick-breaking process
(GSBP) as a prior for defining multiple
distributions over topics. This hierarchical
structure on topics allows STM to discover
rare topics beyond the capabilities of FTM.
The associated inference is non-standard and
is solved by exploiting the relationship between
the GSBP and the generalized Dirichlet distribution.
Empirical results show that STM is
able to discover subtle CCCs in two benchmark
codebases, a feat beyond the
scope of existing topic models, thus demonstrating
the potential of the model in automated
concern discovery, a known difficult
problem in software engineering. Furthermore,
even in general text
corpora, STM outperforms the state of the art in
discovering subtle topics.
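
As a rough illustration of the kind of prior mentioned above, the following sketch draws topic weights from a truncated stick-breaking construction in which each break fraction has its own Beta distribution. With independent Beta fractions, the resulting finite weight vector follows a generalized Dirichlet distribution, which is the sort of relationship the abstract alludes to. This is a generic textbook construction, not the STM prior itself: the function name, the truncation, and the Beta parameters are all assumptions made for illustration.

```python
import random


def stick_breaking_weights(a, b, rng):
    """Draw weights from a truncated stick-breaking construction.

    a, b: per-break Beta parameters (hypothetical values, not from the paper).
    Returns len(a) + 1 non-negative weights that sum to 1.
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for a_k, b_k in zip(a, b):
        v_k = rng.betavariate(a_k, b_k)  # fraction of the remaining stick
        weights.append(remaining * v_k)
        remaining *= 1.0 - v_k
    weights.append(remaining)  # the final piece takes whatever is left
    return weights


# Illustrative draw with arbitrary parameters and a fixed seed.
rng = random.Random(0)
weights = stick_breaking_weights([1.0, 1.0, 1.0], [2.0, 2.0, 2.0], rng)
```

Giving each break its own pair of Beta parameters is what distinguishes this from the single-parameter stick-breaking behind the Dirichlet process. How STM uses the construction to define multiple distributions over topics is not specified in this abstract and is not modeled here.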