Subtle Topic Models and Discovering Subtly Manifested Software Concerns Automatically

Mrinal Das, Suparna Bhattacharya, Chiranjib Bhattacharyya, K. Gopinath

Published: 31 May 2013, Last Modified: 30 Mar 2026; Venue: ICML; Readers: Everyone; License: arXiv.org perpetual, non-exclusive license

Abstract: In a recent pioneering approach, LDA was used to discover cross-cutting concerns (CCCs) automatically from software codebases. LDA, though successful in detecting prominent concerns, fails to detect many useful CCCs, including ones that may be heavily executed but elude discovery because they do not have a strong prevalence in source code. We pose this problem as that of discovering topics that rarely occur in individual documents, which we refer to as subtle topics. Recently, an interesting approach, namely focused topic models (FTM), was proposed in (Williamson et al., 2010) for detecting rare topics. FTM, though successful in detecting topics that occur prominently in very few documents, is unable to detect subtle topics. Discovering subtle topics thus remains an important open problem. To address this issue, we propose subtle topic models (STM). STM uses a generalized stick-breaking process (GSBP) as a prior for defining multiple distributions over topics. This hierarchical structure on topics allows STM to discover rare topics beyond the capabilities of FTM. The associated inference is non-standard and is solved by exploiting the relationship between GSBP and the generalized Dirichlet distribution. Empirical results show that STM is able to discover subtle CCCs in two benchmark codebases, a feat that is beyond the scope of existing topic models, thus demonstrating the potential of the model in automated concern discovery, a known difficult problem in software engineering. Furthermore, it is observed that even in general text corpora STM outperforms the state of the art in discovering subtle topics.
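
Illustrative note: the abstract does not give the GSBP parameterization, so the following is only a minimal sketch of a generic truncated stick-breaking draw, not the authors' model. It assumes independent stick proportions v_k ~ Beta(a_k, b_k); under that assumption the resulting weights follow a generalized Dirichlet distribution (the Connor-Mosimann construction), which is the kind of relationship the abstract refers to. The function name, parameter values, and truncation level are hypothetical.

```python
import random


def stick_breaking_weights(a, b, seed=0):
    """Draw truncated stick-breaking weights with v_k ~ Beta(a[k], b[k]).

    Assumption: independent Beta sticks; this is not taken from the paper.
    Returns len(a) + 1 non-negative weights that sum to 1.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to any topic
    for a_k, b_k in zip(a, b):
        v_k = rng.betavariate(a_k, b_k)  # fraction broken off the remainder
        weights.append(v_k * remaining)
        remaining *= 1.0 - v_k
    weights.append(remaining)  # leftover mass goes to the final component
    return weights


# Hypothetical parameters: three sticks, hence four topic weights.
weights = stick_breaking_weights([1.0, 1.0, 1.0], [2.0, 2.0, 2.0])
```

Setting every a_k to 1 and every b_k to a common concentration value recovers the familiar stick-breaking construction truncated at a finite number of components; letting the pairs (a_k, b_k) vary gives the extra flexibility over the topic weights.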