This thesis describes a study performed in an industrial setting that builds predictive models to identify parts of a Java system with a high fault probability. The system under consideration is constantly evolving, as several releases a year are shipped to customers. Developers usually have limited resources for testing, so our aim was to build optimal and practically useful fault-proneness prediction models to help focus verification and validation activities on the most fault-prone components of this system.
This thesis starts with a literature review that provides a detailed discussion of the state of the art of research on fault-proneness prediction models. The review revealed that a vast number of modeling techniques have been used to build such prediction models. However, there has been little systematic effort to assess the impact of selecting a particular modeling technique. Furthermore, there has been no systematic study of the impact of including certain alternative types of measures as predictors. Finally, many studies apply evaluation methods and model assessment criteria that, depending on the intended use of the prediction model, may be insufficient or even inappropriate. Consequently, the main research focus of this thesis is to systematically assess three aspects of how to build and evaluate fault-proneness models in the context of a large Java legacy system development project: (1) compare many data mining and machine learning techniques for building fault-proneness models, (2) assess the impact of using different metric sets, such as source code structural measures and historic change/fault (process) measures, and (3) compare several alternative ways of assessing the performance of the models, in terms of (i) confusion matrix criteria such as accuracy and precision/recall, (ii) ranking ability, measured by the area under the receiver operating characteristic curve (ROC area), and (iii) our proposed cost-effectiveness measure (CE).
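As a rough illustration of the first two kinds of evaluation criteria, the sketch below computes confusion-matrix criteria (accuracy, precision, recall) and the ROC area from a classifier's predicted fault probabilities. The data, the 0.5 threshold, and the helper functions are hypothetical, not taken from the thesis:

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Count TP/FP/TN/FN, classifying a class as fault-prone when
    its predicted probability is at least the threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return tp, fp, tn, fn

def roc_auc(labels, scores):
    """ROC area: the probability that a randomly chosen faulty class is
    ranked above a randomly chosen fault-free class (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: 1 = class contained a fault; scores = predicted probability.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.55, 0.4, 0.6, 0.8, 0.1, 0.3]

tp, fp, tn, fn = confusion_counts(labels, scores)
accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall, roc_auc(labels, scores))
```

Note that the confusion-matrix criteria depend on the chosen classification threshold, while the ROC area summarizes ranking ability across all thresholds, which is one reason the two kinds of criteria can disagree about which model is best.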
The results of the study indicate that the choice of modeling technique has limited impact on the resulting classification accuracy or cost-effectiveness. There are, however, large differences between the individual metric sets in terms of cost-effectiveness: although the process measures are among the most expensive to collect, including them as candidate measures significantly improves the prediction models compared with models that include only structural measures and/or their deltas, both in terms of ROC area and in terms of cost-effectiveness. Furthermore, we observe that what is considered the best model depends heavily on the criteria used to evaluate and compare the models. The regular confusion matrix criteria, although popular, are not clearly related to the problem at hand, namely the cost-effectiveness of using fault-proneness prediction models to focus verification efforts so as to deliver software with fewer faults at lower cost. Consequently, to assess the usefulness of prediction models, we consider the regular confusion matrix criteria to be of less importance, and instead recommend using the ROC area and our proposed measure of cost-effectiveness. Another contribution of this thesis is a statistically based method for the systematic comparison of fault-proneness prediction models. The method can be reused in future studies to guide the selection of optimal prediction models.
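The exact definition of the CE measure is given in the body of the thesis; the sketch below only illustrates the general idea of cost-effectiveness under assumed data: rank classes by predicted fault-proneness, track the fraction of faults covered against the fraction of source lines inspected, and measure the area by which this curve exceeds the y = x baseline of inspecting code in random order. All function names and numbers are hypothetical:

```python
def cost_effectiveness_curve(classes):
    """classes: list of (predicted_probability, loc, faults) tuples.
    Returns (fraction_of_loc, fraction_of_faults) points when classes
    are inspected in descending order of predicted fault-proneness."""
    ranked = sorted(classes, key=lambda c: c[0], reverse=True)
    total_loc = sum(c[1] for c in ranked)
    total_faults = sum(c[2] for c in ranked)
    loc_seen = faults_seen = 0
    points = [(0.0, 0.0)]
    for _, loc, faults in ranked:
        loc_seen += loc
        faults_seen += faults
        points.append((loc_seen / total_loc, faults_seen / total_faults))
    return points

def area_above_baseline(points):
    """Trapezoidal area between the curve and the y = x baseline;
    a positive value means the model beats random inspection order."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * ((y0 + y1) / 2 - (x0 + x1) / 2)
    return area

# Made-up data: (predicted fault probability, lines of code, actual faults).
data = [(0.9, 200, 3), (0.8, 100, 2), (0.3, 1500, 1), (0.1, 700, 0)]
curve = cost_effectiveness_curve(data)
print(round(area_above_baseline(curve), 3))
```

Under this view, a model that concentrates most faults in a small fraction of the code base is rewarded even when its accuracy is modest, which is precisely why a cost-effectiveness criterion can rank models differently from confusion-matrix criteria.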