Abstract:
Code smells can degrade software quality over time: software that contains code smells is
more likely to be change-prone or fault-prone than software without them. If code smells are
not detected in the early phases of software development, the effort required to remove the
issues they cause grows rapidly. Many code smells are described in the literature, and
detecting them is not easy. For this reason, numerous methods for detecting these design
defects have been studied and proposed.
Several automated approaches based on machine learning and deep learning have been
implemented to detect code smells and thereby improve software quality. However, these
code smell detection models consider only a limited number of smells and classify code
smells into binary classes.
This thesis proposes a multi-class classification-based code smell detection system that
considers a considerable number of code smells to overcome these issues. The proposed system
detects code smells by analyzing code metrics. The system is built with ensemble machine
learning and deep learning algorithms with the aim of improving performance. Our system is
designed in two stages: pre-processing and processing. The pre-processing stage consists of
dataset collection, dataset cleaning, transformation, label encoding, and one-hot encoding. To
evaluate our system experimentally, we use the publicly available dataset of Fontana et al.,
which contains metrics extracted from the software systems of the Qualitas Corpus. The
processing stage comprises implementing the classifiers and evaluating the results. In
particular, we implement two ensemble machine learning classifiers, which include Decision
Tree, Random Forest, Support Vector Machine, Naïve Bayes, and Logistic Regression. We also
implement a deep learning classifier, feed it the features as input, and analyze the results.
We perform multi-class classification of code smells and evaluate the results using multiple
evaluation measures. In addition, the results of the best-performing model are validated
using k-fold cross-validation.
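
As an illustration of the pre-processing and validation steps named above, the following is a
minimal sketch in plain Python of label encoding, one-hot encoding, and k-fold index
splitting. It is not the implementation used in this thesis: the helper names, the sample
labels, the "No Smell" class, and the choice of contiguous, unshuffled folds are all
assumptions made for illustration.

```python
def label_encode(labels):
    """Map each distinct class name to an integer id (sorted for determinism)."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes


def one_hot(ids, n_classes):
    """Turn integer class ids into one-hot rows (one column per class)."""
    return [[1 if i == k else 0 for k in range(n_classes)] for i in ids]


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds.

    Assumption: no shuffling or stratification; a real experiment would
    usually shuffle (and often stratify) before splitting.
    """
    sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# Hypothetical method-level labels; "No Smell" is an assumed extra class.
labels = ["Long Method", "Feature Envy", "No Smell",
          "Long Method", "Switch Statement"]
ids, classes = label_encode(labels)
targets = one_hot(ids, len(classes))
folds = list(k_fold_indices(len(labels), k=2))
```

Each sample appears in exactly one test fold, so every prediction used for the cross-validated
score comes from a model that did not see that sample during training.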
Our system can detect six code smells: Long Method, Feature Envy, Long Parameter List, and
Switch Statement at the method level, and God Class and Data Class at the class level. The
comparative analysis of the experimental results shows that the Artificial Neural Network
achieves the highest accuracy: 99.57% at the method level and 98.77% at the class level.
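
For readers unfamiliar with multi-class evaluation, the sketch below shows how accuracy and
one additional measure can be computed from predicted and true labels. The abstract does not
list which evaluation measures were used, so the macro-averaged F1 score here is an assumed
example, and the labels are made-up toy data, not results from this thesis.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy class-level example (hypothetical labels, not thesis data).
y_true = ["God Class", "Data Class", "God Class", "Data Class"]
y_pred = ["God Class", "Data Class", "Data Class", "Data Class"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Macro averaging gives each class equal weight, which matters when some smells are much rarer
than others in the dataset.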