-
Notifications
You must be signed in to change notification settings - Fork 0
/
4_empirical.tex
166 lines (136 loc) · 9.27 KB
/
4_empirical.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
\chapter{Empirical Study Design}
%Start with a short description about the goal/research questions
% from a different point of view: Hypothesis. H1 it's possible to create a high quality dataset SATD/fixed using GitHub Java open-source projects and SATD pattern matching
% H2 it's possible
The goal of this exploratory study is to evaluate the accuracy of our approach for technical debt identification.
This section provide details about the design and planning of the study aimed at assessing our model performances.
In this study we answer the following research questions:
\begin{itemize}
\item \textbf{RQ$_1$} \textit{How does our approach perform in detecting technical debt in unseen methods source code?} This RQ investigates the accuracy of the classifier in identifying technical debt in our dataset.
\item \textbf{RQ$_2$} \textit{What is the correlation between the prediction confidence level and the accuracy of the model?} With this RQ we want to evaluate the usefulness of leveraging the confidence threshold as an indicator of the quality of the predictions. Besides reporting quantitative data, we also qualitatively analyze cases of correct and wrong prediction in the ``high confidence'' scenario. Such an analysis provides lessons learned useful at improving our model.
% \item \textbf{RQ$_3$} \textit{When the prediction confidence level is high, what are the reasons the classifications are correct and incorrect?} With this RQ we want to find possible causes that led the classifier to fail or succeed and explain why this happened in the context of a high level of confidence.
\end{itemize}
%4.1
\section{Context Selection}
% Context Selection [explain the dataset used for the evaluation]
%
%select dbsatds.parent_count,dbsatds.valid , dbsatds.accept, count(*) from dbsatds group by 1,2,3
% 93,061: parent_count, valid, accept = 1
%
% select count(distinct url) from dbsatds where dbsatds.parent_count = 1 AND dbsatds.valid = 1 AND dbsatds.accept = 1
%19,395
%select success,done, count(*) from dbrepos group by 1 ,2
%success|done|count |
%-------|----|------|
% 0| 1| 3629|
% 1| 1|245243|
%tot 248'872
% select count(*) from dbsatds = 141400
% select dbsatds.parent_count,dbsatds.valid , dbsatds.accept, count(*) from dbsatds group by 1,2,3 =
% parent_count|valid|accept|count|
% ------------|-----|------|-----|
% 1| 1| 1|93061|
% 2| 1| 1|26223|
% 1| 0| 1|11762|
% 2| 0| 1| 4015|
% 1| 0| 0| 3802|
% 2| 0| 0| 844|
% 1| 1| -1| 423|
% 1| 1| -2| 373|
% 1| 0| -1| 225|
% 2| 1| -1| 175|
% 2| 1| -2| 115|
% 3| 1| 1| 83|
% 1| 1| 0| 82|
% 2| 0| -1| 68|
% 1| 0| -2| 39|
% 3| 0| 1| 22|
% 2| 1| 0| 17|
% 5| 1| 1| 15|
% 4| 1| 1| 14|
% 5| 0| 1| 7|
% 2| 0| -2| 5|
% 4| 0| 1| 5|
% 6| 1| 1| 3|
% 21| 0| 1| 3|
% 7| 0| 1| 2|
% 8| 1| 1| 2|
% 10| 1| 1| 2|
% 12| 1| 1| 2|
% 21| 1| 1| 2|
% 3| 0| -2| 1|
% 3| 0| -1| 1|
% 3| 1| -1| 1|
% 6| 0| 0| 1|
% 6| 0| 1| 1|
% 8| 0| 1| 1|
% 10| 0| 1| 1|
% 12| 0| 1| 1|
% 32| 1| 1| 1|
% The context of the study consists of 141,400 method bodies annotated with keyword label SATD and their 141,400 fixed counterparts.
%The context of the study consists of 93,061 eligible
The context of the study consists of 245,243 public Java GitHub repositories.
% for context selection: The total number of repositories scanned is 245,243
We targeted GitHub open-source projects written in Java because of three reasons:
\begin{itemize}
\item The large number of accessible projects available.
\item The availability of the commits history.
\item The ease of parsing Java sources thanks to feature rich options of modern parsers.
\end{itemize}
We used GitHub GraphQL API \footnote{\url{https://docs.github.com/graphql}} to retrieve 7,265,342 URLs of public Java repositories created between the start of year 2000 and the end of 2019, thus spanning a time window of 20 years.
The GraphQL API allowed us to request two additional information for each repository: the number of issues and the number of commits. To exclude smaller and less meaningful repositories, we kept only those URLs with issue or commit count greater than 100. This reduced the number of repository URLs to 248,872.
After the clone phase we were left with 245,243 repositories because 3,629 were rejected or failed: 1,726 Android OS repository clones were rejected, 587 clones failed because they were removed or made private and 1,316 for other mixed reasons.
The processing of 245,243 repository commit histories yielded 141,400 method bodies annotated with keyword label SATD and their 141,400 fixed counterparts. For each SATD/fixed pair we collected the following information:
\begin{itemize}
\item Pattern: the word or sentence that identified the SATD.
\item Commit message: the commit message that removed the SATD.
\item Commit hash id.
\item Repository name and url.
\item The verbatim source code of the methods body involved in the change:
\begin{itemize}
\item Before the commit, affected with SATD.
\item After the commit, not affected with SATD, i.e. fixed.
\end{itemize}
\item The `cleaned' version of the bodies from the point above (e.g. without comments. See section \ref{sec:cleaning} for details).
\end{itemize}
Of the 141,400 collected samples, 48,339 were rejected leaving us with 93,061 viable pair. Those rejections were enforced for the following reasons:
\begin{itemize}
\item Merge commits (i.e. commits with more than one parent). The reason being that we were specifically looking for a commit fixing a method affected with SATD and a `merge' operation is not likely to do that.
\item Empty methods.
\item Methods containing inner methods.
\item Methods containing only one instruction being it a `throw exception' statement.
\item Methods with unparsable statements.
\end{itemize}
We selected two different main datasets. The first bigger for the hyperparameter tuning experiment: we filtered for token\_count less than 200; the number of viable pair went from 93,061 to 53,136. The second is a smaller dataset, for additional qualitative analysis, filtered for token\_count less than 50; the number of viable pair went from 93,061 to 9,422.
%
For all the training and testing we always used the following split:
\begin{itemize}
\item 75\% training dataset
\item 15\% test dataset
\item 15\% validation dataset
\end{itemize}
%4.2
\section{Data Collection and Analysis}
%Data Collection and Analysis [How you compute the results]
% questa sezione deve contenere come abbiamo raccolto i dati per rispondere alla RQ e come abbiamo fatto a misurare tale cosa, quindi le metriche per fare l'assess delle performances.
% per tradurre in contenuto la frase precednet:
To answer RQ$_1$ we built a classifier using code2vec and trained it with a dataset of SATD snippets and their fixes.
The training was conducted using 20 epochs; the test performance results were measured on the best epoch as reported by the validation set highest accuracy.
We collected the test accuracy for multiple run with incremental snippet length; the snippet length is measured in token count. Then we discuss how the snippet length influences the model results.
To answer RQ$_2$ we tracked all confidence level for the experiments with token count 50 and 200. Then we report the metrics value for the following confidence thresholds: any, 0.6, 0.7, 0.8 and 0.9; For each of them we compare and comment the precision, recall and F1-score. We present via box plot the prediction count for the different classes (true positive, true negative, false positive and false negative).
%% Bavota-8
%To answer RQ$_3$
%We %trained and
To address the qualitative analysis of RQ$_2$, as we will detail later, we trained and tested multiple models in two scenarios with different code snippet lengths; the analysis focus on samples with high confidence predictions. We trace the AST-paths and the related attention vector weights back to the training instances in order to show and explain reasons of specific outcomes.
%4.3
\section{Replication Package}
%Replication Package [A link to a repo with all data and code]
A replication package is available on GitHub:
\begin{itemize}
\item The source code for repository mining and deep learning model\footnote{\url{https://github.com/simonegiacomelli/code2vec-satd-classifier}}.
\item Two PostgreSQL database backups\footnote{\url{https://github.com/simonegiacomelli/code2vec-satd-classifier-dataset}} containing:
\begin{itemize}
\item The dataset with the mined GitHub repository urls and all the method bodies collected.
\item The Optuna database containing the data of the distributed experiment (i.e. hyperparameters tuning) with all Keras outputs.
\end{itemize}
\end{itemize}