-
Notifications
You must be signed in to change notification settings - Fork 0
/
5-conclusion.tex
214 lines (181 loc) · 10.2 KB
/
5-conclusion.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
% !TEX root = main.tex
\section{Conclusion}
In this project, we properly defined what it means if a program has unsafe dataflow.
We proposed an algorithm schema that computes if a program has unsafe dataflow.
Furthermore, we proved that any algorithm implementing this schema is sound.
We implemented a database schema and library in the CodeQL framework.
On top of that, we provide a dataflow algorithm that fits our proposed dataflow algorithm schema.
Thus, it is a sound dataflow algorithm.
To close the loop, we demonstrate how to recover a part of the algorithm description
in regular first-order predicate logic.
Furthermore, we extended the production dataflow library employed by CodeQL for analysing
programs in Java by a notion of call sensitivity.
This reduces the number of false positives.
Our extension was accepted upstream and is now part of the CodeQL codebase and deployed
to customers around the world.
\subsection{Future Work}
There are several areas of future work.
The theoretical framework can be improved by making a distinction in the soundness theorem
between sinks that (if they are executed) always will abort the program,
and sinks that are safe to execute in some contexts.
The section about call sensitivity already outlines possible future improvements
on the code for call sensitivity.
Another area which can be investigated is making the dataflow analysis path-sensitive.
\subsection{Path-Sensitive Dataflow}
Initially, the goal of this project was to make the dataflow analysis path-sensitive
to reduce false positives.
Code like in~\autoref{lst:ex-ps-1} is currently (wrongly) detected as having unsafe dataflow
by the CodeQL dataflow library.
\begin{listing}[h]
\begin{javacode}
public void f(boolean b) {
Object o = null;
if (b) {
o = source();
}
if (!b) {
sink(o);
}
}
\end{javacode}
\caption{Simple example where path-sensitive dataflow would reduce false positives}
\label{lst:ex-ps-1}
\end{listing}
However, implementing a path-sensitive dataflow analysis in QL without introducing
introducing soundness issues proved to be too difficult.
We focussed on suppressing nodes from the dataflow graph, instead of edges,
which didn't turn out to be a good approach.
Furthermore, the dataflow graph is a very coarse representation of the program.
It cannot easily be extended with all the information that is needed to track the outcomes
of tests, while still being a compact representation.
After failing with path-sensitivity, we turned to understanding the dataflow problem better.
For that, we developed the formalism introduced in section 2, and we showed how that
theory connects to the practically relevant QL implementation in section 3.
Now that all the theoretical framework is in place, it is much easier to investigate path-sensitivity.
Unfortunately, due to time constraints, we are not able to investigate path-sensitivity here.
\iffalse
\subsection{Path Sensitivity}
The initial goal of this project was to make the dataflow analysis path-sensitive
to reduce false positives. Code like in~\autoref{lst:ex-ps-1} is currentyl detected
as having (potentially) dataflow.
\begin{listing}[h]
\begin{javacode}
public void f(boolean b) {
Object o = null;
if (b) {
o = source();
}
if (!b) {
sink(o);
}
}
\end{javacode}
\caption{Simple example of path-sensitive dataflow}
\label{lst:ex-ps-1}
\end{listing}
\begin{listing}[h]
\begin{javacode}
public void f() {
boolean[] boolArray = {true, false};
Object x = null;
for(boolean b: boolArray) {
if (b) {
x = source();
}
if (!b) {
sink(x);
}
}
}
\end{javacode}
\caption{Example of a false negative with the implemented algorithm}
\label{lst:ps-false-negative}
\end{listing}
However, the initial implementation had a serious soundness problem, that only was discovered
quite late in the implementation process.
Namely, it was wrongly not reporting code like in~\autoref{lst:ps-false-negative} to have dataflow.
It boils down that the implementation was ignoring dataflow in~\autoref{lst:ps-false-negative},
because both variable reads refer to the same variable (even in SSA), but the
variable can have different values in different loop iterations.
This unsoundness was understandably not acceptable to the CodeQL developers.
The focus of the project then shifted to understanding the dataflow algorithm better,
which is when sections 2 and 3 were developed.
Now, using these tools to formalize a path-sensitive dataflow algorithm, proving it sound
and also providing a sound implementation for Java is not feasible in the time-constraints
of this project.
However, using the understanding gained while developing the theoretical part of the project,
we can at least outline a possible solution to the soundness problem.
While this solution has good chances of being sound, it raises performance concerns
that might make it prohibitive to implement in a intraprocedural setting.
\subsubsection*{Changes to the Algorithm}
In order to make the algorithm path-sensitive, we have to look at the labels.
Right now, the algorithm labels nodes with $\labclean$ and $\labtracked$.
This needs to be changed to track labels together with a dynamic set of constraints.
These constraints describe the values of the conditionals that need to be true for a value
to propagate from its creation (a node in the dataflow graph that introduces a new label)
to any given node. Thus, the constraints in the set are viewed as conjunction.
During label propagation, dataflow nodes in a then-branch of an if statement attach the condition of the
if to their labels, and in the else-branch the negation of that if statement.
The same propagation rule is implemented for while loops.
If this rule leads to an unsatisfiable constraint set, i.e.\ the conjunction of its
elements will always evaluate to false, regardless of the variables in the program,
the label is dropped.
An example of an unsatisfiable constraint set would be $\{b, \neg b\}$.
Note that getting constraints from nodes in the the dataflow graph alone would only result in a subset of all available
conditions be attached to the labels, as the dataflow graph is a very coarse subset of
the control-flow graph. Thus, there is a loss of precision here that would only be alleviated
if switching to propagate the dataflow information directly on the control-flow graph.
That would have huge performance implications, that might be untenable in practice.
At $\varphi$-nodes, both incoming labels are preserved if they have different labels or different constraint sets.
If the same label is attached to a node with different constraint sets, it means that
either one of the constraint set holds for that label.
Furthermore, and this is the important difference to our na\"ive implementation suffering from
soundness problems is that constraints in the constraint set sometimes need to be dropped
during dataflow propagation.
Specifically, all constraints that refer to variables that go out of scope need to be dropped.
Determining which labels to drop exactly based on the dataflow graph alone might
be very difficult, if not impossible.
This is because data can flow from a dataflow node to another node,
and all SSA variables referenced in the constraints are live at both nodes.
However, the variables could have been re-defined in the meantime, thus possibly having new values.
This is mainly caused by the dataflow graph skipping over some parts of the
control-flow graph, so label propagation might not see the re-definition of variables.
A main concern for a sound implementation is thus figuring out when variables in the
constraint set go out of scope, and the constraint needs to be dropped.
At the end, dataflow is detected if a sink has at least one label of type $\labtracked$,
regardless of the constraint set.
\subsubsection*{Changes to the Theory}
The theoretical model of dataflow would need to change quite a bit to accomodate
a soundness proof of the algorithm outlined above.
This section is purely speculative, as we haven't done the actual work, and it might turn out
that it turns out entirely different.
In our discussion of the dataflow algorithm without path-sensitivity, we have two slightly
different approaches in theoretical world, that keeps track of annotated types, whereas
the algorithm implementation computes these types by using label sets.
If each label gets a constraint set, and labels of the same type (but with different constraint sets)
can coexist in the same set, this needs to be modelled on the type level as well.
Furthermore, the typing rules need to allow for the possibility of adding constraints at
branch points (i.e.\ while and if statements).
This formulation should be flexible enough that not every condition \emph{needs}
to contribute to the constraint set, so conditions that are too complicated can be skipped
by the implementation.
However, on the other hand, the framework needs to be strong enough to be able to prove that if
two constraints in a constraint set conflict, that the label can be dropped.
Not having a label assigned at a node after the label propagation stops needs somehow to
result in an annotated type that markes variables as clean.
We can conclude this thought experiment by stating that the algorithm implementation and
the algorithm specification (that should be proven to be sound) need a tighter coupling
than the framework presented in this project report provides.
\subsubsection*{Practical Feasibility}
The main concern with path-sensitive extensions of the dataflow algorithm in practice is
that it is unlikely to work purely on the dataflow graph.
Probably more than less involvement of the control-flow graph would be needed.
While an algorithm for that could certainly be developed, and, in a reduced setting like
dIMP, be proven sound, it would probably perform badly.
The size of the control-flow graph is much bigger than the carefully optimized dataflow graph.
Especially in an intraprocedural setting, the whole control-flow graph of a program
consisting of several millions of lines of code is too big run analyses on.
Thus, without further research and some good ideas, it is probably not a good idea
to spend significant ressources on making an intraprocedural dataflow algorithm path-sensitive
as outlined above.
\fi