-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathOpenQuestions.tex
executable file
·344 lines (301 loc) · 16.5 KB
/
OpenQuestions.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
\documentclass{sig-alternate}
\usepackage{array}
\usepackage{pifont}
\usepackage{url}
\usepackage{graphicx}
\usepackage{multirow}
\newcommand{\none}{\ding{55}}
\newcommand{\least}{\ding{51}}
\newcommand{\little}{\ding{51}\ding{51}}
\newcommand{\lots}{\ding{51}\ding{51}\ding{51}}
\begin{document}
\pagenumbering{arabic}
\title{Open Questions for Computer Science and Cheminformatics}
\numberofauthors{9}
\author{
\alignauthor
Joerg Kurt Wegner\\
\affaddr{Tibotec BVBA}\\
\affaddr{Turnhoutseweg 30}\\
\affaddr{2340 Beerse Turnhout, Belgium}\\
\email{[email protected]}
% 2nd. author
\alignauthor
Aaron Sterling\\
\affaddr{Department of Computer Science}\\
\affaddr{Iowa State University}\\
\affaddr{Ames, Iowa, USA}\\
\email{[email protected]}
% 3rd author
\alignauthor
Rajarshi Guha\\
\affaddr{NIH Center for Translational Therapeutics}\\
\affaddr{9800 Medical Center Drive}\\
\affaddr{Rockville, MD 20850}\\
\email{[email protected]}
}
\additionalauthors{Additional authors:
Andreas Bender (University of Cambridge, email: {\texttt{[email protected]}}),
Jean-Loup Faulon (University of Evry, email: {\texttt{[email protected]}}),
Janna Hastings (European Bioinformatics Institute, Cambridge, UK, email: {\texttt{[email protected]}}),
Noel O'Boyle (University College Cork, Cork, Ireland, email: {\texttt{[email protected]}}),
John Overington (European Bioinformatics Institute, Cambridge, UK, email: {\texttt{[email protected]}}),
Herman Van Vlijmen (Tibotec, Beerse, Belgium, email: {\texttt{[email protected]}}), and
Egon Willighagen (Karolinska Institutet, Stockholm, Sweden, email: {\texttt{[email protected]}})
.}
\maketitle
%
We offer the following open questions to suggest concrete interdisciplinary research directions for computer scientists working with chemoinformaticians. This list is by no means exhaustive. We have focused on questions that, in our opinion, are answerable, and whose answers would be of great interest to the field. The questions are grouped roughly by computer science subarea, and include both open theoretical problems and ``requests'' for open-source practical implementations. We hope CS researchers new to cheminformatics find this useful.
\section*{Algorithmic graph theory}
\begin{enumerate}
\item \emph{Design an algorithm that approximately counts the number of (3D) conformers of a chemical formula.}
This question encompasses the open questions of \emph{approximately
counting the number of stereoisomers, or the number of tautomers,
of a chemical formula}. Goldberg and Jerrum designed an algorithm
which, given a chemical formula, would output an isomer of that
formula chosen uniformly at random~\cite{RandomlySampling}. Perhaps
the main theoretical obstacle to overcome is that molecules tend to
be \emph{chiral} (that is, they have a ``handedness'' or 3D
orientation), whereas traditional graph theory (and the
Goldberg/Jerrum algorithm) treats graphs with identical vertices and
edges as isomorphic. So part of the question could be rephrased as,
``Given a labeled vertex set, count the number of structures that
are identical on that vertex set, except that they differ in their
3D orientation.'' Mathematical chemists have designed measures of
molecular chirality~\cite{ChiralityMeasures}. A resolution of this
problem may require connecting graph enumeration algorithms to knot
theory or other topics in topology~\cite{TopologicalLook}.
%
\item \emph{Design and implement efficient subgraph isomorphism
algorithms for useful special cases of molecular graphs}.
The Subgraph Isomorphism Problem is known to be
$\textsf{NP}$-complete (hence possibly harder than the Graph
Isomorphism Problem, which is not known to be
$\textsf{NP}$-complete). Nevertheless, finding maximum common
subgraphs to match chemical structures is of fundamental importance
in chemistry; hence, much work has been invested in partial
solutions to this problem. Raymond and Willet reviewed the state of
the art in 2002~\cite{MCSreview}. As one possible way to attack
this problem, we note the empirical fact that molecular graphs are
of bounded degree, and are observed to have \emph{treewidth} $\leq
5$~\cite{treewidth}. Bounded treewidth is at least theoretically
useful~\cite{Epp-JGAA-99}, and it may be possible to improve on
current open-source implementations whose isomorphism-checking
routines are written for all graphs, instead of taking advantage of
special properties of chemical graphs.
\item \emph{Searching within complex data types, e.g. molecules, for semantic web approaches}.
One key concept of the linked data web, the semantic web, is that
different data sources can be readily integrated with each
other. Still, in the field of Cheminformatics, we are not only
interest in linking two molecules (the linking normalization problem
for different protomers, tautomers, or special cases of isomerisms
remain open), but we are also interested in being able to search
efficiently within molecules when being linked via semantic web
approaches. Typical searches will require being able to apply
substructure or similarity searches. What could be algorithmic
solutions for this?
\end{enumerate}
\section*{Data mining \& machine learning}
\begin{enumerate}
\item \emph{Small molecules \& phenotypic data}
Phenotypic (where one takes images of cells and then analyzes them
extract numerical features from the image) screens are increasingly
common. These types of screens lead to very large, very high
dimensional datasets.
\begin{itemize}
\item What methods are suitable to perform feature selection and
modeling of such data
\item Does phenotype-derived data lead to to ``better'' models for
small molecules compared to the usual structure-derived data?
\end{itemize}
\item \emph{Inverse QSAR (or de-novo design)}
(statistical) QSAR is an example of traditional data mining where one
correlates a set of independent variables to a dependent
variable. This lets one predict properties of a new molecule based on
its structural features and those of a training set. The inverse QSAR
problem, also de-novo design, is such that given a molecule structure representation
(usually a descriptor vector), what are the possible input structures
that satisfy the representation. This process might be considered as
a chemical design of experiments. It is clearly a non-continuous optimization problem, since
not all chemical molecules might be accessible, and cost/risk to create molecules phsically is another
critical factor. This can be made more complex, by
asking that the input structures also satisfy an experimental property
range.
\begin{itemize}
\item What methodologies can be devised to address this problem?
\item How can we ensure combinatorically created suggestions make use of the large
chemical structure, chemical building block, and chemical reaction databases by suggesting
chemically feasible molecules (not just virtual accessible ones)? How can we improve
synthetic accessibility predictions \cite{Boda_Seidel_Gasteiger_2007}?
\item How does a given methodology allow us to translate the
descriptor space to a minimal region of the input space?
\item Are there internal features of a modeling algorithm (say
hyperplanes in a SVM approach) that lets us simplify the problem?
\end{itemize}
%\item \emph{Efficient molecule browsing, e.g. on scaffold level}.
%
%Chemical Abstract Services have a molecule browsing tool called SubScape, which allows to browse large-scale
%chemical spaces efficiently. What could be large-scale solutions for doing this within (combined and aligned)
%public databases.
%
\item \emph{Dynamic similarity search on instant binary vectors}.
Binary feature vectors are a common practice for chemical similarity
searches. The typical process starts with 1. creating binary
substructure (or other feature vectors) \cite{citeulike:8530538},
2. creating fixed-length binary vectors of typically 1024 bits for
reducing space requirements and speeding up further similarity
searches (by loosing some accuracy), 3. creating further
pre-computations for speeding up threshold based similarity searches
\cite{doi:10.1021/ci800076s}.\\ Still, if we are interested to employ
dynamic changes in the similarity encodings, e.g. using only a set
of binary features, then previously done hashing or pre-computations
might need to be redone efficiently on the fly. Finally, the major
goal is to employ similarity searches \cite{doi:10.1021/ci200235e}
on a scale of multiple million entries and more optimizations and
benchmarking studies are urgently required, e.g. using GPUs
\cite{doi:10.1021/ci1004948}, or optimizing pair-wise similarity
calculations \cite{MINF:MINF201100050}.
%
\item \emph{Chemical image/text mining in patents (curation)}.
There are various tools for doing automatic text mining on chemical
patents. Still, the overall acceptance rate of chemical text mining
is improvable, since many medicinal chemists are very concerned
about the data quality of such efforts.
\begin{itemize}
\item What could be done to improve the mining quality, curate the
obtained data, and to provide confidence level estimations for
each molecule coming from patent mining?
\item Do require image2structure and text2structure mining also data
stores for ensuring a sufficient amount of confidence and data
quality?
\item How can patent mining be used to create new drugs faster or to
speed-up collaboration/licensing discussions?
\end{itemize}
\item \emph{Large-scale vectorial versus kernels molecule similarity}
Vectorial molecule encodings can serve as efficient approximations
of molecules. Sometimes non-vectorial molecular 3D shape or
molecule kernel comparisons might be more suitabe to compare
molecules, since they might better correlate with activities. One
key problem is that non-vectorial encodings require to compare all
molecules (or their 3D conformational explosions) in a pair-wise
manner. This becomes prohibitively expensive when considering
millions of molecules. Can dyadic data approaches help
\cite{Hochreiter:2006:SVM:1159508.1159516}? Other approximations or
cascading flows?
%
\item \emph{Using multiple annotations for improving molecular mining/predictions (chemogenomics)}
As an example: Biological activities might not be independent of
each other, but have a certain correlation between each other. In
Chemogenomics this is used for creating models of combining
molecules with protein sequences, molecules with active sites of
proteins, or molecules with biological activities of multiple
assays. How can we optimize such highly complex mining scenarios,
especially when considering large-scale data sets with hundred of
thousands molecules and thousands of biological activities? How can
we combine, mine, and visualize categorial and continuous output
variables, e.g. hydrophobicity of a molecule and toxicity in humans,
by still being able to make concrete proposals to medicinal
chemistry? Is analoging (creating very small modifications of a
molecule and measuring its activities) really the most efficient way
forward? If we test molecules, should we test it in a single
biological assay or in multiple biological assays, if multiple,
which ones? If a company does not have a biological assay within
reach, which other partner could offer testing a molecule within two
days (vendor matching based on licenses or contracts)?
\end{enumerate}
\section*{Databases \& software engineering}
\begin{enumerate}
\item \emph{Real time substructure searching in massive chemical
databases}.
Chemical databases have grown tremendously in size. A common task in
such databases is substructure searching. However most cases of very
large structure databases do not support substructure in real time
(something that would allow applications such as
``type-ahead''). How can one enable such rapid substructure searches
on massive ($> 10^9$ molecules) collections of structures? This
problem has various aspects:
\begin{itemize}
\item What type of indexing schemes will support rapid substructure searches?
\item How can other molecular properties be included in substructure
searchs that allow ranking of results, all the while maintaining
real-time response (i.e. $< 1$ sec)?
\item What type of database architectures are required for this
scale of structure searching?
\end{itemize}
\item \emph{Large scale conformational databases}.
While many programs are availabel to generate 3D conformations, on
the fly generation for large collections can be time
consuming. Rather, can we store massive conformer collections and
support 3D searches over them? Beyond generating the conformers
themselves, this problem has several aspects covering database
design, parallel systems and software engineering:
\begin{itemize}
\item How many conformers are required to provide \emph{sufficient} coverage?
\item What representation will be used to store conformers and run
queries? How does the choice of representation affect the type of
queries we can run?
\item Most 3D similarity approaches either employ a vectorial
representation or a volumetric representation. How can these be
efficiently indexed?
\item What type of parallel infrastructure can be used to speed up queries?
\end{itemize}
\item \emph{Database indexing schemes for chemical representations}.
Databases of chemical structures are ubiqituous, employing a variety
of RDBMS products. Structure based queries (exact match, substructure,
similarity) are dependent on the chemical representation - most
solutions employ a linear string based form (SMILES, InChI) and
depending on the nature queries to be supported, some are preferred
over others. While one can perform linear scans, indexing is key to
efficient query performance.
\begin{itemize}
\item Since many queries use binary fingerprints as pre-screens, what
types of indexing schemes can be designed to support queries on
binary vectors?
\item If we choose to support 3D searches (shape, pharmacophore) what
indexing scheme will allow us to perform these types of queries
efficiently?
\end{itemize}
\item \emph{Map/Reduce in cheminformatics}
Many cheminformatics tasks apply algorithms over large input files
or across many molecules (similarity and substructure search,
docking, filtering). A simple way to parallelize this is to chunk
the input data and let individual threads/nodes process each chunk. A
trivial solution is to manually chunk the input and submit a series
of jobs to a scheduling system. Using the Hadoop ecosystem as an
example, we can ask various questions:
\begin{itemize}
\item Can we employ modern frameworks like Hadoop to support
embarassingly parallel cases (filtering $10^7$ molecule libraries)?
\item Can we develop a generalized framework that includes the
requisite cheminformatics tools that allows users to seamlessly
distribute jobs over Hadoop and other map reduce systems?
\item Is the latency involved in Hadoop based datastores (HBase)
worth the ability to handle massive molecule collections and run
M/R queries across them?
\item Going further, are there cheminformatics problems that can make use
of the map/reduce paradigm at the algorithmic level?
\end{itemize}
\end{enumerate}
\section*{Enterprise software (KM,ELN)}
We know that the enterprise software and ELN market is still growing.
\begin{enumerate}
\item \emph{Public-private collaboration and security scenarios}
Let us assume an organization, e.g. a commercial company, has a
single or a small number of established KM and ELN products. How
can we improve the maintenance, leveraging, and collaboration with
many external partners (each of them potentially with another KM/ELN
solution)? Which party is hosting which data in which data structure
(ontologies?), and how can we ensure that only pre-defined data
entries (and a limited number of annotations, e.g. biological
activities) are visible to a partner. How can this be organized for
a multitude of partners? Cloud computing, user management,
encryption granularity and efficient security management?
\item \emph{Licensing in a parallel world} Many software vendors use
different solutions for parallizing compute jobs: SGE, PVM, MPI,
etc. Is a cloud really an option? What about SaaS with secured data
transfer? Can this also offer alternative licensing strategies for
software suites in this domain?
\end{enumerate}
\bibliographystyle{abbrv}
\bibliography{paper}
\end{document}