---------------------------------------------------------------
-- BudgetedSVM: A Toolbox for Large-scale SVM Approximations --
---------------------------------------------------------------
BudgetedSVM is simple, easy-to-use, and efficient software for large-scale,
non-linear, budgeted classification through approximations of SVM models.
This document explains the use of BudgetedSVM. For help with the Matlab/Octave
interface, please see "./matlab/readme.txt".
Please read the "./LICENSE.txt" license file before using BudgetedSVM.
The toolbox also includes two source files ("./src/libsvmwrite.c" and
"./src/libsvmread.c") and uses some code from the LIBSVM package; please
read "./COPYRIGHT.txt" before using BudgetedSVM for the terms and conditions
pertaining to these parts of the toolbox.
Table of Contents
=================
- Table of Contents
- Version history
- The implemented algorithms
- Installation and Data Format
- "budgetedsvm-train" Usage
- "budgetedsvm-predict" Usage
- Examples
- Library Usage
- Matlab/Octave Interface
- Additional Information
- Acknowledgments
Version history
===============
The current version of the software is v1.2. Compared to the previous v1.1 version,
we have made the following changes:
- Added an implementation of the Growing AMM (GAMM) algorithm.
- The GAMM algorithm extends the AMM method with two additional parameters:
the cloning probability and the cloning probability decay.
Compared to the earlier v1.0 version, version v1.1 introduced the following changes:
- (OBSOLETE) The software is no longer published under the GPL v3 license; instead,
we publish BudgetedSVM under the less restrictive LGPL v3 license.
- The software is no longer published under the LGPL v3 license; instead, we
publish BudgetedSVM under the less restrictive Modified BSD license.
- We changed the way certain parameters are set in the option string, so please
be very careful when using earlier scripts for running BudgetedSVM. Refer
to the "'budgetedsvm-train' Usage" and "'budgetedsvm-predict' Usage" sections of
this readme file, or to the command-prompt help, for more details. For example,
the kernel width is no longer specified with '-G'; use '-g' instead.
- Added '-r' option for turning on/off randomization of the training data.
- Added a number of kernel functions to be used with the kernel-based algorithms,
such as the exponential, sigmoid, and polynomial kernels. The kernel parameters
are controlled with the '-g', '-d', and '-i' options. Please refer to the
"'budgetedsvm-train' Usage" section of this readme file, or to the command-prompt
help, for more details.
- It is no longer necessary to specify the dimensionality of the data. However,
note that directly specifying the dimensionality through the '-D' option can
result in faster loading times.
- In BSGD, we check whether two identical vectors appear in the support vector set,
and either merge them (in the case of the merging strategy) or remove the newer
one (in the case of the random removal strategy), leading to better performance.
- Fixed the randomization of LLSVM; previously, even if randomization was switched
off, you could still obtain different results between runs.
- Added class scores to the outputs; please see '"budgetedsvm-predict" Usage' for
more details regarding these output scores for different algorithms.
- Many smaller code changes that hopefully resulted in better, more readable code.
- Bug fixes (Thanks to all users who reported bugs and provided feedback!).
The implemented algorithms
==========================
The BudgetedSVM toolbox implements the Pegasos, Adaptive Multi-hyperplane Machines (AMM) and
Growing AMM (GAMM), Low-rank Linearization SVM (LLSVM), and Budgeted Stochastic Gradient Descent (BSGD)
algorithms. An overview of the algorithm properties is given in the table below:
--------------------------------------------------------------------------------------------------------------
| Algorithm | Classifier type | Multi-class? | Available kernels |
==============================================================================================================
| Pegasos | Linear | Multi-class | Linear |
| (G)AMM | Non-linear | Multi-class | Linear |
| LLSVM | Non-linear | Binary | Any |
| BSGD | Non-linear | Multi-class | Any for random removal, Gaussian when merging support vectors |
--------------------------------------------------------------------------------------------------------------
For more details, please see their respective published papers. In particular,
the publications can be found here:
*** "Pegasos: primal estimated sub-gradient solver for SVM", ICML 2007 (Pegasos, found at
http://link.springer.com/article/10.1007/s10107-010-0420-4)
*** "Trading Representability for Scalability: Adaptive Multi-Hyperplane Machine for
Nonlinear Classification", KDD 2011 (AMM, found at "./doc/pdfs_of_algorithm_papers/AMM_paper.pdf")
*** "Growing Adaptive Multi-Hyperplane Machines", ICML 2020 (GAMM, found at "./doc/pdfs_of_algorithm_papers/GAMM_paper.pdf")
*** "Scaling up Kernel SVM on Limited Resources: A Low-rank Linearization Approach", AISTATS 2012
(LLSVM, found at "./doc/pdfs_of_algorithm_papers/LLSVM_paper.pdf")
*** "Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for
Large-Scale SVM Training", JMLR 2012 (BSGD, found at "./doc/pdfs_of_algorithm_papers/BSGD_paper.pdf")
For our BudgetedSVM paper (JMLR 2013) which gives an overview of the toolbox and summarizes its
main features, please see a PDF file at "./doc/pdfs_of_algorithm_papers/BudgetedSVM_paper.pdf".
Installation and Data Format
============================
On Unix systems, type `make' to build the `budgetedsvm-train' and `budgetedsvm-predict'
programs. Type `make clean' to delete the generated files. Run the programs without
arguments for a description of how to use them.
We note that the authors have successfully tested the toolbox on the following platform:
>> gcc -v
gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
Data format
-----------
The format of training and testing data file is as follows:
<label> <index1>:<value1> <index2>:<value2> ...
.
.
.
Each line contains one instance and ends with a '\n' character. For
classification, <label> is an integer indicating the class label
(multi-class is supported). See './a9a_train.txt' for an example.
For further details about the LIBSVM format, please see the following webpage:
http://www.csie.ntu.edu.tw/~cjlin/libsvm
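The sparse format above can be produced and consumed with a few lines of code. The following is a minimal, hypothetical Python sketch (not part of the BudgetedSVM toolbox) that serializes and parses one line of LIBSVM-format data; the helper names are our own:

```python
# Sketch of the LIBSVM sparse format: "<label> <index1>:<value1> ..."
# (illustrative only; helper names are not from the BudgetedSVM sources)

def write_libsvm_line(label, features):
    """Serialize one example; `features` maps 1-based feature index -> value."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label} {pairs}"

def parse_libsvm_line(line):
    """Parse one example back into (label, {feature_index: value})."""
    tokens = line.split()
    label = int(tokens[0])
    features = {}
    for tok in tokens[1:]:
        idx, val = tok.split(":")
        features[int(idx)] = float(val)
    return label, features

line = write_libsvm_line(1, {3: 1.0, 11: 0.5})
assert line == "1 3:1.0 11:0.5"
assert parse_libsvm_line(line) == (1, {3: 1.0, 11: 0.5})
```

Note that feature indices are 1-based and only non-zero features are stored, which is what makes the format efficient for sparse data sets such as adult9a.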
`budgetedsvm-train' Usage
=========================
In order to get a detailed usage description, run the budgetedsvm-train program
without providing any arguments to obtain the following instructions:
Usage:
budgetedsvm-train [options] train_file [model_file]
Inputs:
options - parameters of the model
train_file - url of training file in LIBSVM format
model_file - file that will hold a learned model
--------------------------------------------
Options are specified in the following format:
'-OPTION1 VALUE1 -OPTION2 VALUE2 ...'
Following options are available; affected algorithm and default
values in parentheses (algorithm not specified if option affects all):
A - algorithm, which large-scale SVM approximation to use (2):
0 - Pegasos
1 - AMM batch
2 - AMM online
3 - LLSVM
4 - BSGD
D - dimensionality (faster loading if set, if omitted inferred from the data)
B - limit on the number of weights per class in AMM, OR
total SV set budget in BSGD, OR number of landmark points in LLSVM (50)
L - lambda regularization parameter; high value -> less complex model (0.00010)
b - bias term, if 0 no bias added (1.0)
e - number of training epochs (AMM, BSGD; 5)
s - number of subepochs (AMM batch; 1)
k - pruning frequency, after how many observed examples is pruning done (AMM; 10000)
c - pruning threshold; high value -> less complex model (AMM; 10.00)
K - kernel function (0 - RBF; 1 - exponential, 2 - polynomial; 3 - linear,
4 - sigmoid; 5 - user-defined) (LLSVM, BSGD; 0)
g - RBF or exponential kernel width gamma (LLSVM, BSGD; 1/DIMENSIONALITY)
d - polynomial kernel degree or sigmoid kernel slope (LLSVM, BSGD; 2.00)
i - polynomial or sigmoid kernel intercept (LLSVM, BSGD; 1.00)
m - budget maintenance in BSGD (0 - removal; 1 - merging, uses Gaussian kernel), OR
landmark sampling strategy in LLSVM (0 - random; 1 - k-means; 2 - k-medoids) (1)
C - clone probability when misclassification occurs in AMM (0)
y - clone probability decay when misclassification occurs in AMM (0.99)
z - training and test file are loaded in chunks so that the algorithms can
handle budget files on weaker computers; z specifies number of examples
loaded in a single chunk of data (50000)
w - model weights are split in chunks, so that the algorithm can handle
highly dimensional data on weaker computers; w specifies number of
dimensions stored in one chunk (1000)
S - if set to 1 data is assumed sparse, if 0 data assumed non-sparse; used to
speed up kernel computations (default is 1 when percentage of non-zero
features is less than 5%, and 0 when percentage is larger than 5%)
r - randomize the algorithms; 1 to randomize, 0 not to randomize (1)
v - verbose output; 1 to show the algorithm steps, 0 for quiet mode (0)
--------------------------------------------
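For reference, the kernel families selected by the '-K' option (0 through 4) correspond to the common textbook definitions sketched below, with '-g' as gamma, '-d' as the degree/slope, and '-i' as the intercept. This is an illustrative Python sketch only; the exact parameterization used internally by BudgetedSVM may differ, so consult the reference manual before relying on it:

```python
# Common definitions of the kernels selectable via '-K' (assumed forms,
# not taken verbatim from the BudgetedSVM sources):
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def rbf(x, y, gamma):            # -K 0: exp(-gamma * ||x - y||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def exponential(x, y, gamma):    # -K 1: exp(-gamma * ||x - y||)
    return math.exp(-gamma * math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))))

def polynomial(x, y, degree, c): # -K 2: (x . y + c)^degree
    return (dot(x, y) + c) ** degree

def linear(x, y):                # -K 3: x . y
    return dot(x, y)

def sigmoid(x, y, slope, c):     # -K 4: tanh(slope * x . y + c)
    return math.tanh(slope * dot(x, y) + c)
```

Note that for any x, rbf(x, x, gamma) equals 1.0, which is why the RBF kernel can be read as a similarity score between 0 and 1.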
The model is saved in a text file which has the following rows:
[ALGORITHM, DIMENSION, NUMBER OF CLASSES, LABELS, NUMBER OF WEIGHTS, BIAS TERM, KERNEL WIDTH, MODEL]
To compress the model and use memory efficiently, we encode the model in the following way:
For AMM batch, AMM online, PEGASOS: The model is stored so that each row of the text file corresponds
to one weight. The first element of each weight is the class of the weight, followed by the degradation
of the weight. The rest of the row corresponds to non-zero elements of the weight, given as
feature_index:feature_value, in a standard LIBSVM format.
For BSGD: The model is stored so that each row corresponds to one support vector (or weight). The
first elements of each row correspond to the alpha parameters for each class, given in the order of
the "labels" member of the Matlab structure. However, since an alpha can be equal to 0, we use the LIBSVM
format to store the alphas, as -class_index:class-specific_alpha, where the '-' (minus sign) in front of
the class index differentiates class indices from the feature indices that follow. After the
alphas, the elements of the weight (or support vector) for each feature are given in the same row,
in LIBSVM format.
For LLSVM: The model is stored so that each row corresponds to one landmark point. The first element of
each row corresponds to the element of the linear SVM hyperplane for that particular landmark point. This is
followed by the features of the landmark point, in the original feature space of the data set, in LIBSVM format.
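The BSGD row layout described above (negative-index alpha pairs followed by positive-index feature pairs) can be disentangled by checking the sign of each index. The following is a hypothetical Python sketch assumed from the description, not taken from the BudgetedSVM sources:

```python
# Sketch: split one BSGD model row into per-class alphas and the
# support vector's features. A '-' before the index marks an alpha
# pair (-class_index:alpha); positive indices are feature pairs.
# (Assumed layout; verify against the actual model files.)

def parse_bsgd_row(row):
    alphas, features = {}, {}
    for tok in row.split():
        idx, val = tok.split(":")
        idx = int(idx)
        if idx < 0:
            alphas[-idx] = float(val)   # class-specific alpha
        else:
            features[idx] = float(val)  # support-vector coordinate
    return alphas, features

alphas, sv = parse_bsgd_row("-1:0.5 -2:-0.5 3:1.0 7:0.25")
assert alphas == {1: 0.5, 2: -0.5}
assert sv == {3: 1.0, 7: 0.25}
```

The sign check is exactly why the minus sign is added in the model file: it lets a single LIBSVM-style token stream carry both class indices and feature indices without ambiguity.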
`budgetedsvm-predict' Usage
===========================
In order to get a detailed usage description, run the budgetedsvm-predict program
without providing any arguments to obtain the following instructions:
Usage:
budgetedsvm-predict [options] test_file model_file output_file
Inputs:
options - parameters of the model
test_file - url of test file in LIBSVM format
model_file - file that holds a learned model
output_file - url of file where output will be written
--------------------------------------------
Options are specified in the following format:
'-OPTION1 VALUE1 -OPTION2 VALUE2 ...'
The following options are available (default values in brackets):
z - the training and test file are loaded in chunks so that the algorithm can
handle budget files on weaker computers; z specifies number of examples
loaded in a single chunk of data (50000)
w - the model weight is split in parts, so that the algorithm can handle
highly dimensional data on weaker computers; w specifies number of
dimensions stored in one chunk (1000)
S - if set to 1 data is assumed sparse, if 0 data assumed non-sparse, used to
speed up kernel computations (default is 1 when percentage of non-zero
features is less than 5%, and 0 when percentage is larger than 5%)
o - if set to 1, the output file will contain not only the class predictions,
but also tab-delimited scores of the winning class (0)
v - verbose output; 1 to show algorithm steps, 0 for quiet mode (0)
--------------------------------------------
When the '-o' option is set, the scores should be interpreted as follows. For
LLSVM, the score represents the distance of the test example from the separating
hyperplane; for AMM and BSGD, the score represents the difference between the
winning-class score and the score of the second-best class.
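In other words, for AMM and BSGD the reported value is a margin over per-class scores. A minimal Python sketch of this interpretation (illustrative only; BudgetedSVM computes the score internally):

```python
# Sketch: the '-o' score for AMM/BSGD is the gap between the best and
# second-best class scores -- larger gaps suggest more confident predictions.

def winning_margin(class_scores):
    ordered = sorted(class_scores, reverse=True)
    return ordered[0] - ordered[1]

# e.g. scores for three classes:
assert winning_margin([2.0, 0.5, 1.25]) == 0.75
```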
Examples
========
Here is a simple example of how to train and test a classifier on the provided adult9a data set,
after the budgetedsvm-train and budgetedsvm-predict programs have been compiled by running 'make'.
Note that the train and predict programs are created in the "./bin" folder, which is
why we prepend "bin/" to the calls to the programs. If the programs are run on Windows,
use "\" (back-slash) instead of "/" (forward-slash) when specifying the path
to the programs in the command prompt. In all examples below, the algorithms should achieve
an error rate of roughly 15%.
How to train and test AMM or GAMM:
----------------------------------
>> bin/budgetedsvm-train -A 1 -e 5 -L 0.001 -B 20 -D 123 -v 1 -k 10000 -c 10 a9a_train.txt a9a_model.txt
>> bin/budgetedsvm-predict -v 1 a9a_test.txt a9a_model.txt a9a_preds.txt
The first command uses the AMM batch ("-A 1") algorithm to train a multi-hyperplane machine for 5 epochs ("-e 5"),
using a regularization parameter lambda of 0.001 ("-L 0.001", larger values result in a less complex model, or,
in other words, more regularization) and setting the maximum number of weights per class to 20 ("-B 20").
As the adult9a data set is of dimensionality 123, we also write "-D 123", and choose verbose output ("-v 1"),
which prints detailed steps of the algorithm. Lastly, we specify that pruning of weights will be performed
every 10,000 iterations ("-k 10000", smaller values result in more aggressive pruning), and the pruning
parameter is set to 10 ("-c 10", larger values result in more aggressive pruning). If you do not specify
a name for the model file, it will be created by appending the suffix '.model' to the filename of the
training file (note that we did include the model filename in the above example, namely 'a9a_model.txt').
The second command tests the model on the testing data set, and prints the accuracy on the testing set while
saving the predictions to 'a9a_preds.txt'. We also set verbose output by writing "-v 1".
If you would like to train a GAMM model, you can do so through the additional parameter "-C".
For example, "-C 0.2" sets the cloning probability to 0.2, while the cloning probability
decay defaults to 0.99.
How to train and test LLSVM:
----------------------------
>> bin/budgetedsvm-train -A 3 -L 0.1 -K 0 -g 0.01 -B 100 -m 1 -D 123 -v 1 a9a_train.txt a9a_model.txt
>> bin/budgetedsvm-predict -v 1 a9a_test.txt a9a_model.txt a9a_predictions.txt
The first command uses the LLSVM ("-A 3") algorithm to train a classification model, setting the
regularization parameter to 0.1 ("-L 0.1", larger values result in a less complex model, or,
in other words, more regularization), which results in higher regularization than in
the AMM case described above. We use the Gaussian kernel ("-K 0") with kernel width 0.01 ("-g 0.01").
With the "-B 100" option we set the budget, specifying that the model will comprise 100 landmark
points, which will be chosen by running k-means on the loaded training data ("-m 1"). As the adult9a
data set is of dimensionality 123, we also write "-D 123", and choose verbose output ("-v 1"),
which prints detailed steps of the algorithm. If you do not specify a name for the model file,
it will be created by appending the suffix '.model' to the filename of the training file
(note that we did include the model filename in the above example, namely 'a9a_model.txt').
The second command evaluates the model on testing data set, and prints the accuracy on the
testing set while saving the predictions to 'a9a_predictions.txt'.
How to train and test BSGD:
---------------------------
>> bin/budgetedsvm-train -A 4 -g 0.01 -e 5 -L 0.0001 -B 200 -m 1 -D 123 -v 1 a9a_train.txt a9a_model.txt
>> bin/budgetedsvm-predict -v 1 a9a_test.txt a9a_model.txt a9a_predictions.txt
The first command uses the BSGD ("-A 4") algorithm to train a classification model for 5 epochs
("-e 5"), using a regularization parameter lambda of 0.0001 ("-L 0.0001", larger values result in a less
complex model, or, in other words, more regularization) and a Gaussian kernel width of 0.01 ("-g 0.01").
With the "-B 200" option we specify that the model will comprise 200 support vectors,
and in the case of a budget overflow two support vectors will be merged ("-m 1") to maintain
the budget. As the adult9a data set is of dimensionality 123, we also write "-D 123", and choose
verbose output ("-v 1"), which prints detailed steps of the algorithm. If you do not specify a name
for the model file, it will be created by appending the suffix '.model' to the filename of the
training file (note that we did include the model filename in the above example, namely 'a9a_model.txt').
The second command evaluates the model on testing data set, and prints the accuracy
on the testing set while saving the predictions to 'a9a_predictions.txt'.
Library Usage
=============
See the "./doc/BudgetedSVM_reference_manual.pdf" or open "./doc/html/index.html" in your browser
for details about the implementation.
Matlab/Octave Interface
=======================
Please check the README.txt file in the "./matlab" directory for more information.
Additional Information
======================
The toolbox was written by Nemanja Djuric, Liang Lan, and Slobodan Vucetic
from the Department of Computer and Information Sciences, Temple University,
together with Zhuang Wang from IBM Global Business Services.
BudgetedSVM webpage is located at http://www.dabi.temple.edu/budgetedsvm/.
If you found our work useful, please cite us as:
Djuric, N., Lan, L., Vucetic, S., & Wang, Z. (2013). BudgetedSVM: A Toolbox for Scalable
SVM Approximations. Journal of Machine Learning Research, 14, 3813-3817.
For any questions or comments, please contact Nemanja Djuric at <[email protected]>.
Last updated: June 21st, 2020
Acknowledgments
===============
This work was supported by the National Science Foundation via the grants IIS-0546155 and IIS-1117433.