/**
* Copyright [2012-2014] PayPal Software Foundation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
Shifu Change Log
Changes for Shifu-0.10.5
* Optimize IndependentTreeModel by decreasing model memory to 70% and CPU time to 90%
* Upgrade Guagua to 0.7.0 to fix a bug on empty gzip files in one worker
Changes for Shifu-0.10.4
* Optimize IndependentTreeModel by splitting regression and classification.
* Add new version of fast correlation computing.
Changes for Shifu-0.10.3
* Fix GBT SLA Categorical Feature Rebin Delimiter Issue: change delimiter to '@^'
Changes for Shifu-0.10.2
* Fix GBT SLA issue: pre-parse double types only once.
Changes for Shifu-0.10.1
* Fix a major bug in 'baggingWithReplacement':
https://github.com/ShifuML/shifu/issues/335
Changes for Shifu-0.10.0
* Tree Ensemble Model Improvement
a) Speed GBT Training
https://github.com/ShifuML/shifu/issues/252
b) Auto Skip Features with only One Bin
https://github.com/ShifuML/shifu/issues/276
c) Convert GBT Regression Score To Probability
https://github.com/ShifuML/shifu/issues/254
d) Add Early Stop Feature for GBT
https://github.com/ShifuML/shifu/issues/230
e) By Default Disable Tmp Model Output in NN and GBT, RF
https://github.com/ShifuML/shifu/issues/231
f) GBT & RF PMML Support
https://github.com/ShifuML/shifu/issues/232
g) Grid Search: Compute validation error on latest 10 or 20 iterations
https://github.com/ShifuML/shifu/issues/233
h) Missing Value Processing in Tree Model
https://github.com/ShifuML/shifu/issues/239
i) Make Tree Model Without Dependency
https://github.com/ShifuML/shifu/issues/253
j) Compress Tree Model by Gzip to Save Size
https://github.com/ShifuML/shifu/issues/272
* Train Step Improvement
a) Sampling Logic Change in Training
https://github.com/ShifuML/shifu/issues/310
b) Add Stratified Sampling in Training Step
https://github.com/ShifuML/shifu/issues/311
c) Add Cross Validation in Train Step
https://github.com/ShifuML/shifu/issues/312
d) Guagua Job Failed Improvement
https://github.com/ShifuML/shifu/issues/237
e) Disable Tmp Model Output in NN and GBT, RF
https://github.com/ShifuML/shifu/issues/231
f) Support Redo Training Without Weight after Weighted Norm
https://github.com/ShifuML/shifu/issues/315
* VarSel Step Improvement
a) Refine VarSel Configurations
https://github.com/ShifuML/shifu/issues/262
b) Enable Multiple Threading in Sensitivity Analysis
https://github.com/ShifuML/shifu/issues/213
c) Add Feature Importance for Tree Model VarSelect
https://github.com/ShifuML/shifu/issues/218
* Stats Step Improvement
a) Add More Stats in Stats Step
https://github.com/ShifuML/shifu/issues/313
b) Change Distinct Count Computing from Init to Stats
https://github.com/ShifuML/shifu/issues/314
c) Bugs & Others
1) Default meta/categorical file support
2) Bug in stats on bad feature type
3) Add more stats on each MR job, such as the number of filtered records.
* Others
a) Eval Step Improvement
https://github.com/ShifuML/shifu/issues/150
b) CSV Format File Support
https://github.com/ShifuML/shifu/issues/258
c) Combo Model Training (Beta)
https://github.com/ShifuML/shifu/issues/316
Changes for Shifu-0.9.0
* Random Forest Enhancement
a) RF & GBDT Sort Categorical Features
https://github.com/ShifuML/shifu/issues/203
b) RF & GBDT Categorical Variables Unsorted Supported
https://github.com/ShifuML/shifu/issues/202
* Gradient Boosted Trees Enhancement
a) Master Fail Over
https://github.com/ShifuML/shifu/issues/227
b) GBT Support Continuous Model Training
https://github.com/ShifuML/shifu/issues/222
c) RF & GBDT Sort Categorical Features
https://github.com/ShifuML/shifu/issues/203
d) RF & GBDT Categorical Variables Unsorted Supported
https://github.com/ShifuML/shifu/issues/202
* Grid Search Support
a) NN Grid Search
https://github.com/ShifuML/shifu/issues/214
b) RF & GBDT Grid Search
https://github.com/ShifuML/shifu/issues/213
* Random Search Support
https://github.com/ShifuML/shifu/issues/234
* Multiple Classification Enhancement
a) Add Random Forest Multiple Classification
https://github.com/ShifuML/shifu/issues/235
b) Add OneVSAll Multiple Classification for NN, RF and GBDT
https://github.com/ShifuML/shifu/issues/209
* Dynamic Binning Support
https://github.com/ShifuML/shifu/issues/236
* Others
a) https://github.com/ShifuML/shifu/issues/195
b) https://github.com/ShifuML/shifu/issues/229
Changes for Shifu-0.2.8
* Random Forest Support
a) https://github.com/ShifuML/shifu/issues/123
b) https://github.com/ShifuML/shifu/issues/122
* Gradient Boosted Trees Support
a) https://github.com/ShifuML/shifu/issues/124
b) https://github.com/ShifuML/shifu/issues/122
* Feature Importance in 'posttrain' Step
https://github.com/ShifuML/shifu/issues/180
* PSI Feature in 'stats' Step
https://github.com/ShifuML/shifu/issues/196
* Correlation Between Features in 'norm' Step
https://github.com/ShifuML/shifu/issues/146
* Others
a) https://github.com/ShifuML/shifu/issues/190
b) https://github.com/ShifuML/shifu/issues/181
c) https://github.com/ShifuML/shifu/issues/179
d) https://github.com/ShifuML/shifu/issues/178
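The PSI (population stability index) feature listed above for the 'stats' step can be illustrated with the commonly used definition, PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the shares of records in bin i for the expected and actual populations. The following is a minimal Python sketch under that assumption; the function name psi, the eps floor, and the sample counts are illustrative and are not taken from Shifu's code.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two binned distributions.

    Both arguments hold per-bin record counts over the same bins.
    eps is an assumed floor that avoids log(0) for empty bins.
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both inputs must use the same bins")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


if __name__ == "__main__":
    # Hypothetical bin counts for a baseline and a later data set.
    print(psi([100, 200, 300], [120, 180, 300]))
```

Identical distributions give a PSI of zero, and the value grows as the two populations diverge.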
Changes for Shifu-0.2.7
* Sampling Function Improvement
a) https://github.com/ShifuML/shifu/issues/93
b) https://github.com/ShifuML/shifu/issues/140
* Binning Improvement
a) https://github.com/ShifuML/shifu/issues/148
b) https://github.com/ShifuML/shifu/issues/157
* Stats Step Improvement
a) https://github.com/ShifuML/shifu/issues/155
b) https://github.com/ShifuML/shifu/issues/137
c) https://github.com/ShifuML/shifu/issues/75
* Norm Step Improvement
a) https://github.com/ShifuML/shifu/issues/103
b) https://github.com/ShifuML/shifu/issues/120
c) https://github.com/ShifuML/shifu/issues/131
d) https://github.com/ShifuML/shifu/issues/142
* Train Step Improvement
a) https://github.com/ShifuML/shifu/issues/66
b) https://github.com/ShifuML/shifu/issues/159
c) https://github.com/ShifuML/shifu/issues/166
d) https://github.com/ShifuML/shifu/issues/106
* Variable Selection Step Improvement
a) https://github.com/ShifuML/shifu/issues/57
b) https://github.com/ShifuML/shifu/issues/102
* Distributed LR Algorithm Improvement (Experimental)
a) https://github.com/ShifuML/shifu/issues/56
* Multiple classes NN Algorithm Improvement (Experimental)
a) https://github.com/ShifuML/shifu/issues/149
* Pig on Tez Support
Changes for Shifu-0.2.6
* https://github.com/ShifuML/shifu/issues/133: Add skewness and kurtosis stats
* https://github.com/ShifuML/shifu/issues/134: Add CSV ColumnConfig Format for ColumnConfig.json
* https://github.com/ShifuML/shifu/issues/117: Add AUC Computation on Eval Step
* https://github.com/ShifuML/shifu/issues/118: Add Shortcut Commands: 'norm', 'varsel'
* https://github.com/ShifuML/shifu/issues/127: Support HDP 2.6.0.2.2.4.2-2
* https://github.com/ShifuML/shifu/issues/83: Add Distinct Count Statistics
* https://github.com/ShifuML/shifu/issues/82: Auto-detect Variable Type
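The skewness and kurtosis stats added under issue 133 can be illustrated with the standard moment-based definitions: skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, where mk is the k-th central moment. The following is a minimal Python sketch under that assumption; whether Shifu uses the population or sample form, or reports excess kurtosis, is not stated in this change log, and the function name is hypothetical.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, kurtosis) using population central moments.

    Kurtosis here is the plain fourth standardized moment,
    so a normal distribution would score about 3.
    """
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness and kurtosis are undefined for constant data")
    return m3 / m2 ** 1.5, m4 / m2 ** 2


if __name__ == "__main__":
    print(skewness_and_kurtosis([1, 2, 3, 4, 5]))
```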
Changes for Shifu-0.2.5
* https://github.com/ShifuML/shifu/issues/97: Upgrade Guagua to latest version 0.7.0.
a) New features included in Guagua 0.6.0 to continuously improve the performance of Shifu:
1) 'out-of-core' list to support workers scaling out from memory to disk.
2) Netty-based coordinators to decrease dependency on zookeeper and improve iteration communication performance.
3) Embedded zookeeper server supported not only in the client as a thread, but also in the master node as a process.
b) One important feature included in Guagua 0.7.0 to accelerate training in Shifu:
1) Partial-complete feature: in each iteration the master waits for only a portion of the workers to complete
and ignores the results of straggler workers.
* https://github.com/ShifuML/shifu/issues/105: SPDT stats performance improvement.
a) 'binningAlgorithm=SPDTI' (default value) in ModelConfig.json#stats improves scalability for big data.
This solution is based on the SPDT binning algorithm and is called SPDT-Improvement (SPDTI).
b) Using SPDTI, with 20 million records and 1600 variables, stats finishes in 20 minutes. With 100 million
records and 1600 variables, stats finishes in 30 minutes.
* https://github.com/ShifuML/shifu/issues/59: Shifu eval confusion and performance improvement.
a) With 20 million records and 1600 variables, the eval step finishes in 13 minutes, compared with 20 minutes in
Shifu 0.2.4.
* https://github.com/ShifuML/shifu/issues/64: Set the Hadoop parallel number automatically.
a) As the input data set grows, users no longer need to set 'hadoopParallelNumber' in shifuconfig.
b) This value is now tuned automatically by Shifu.
* Binning improvement
a) https://github.com/ShifuML/shifu/issues/77: Add missing value count as a bin.
b) https://github.com/ShifuML/shifu/issues/79: Add weights to binning.
c) https://github.com/ShifuML/shifu/issues/80: Weights binning KS/IV/WoE computing.
* https://github.com/ShifuML/shifu/issues/72: Support WoE transformation when doing normalization
* Training step improvement
a) https://github.com/ShifuML/shifu/issues/95: NN doesn't support 0 hidden layer.
b) https://github.com/ShifuML/shifu/issues/76: Add convergence parameter to Shifu d-train.
c) https://github.com/ShifuML/shifu/issues/84: Add local disk support to scale in-memory data set.
d) https://github.com/ShifuML/shifu/issues/60: Continuous model training.
e) https://github.com/ShifuML/shifu/issues/85: Add 'epochsPerIteration' parameter in NNWorker.
* Bug fix:
a) https://github.com/ShifuML/shifu/issues/98
b) https://github.com/ShifuML/shifu/issues/92
c) https://github.com/ShifuML/shifu/issues/70
d) https://github.com/ShifuML/shifu/issues/69
e) https://github.com/ShifuML/shifu/issues/67
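The weighted binning items above (issues 79 and 80) refer to KS, IV and WoE computed per bin. The following is a minimal Python sketch using the common textbook definitions, where the per-bin inputs are sums of record weights rather than raw counts. The sign convention for WoE (positive share over negative share), the eps floor, and the function name are assumptions and may differ from what Shifu implements.

```python
import math


def weighted_woe_iv_ks(pos_weights, neg_weights, eps=1e-6):
    """Return (woe_per_bin, iv, ks) from per-bin weighted positive/negative sums."""
    if len(pos_weights) != len(neg_weights):
        raise ValueError("both inputs must use the same bins")
    pos_total = sum(pos_weights)
    neg_total = sum(neg_weights)
    woe, iv, ks = [], 0.0, 0.0
    cum_pos = cum_neg = 0.0
    for p, n in zip(pos_weights, neg_weights):
        # Floor the shares so that an empty bin does not produce log(0).
        p_pct = max(p / pos_total, eps)
        n_pct = max(n / neg_total, eps)
        w = math.log(p_pct / n_pct)
        woe.append(w)
        iv += (p_pct - n_pct) * w
        # KS is the largest gap between the two cumulative distributions.
        cum_pos += p / pos_total
        cum_neg += n / neg_total
        ks = max(ks, abs(cum_pos - cum_neg))
    return woe, iv, ks


if __name__ == "__main__":
    # Hypothetical weighted sums for two bins.
    print(weighted_woe_iv_ks([10.0, 30.0], [30.0, 10.0]))
```

With all record weights equal to 1, the same function gives the unweighted statistics.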
Changes for Shifu-0.2.4
* https://github.com/ShifuML/shifu/issues/20: Work flow change.
a) Old: new -> init -> stats -> varselect -> normalize -> train -> eval
b) New: new -> init -> stats -> normalize -> varselect -> train -> eval
c) To run variable selection again after a model has been trained, the new work flow does not need to repeat the normalize step; run the training step directly after variable selection.
* https://github.com/ShifuML/shifu/issues/49: Add distributed sensitivity analysis variable selection.
a) 'varSelect.wrapperEnabled=true' and 'wrapperBy=SE' in ModelConfig.json#varSelect part to enable sensitivity variable selection.
b) 'wrapperRatio' in the ModelConfig.json#varSelect part sets the percentage of variables to remove.
c) To continue variable selection by sensitivity method, run 'shifu varselect' again.
d) With 20 million records and 1600 variables, the run takes 70 minutes (45 minutes for 200 epoch training and 25 minutes for sensitivity variable selection).
* https://github.com/ShifuML/shifu/issues/38: Improve scalability in stats step.
a) 'binningAlgorithm=SPDT' (default value) in ModelConfig.json#stats computes variable statistics in a way that scales to big data.
Using SPDT, with 20 million records and 1600 variables, variable selection finishes in 50 minutes.
b) 'binningAlgorithm=MunroPat' in ModelConfig.json#stats is another approach to computing variable statistics that scales to big data.
* https://github.com/ShifuML/shifu/issues/58: Improve scalability in eval step for HDFS mode.
a) With 20 million records and 1600 variables, the eval step finishes in 20 minutes with only 1GB of driver memory.
* https://github.com/ShifuML/shifu/issues/61: Embedded zookeeper server support.
a) There is no longer a need to set zookeeper servers, since the embedded zookeeper server helps with training models.
b) For big data training, an independent zookeeper cluster is strongly recommended.
c) Upgrade Guagua to 0.5.0 to get support from Guagua for this feature.
* Add PMML standard model converter.
a) To convert .nn files into pmml, run "shifu export -t pmml" or just "shifu export" (pmml is the default).
All generated pmml files will be under <Model-Directory>/pmmls/
* Bug fix:
a) https://github.com/ShifuML/shifu/issues/45
b) https://github.com/ShifuML/shifu/issues/51
c) https://github.com/ShifuML/shifu/issues/39
d) https://github.com/ShifuML/shifu/issues/40
e) https://github.com/ShifuML/shifu/issues/45
Changes for Shifu-0.2.0
* Make Shifu support Hadoop-2.0 (add -Phdp-yarn when building)
* Show mapper progress in JobTracker and show progress in CLI when using distributed training
* A validation rate of 0% is permitted. In that case, the model is saved when the training error goes down
* Generate better default ModelConfig, and create empty files for some configuration
* Refactor integration API - add static Normalizer.normalize(), simplify constructor of ModelRunner, and allow loading models by path
* [Test] add support for decision-tree
* Enhance shifu script to make it support Hadoop1 and Hadoop2 smoothly
* Add new info for ColumnConfig: missing, total, missingPercentage, binWeightedPos and binWeightedNeg
* Update the layout of EvalPerformance.json
* Add version number in ModelConfig, ColumnConfig and EvalPerformance
Changes for Shifu-0.1.1
* Use gradient aggregation to improve distributed training model performance
* Fix the bug when sorting the model results
* Fix the bug - The sourceMetaColumnFile couldn't be read when using mapred + HDFS to run evaluation
* Hide custom paths in ModelConfig, since most users won't change them
* Add meta column names in file header, when using `mapred` to run evaluation
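The gradient aggregation mentioned above for distributed training can be pictured as follows: each worker computes a gradient on its own shard of the data, and the master sums those gradients and applies a single weight update per iteration. The following is a minimal Python sketch of that general idea for a linear model with squared error. It is an illustration under stated assumptions and not Shifu's NN code; the function names, the learning rate, and the data are hypothetical.

```python
def worker_gradient(weights, shard):
    """Sum of squared-error gradients over one worker's shard.

    shard is a list of (features, target) rows; returns (gradient, row_count).
    """
    grad = [0.0] * len(weights)
    for features, target in shard:
        error = sum(w * x for w, x in zip(weights, features)) - target
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad, len(shard)


def master_update(weights, worker_results, learning_rate=0.1):
    """Aggregate the worker gradients and apply one gradient-descent step."""
    total = [0.0] * len(weights)
    count = 0
    for grad, n in worker_results:
        total = [t + g for t, g in zip(total, grad)]
        count += n
    return [w - learning_rate * t / count for w, t in zip(weights, total)]


if __name__ == "__main__":
    data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0),
            ([1.0, 1.0], 3.0), ([2.0, 1.0], 4.0)]
    weights = [0.0, 0.0]
    # Two hypothetical workers, each holding half of the rows.
    results = [worker_gradient(weights, data[:2]),
               worker_gradient(weights, data[2:])]
    print(master_update(weights, results))
```

Because the per-shard gradients are summed before the update, splitting the data across workers gives the same step as computing the gradient over the full data set on one machine.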
Changes for Shifu-0.1.0
* Refactor the item names in ModelConfig to make it follow http://10.9.187.2/project/agreement/
* Move zookeeperServers, hadoopNumParallel, hadoopJobQueue, localNumParallel into ${SHIFU_HOME}/conf/shifuconfig
* Enable customized path for ModelSet and modelsPath,scorePath,performancePath,confusionMatrixPath in Eval
* Comment out storing normalized data when using MapReduce to run evaluation
Changes for Shifu-0.0.4
* Add distributed nn implementation based on hadoop mapreduce job.
a). To trigger distributed nn, set 'runMode' to 'pig';
b). For distributed nn, please provide your own 'zkServers' of 'train' group.
c). You can set 'epochsPerIteration', which sets how many epochs are executed in each iteration.
* Eval refactor.
a). Add -score -confmat -perf options for eval command
b). Add "scoreColumn" option in ModelConfig.json to get the target score
c). Add "modelsPath" "scorePath" "confusionMatrixPath" "performancePath" options in ModelConfig.json
d). Change "metricColumnName" to "weightedColumn"
* TA457512 - Fix the bug: the delimiter of evaluation data doesn't take effect in AKKA mode
* TA458788 - Fix the bug: Meta validation fails to report error when - "NumHiddenNodes" : [ "a", 45 ]
* TA459375 - Write in-place QuickSort to replace Collections.sort() for memory consumption
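The in-place QuickSort item above (TA459375) cites memory consumption as the motivation. The following is a minimal sketch of an in-place quicksort, written in Python for brevity; Shifu's own implementation is in Java and is not reproduced here, so the function name and the pivot choice are assumptions. The sort swaps elements inside the original list and never allocates a copy of it.

```python
def quicksort_in_place(items, lo=0, hi=None):
    """Sort items[lo..hi] in place without allocating a copy of the list."""
    if hi is None:
        hi = len(items) - 1
    while lo < hi:
        # Hoare-style partition around the middle element.
        pivot = items[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:
            while items[i] < pivot:
                i += 1
            while items[j] > pivot:
                j -= 1
            if i <= j:
                items[i], items[j] = items[j], items[i]
                i += 1
                j -= 1
        # Recurse into the smaller side and loop on the larger one,
        # which keeps the recursion depth at O(log n).
        if j - lo < hi - i:
            quicksort_in_place(items, lo, j)
            lo = i
        else:
            quicksort_in_place(items, i, hi)
            hi = j


if __name__ == "__main__":
    scores = [5, 3, 8, 1, 9, 2, 7, 3]
    quicksort_in_place(scores)
    print(scores)
```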
Changes for Shifu-0.0.3
* TA446629 - Fix the bug: when there is an empty file, Shifu in akka mode gets stuck
* TA446631 - Fix the bug: user can't use \t to split data in pig mode
* TA446678 - Fix the bug: when a user creates a new model and the model already exists, the log still shows that the model was created successfully
* TA447772 - Fix the bug: when syncing data from local to HDFS, the evaluation directories are in the wrong place
* TA449606 - Fix the bug: the filter expression logic is the opposite of the design
* TA449907 - Fix the bug: ignore those records whose value is not numerical while columnType is N in `shifu stats`
* TA449910 - Fix the bug: the fixInitInput doesn't work in model training
* TA451113 - Fix the bug: the stats calculation step consumes more memory than before
* TA455487 - Fix the bug: Shifu doesn't support /data/output/{04,05}/*/part* in Akka mode
* TA457214 - Fix the bug: if the user puts the target column into Force.Remove and Force.Select, Shifu won't detect it
* TA457490 - Fix the bug: evaluation data couldn't use a different delimiter in AKKA mode
* DE30848 - hdfs + akka mode, 4g memory for 200m data but got OOM at stats step
* DE30836 - Non-existing target column might be better to be validated at init step
* DE30915 - disable the forceSelected option but still got the validation error
* DE30916 - Adding a forceRemove file at the varselect step causes the target column to be covered by the ForceRemove flag
* DE30922 - LearningRate cannot cast int to double
* DE30467 - The old model files should be cleaned up before training.
Changes for Shifu-0.0.2
* DE29230 - Fix the bugs when the training data path is an HDFS glob path
* DE29231 - Users only need to put the configuration in the local file system
* US201443 - PathFinder refactor, split Manager class into several processes
* US207747 - Add option in ModelConfig for job queue name
* US177973 - Update code license and test data license
* Don't copy data and purify data when running `shifu init`
* Add more comments
Changes for Shifu-0.0.1
* US152414 - Refactor ModelConfig
* US195914 - Refactor ColumnConfig
* US193995 - Shut down threads if errors occur in akka mode