Thesis Link: https://hdl.handle.net/10315/41870
All code needed to reproduce the experiment results is marked with an underline, and help for the execution parameters of each script can be found in its main entry.
Under the current directory, three folders need to be created to keep outputs from experiments.
- log: keeps all output from the logger.
- config.yaml: configures model training, detector training, and data split and shift. config.yaml under the config folder is a template; configs for all tasks are also in the config folder (a loading sketch follows this list).
- model: keeps new models, acquired data, and updated detectors.
- data: available data; this directory contains the indices of the designed data shifts from the raw datasets.
- Raw datasets:
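As a rough illustration of how such a config might be consumed, here is a minimal loading sketch with PyYAML; the keys shown (model, detector, data) are placeholders, not the repository's actual schema.

```python
# Minimal sketch of loading config.yaml; the keys below are hypothetical
# placeholders, not the repository's actual schema.
import yaml

with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

model_cfg = config.get("model", {})        # e.g. architecture, epochs, lr
detector_cfg = config.get("detector", {})  # e.g. one-class SVM settings
data_cfg = config.get("data", {})          # e.g. split sizes, shifted labels
```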
model_directory, in the format (dataset name)_(task)_(other info), is used to name all relevant outputs in log, model, and data.
Examples of model_directory: core_object_resnet, cifar-4class
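For illustration, a hypothetical helper that follows this convention (the actual code may assemble these names differently):

```python
# Hypothetical helper illustrating the (dataset name)_(task)_(other info)
# convention; the repository may construct these names differently.
def make_model_directory(dataset: str, task: str, other: str = "") -> str:
    parts = [dataset, task]
    if other:
        parts.append(other)
    return "_".join(parts)

print(make_model_directory("core", "object", "resnet"))  # core_object_resnet
```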
pretrain.py: train source models and evaluate them before and after data shifts.
utils/strategy.py: acquisition strategy + workspace (source model, data splits, and detector if needed).
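As a generic sketch only (the strategies implemented in utils/strategy.py may score and select data differently), a pool-based acquisition step built on least-confidence sampling could look like:

```python
# Generic pool-based acquisition sketch using least-confidence sampling;
# assumes a fitted model with predict_proba and pre-extracted pool features.
import numpy as np

def acquire(model, pool_features, budget):
    # Score each pool sample by predictive uncertainty (least confidence).
    probs = model.predict_proba(pool_features)
    uncertainty = 1.0 - probs.max(axis=1)
    # Return the indices of the `budget` most uncertain samples.
    return np.argsort(uncertainty)[-budget:]
```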
utils/ood.py: filter out irrelevant data from the data pool; build a novelty detector (one-class SVM) on the validation set, and return the detection accuracy (see the sketch after these notes).
- The filtering precision is obtained by test/ood.py.
- If the precision (the proportion of target labels in the filtered market) is greater than the threshold, then market filtering should work.
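A minimal sketch of this filtering step, assuming pre-extracted features; the function names and the nu/kernel settings are illustrative, not the exact utils/ood.py implementation.

```python
# Illustrative sketch of market filtering with a one-class SVM; function
# names and hyperparameters are assumptions, not the repository's code.
import numpy as np
from sklearn.svm import OneClassSVM

def build_detector(val_features, nu=0.1):
    # Fit the novelty detector on validation-set features of known classes.
    return OneClassSVM(nu=nu, kernel="rbf", gamma="scale").fit(val_features)

def filtering_precision(detector, market_features, is_target):
    # predict() returns +1 for inliers; those form the filtered market.
    kept = detector.predict(market_features) == 1
    # Precision: fraction of kept samples whose labels are target labels.
    return float(np.mean(is_target[kept]))

# Filtering is judged to work when this precision exceeds the threshold:
# filtering_precision(det, X_market, y_is_target) > threshold
```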
main.py: run the acquisition.
model_train.py: build a model from acquired data.
utils/checker.py: model ensemble methods.
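For instance, a simple probability-averaging ensemble, given as one plausible method only (the actual methods in utils/checker.py may differ):

```python
# Probability-averaging ensemble sketch; one plausible ensemble method,
# not necessarily the ones implemented in utils/checker.py.
import numpy as np

def ensemble_predict(models, features):
    # Average class probabilities across member models, then take argmax.
    probs = np.mean([m.predict_proba(features) for m in models], axis=0)
    return probs.argmax(axis=1)
```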
test.py: evaluate models after acquisition.
pretrain.py: base model performance before and after data shifts; classification accuracy on c_w and c_{\neg w} in the detector.
seq_stat.py: run only the sequential strategy and return statistics of detection accuracy or misclassification in the acquired data.
stat_check.py: the distribution of acquired data and test data, and the final detector from sequential acquisition.
data_valuation.py: examine the relation between U-WS and U-WSD.
threshold.py: training, testing, and statistical analysis of experiments on the utility threshold.
Under optimal_check, pretrain.py builds optimal models, while stat_check.py and threshold.py return statistics on invalid acquired data.
- For a full reproduction of the experiment results, it is recommended to use the current data splits in data, as the random seed used to generate this folder is explicitly fixed.
Shifted data splits are generated by first splitting the raw dataset into 4 splits (train, test, validation, and data pool), and then creating data shifts by removing some labels from the train split. We first save the indices of the 4 data splits and the statistics for data normalization into the init_data directory, and then save the data shift indices into the data directory.
data_setup.py: use the "save mode" parameter to choose which indices to save (split or shift).
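A sketch of this split-then-shift procedure with an explicitly fixed seed; the equal 4-way split and the removed labels are illustrative, not the exact data_setup.py logic.

```python
# Split-then-shift sketch with a fixed random seed; split proportions and
# removed labels are illustrative, not the exact data_setup.py behavior.
import numpy as np

def split_and_shift(labels, removed_labels, seed=0):
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)            # fixed seed for reproducibility
    idx = rng.permutation(len(labels))
    train, test, val, pool = np.array_split(idx, 4)  # 4 raw splits
    # Data shift: drop the chosen labels from the train split only.
    keep = ~np.isin(labels[train], removed_labels)
    return train[keep], test, val, pool
```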
Preliminary processing of Core-50 can be found in Basic Process.
- core.ipynb: 'core50_imgs.npz' -> resize to 32x32 and transform labels -> 'core_data.pkl' (a sketch follows this list).
  - An auxiliary file is used to extract labels from 'core50_imgs.npz'.
- meta.py: sample frames from the indicated categories of Core-50 ('core_data.pkl' -> sample frames -> 'core.pkl').
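A minimal sketch of the resize step in core.ipynb; the npz key ('x') is an assumption, and the label transform (read from the auxiliary file) is omitted.

```python
# Resize sketch for Core-50 frames: 'core50_imgs.npz' -> 32x32 arrays ->
# 'core_data.pkl'. The npz key 'x' is assumed; label handling is omitted.
import pickle
import numpy as np
from PIL import Image

imgs = np.load("core50_imgs.npz")["x"]  # assumed key for the image array
small = np.stack([np.asarray(Image.fromarray(im).resize((32, 32)))
                  for im in imgs])
with open("core_data.pkl", "wb") as f:
    pickle.dump(small, f)
```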
All the figures reported in the paper are generated by visualization/plot.ipynb.