Data Acquisition for Generalizing Closed-Box Models

Thesis Link: https://hdl.handle.net/10315/41870

All scripts needed to reproduce the experimental results are underlined, and help for each script's parameters can be found in its main entry point.

Set up

Under the current directory, three folders need to be created to hold the outputs of the experiments.

log: keeps all output from the logger.
- config.yaml: settings for model training, detector training, data splits, and data shifts. The config.yaml under the config folder is a template. Configs for all tasks are also in the config folder.
model: keeps new models, acquired data, and updated detectors.
available data: the data directory contains the indices of the designed data shifts over the raw datasets.
- Raw datasets:
1. CIFAR-100: set the "root" field of the config file to the path of the downloaded zip.
2. Core-50: the images in the original "core50_imgs.npz" file are resized to 32x32, and 30 frames are sampled from each object and session. The resulting dataset is called "core.pkl".

model_directory, in the format (dataset name) _ task _ (other info), is used to name all related outputs in log, model, and data.

Examples of model_directory: core_object_resnet, cifar-4class
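
The folder layout described above can be prepared with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name prepare_output_dirs is hypothetical, and it assumes that each of the three folders (log, model, data) holds one subfolder per model_directory, which the repository may organize differently.

```python
import tempfile
from pathlib import Path


def prepare_output_dirs(base_dir, model_directory):
    """Create log/, model/, and data/ under base_dir and return per-run paths.

    Assumption: each run's outputs live in a subfolder named after
    model_directory; this layout is illustrative only.
    """
    paths = {}
    for folder in ("log", "model", "data"):
        run_dir = Path(base_dir) / folder / model_directory
        run_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        paths[folder] = run_dir
    return paths


if __name__ == "__main__":
    # Demo in a temporary directory so nothing is written to the working tree.
    with tempfile.TemporaryDirectory() as tmp:
        for name, path in prepare_output_dirs(tmp, "core_object_resnet").items():
            print(f"{name}: {path}")
```

Passing the same model_directory to every script keeps the logs, models, and data indices of one experiment under matching names.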

Set up Source Models

pretrain.py: train source models and evaluate them before and after data shifts.

Data Acquisition + New Model Generation

utils/strategy.py: acquisition strategies + workspace (source model, data splits, and a detector if needed).

utils/ood.py: filters irrelevant data out of the data pool, builds a novelty detector (one-class SVM) on the validation set, and returns the detection accuracy.

The filtering precision is obtained by test/ood.py.
If the precision is greater than the threshold (the proportion of target labels in the filtered market), then market filtering should work.
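
The precision check described above can be sketched as follows. This is an illustration only, not the code in test/ood.py: the function names and the toy labels are hypothetical, precision is taken here to be the fraction of kept samples whose label is a target label, and the threshold is supplied by the caller.

```python
def filtering_precision(kept_labels, target_labels):
    """Fraction of samples kept by the filter whose label is a target label."""
    if not kept_labels:
        return 0.0  # nothing was kept, so there is no precision to speak of
    targets = set(target_labels)
    hits = sum(1 for label in kept_labels if label in targets)
    return hits / len(kept_labels)


def filtering_should_work(precision, threshold):
    """Decision rule from the text: precision must be strictly above the threshold."""
    return precision > threshold


if __name__ == "__main__":
    kept = [0, 1, 1, 7, 0, 1]  # toy labels of samples that passed the filter
    precision = filtering_precision(kept, target_labels=[0, 1])
    print(precision, filtering_should_work(precision, threshold=0.5))
```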

main.py: runs the acquisition.

model_train.py: builds a model from the acquired data.

Test New Model

utils/checker.py: model ensemble methods

test.py: evaluates models after acquisition.

Acquisition Statistics

pretrain.py: base model performance before and after data shifts; classification accuracy of the detector on c_w and c_\not{w}.

seq_stat.py: runs only the sequential strategy and returns statistics on detection accuracy or misclassification in the acquired data.

stat_check.py: the distribution of the acquired data, of the test data, and of the final detector from sequential acquisition.

data_valuation.py: examines the relation between U-WS and U-WSD.

threshold.py: training, testing, and statistical analysis for the experiments on the utility threshold.

Invalid Acquired Data

Under optimal_check, pretrain.py builds optimal models, while stat_check.py and threshold.py return statistics on invalid acquired data.

Dataset Generation

To fully reproduce the experimental results, it is recommended to use the existing data splits in data, as the random seed used to generate this folder is fixed explicitly.

Shifted data splits are generated in two steps: the raw dataset is first split into 4 splits (train, test, validation, and data pool), and data shifts are then created by removing some labels from the train split. The indices of the 4 data splits and the statistics for data normalization are saved into the init_data directory first. The data shift indices are then saved into the data directory.

data_setup.py: use the "save mode" parameter to choose which indices to save (split or shift).
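
The two-step procedure above can be sketched as follows. This is a minimal sketch under stated assumptions, not the logic of data_setup.py: the function names, the split ratios, the seed, and the removed labels are all hypothetical, and only the standard library is used.

```python
import random


def split_indices(n_samples, ratios, seed):
    """Shuffle sample indices and cut them into train/test/val/pool splits."""
    names = ("train", "test", "val", "pool")
    if len(ratios) != len(names) or abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must be four fractions summing to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed -> reproducible splits
    splits, start = {}, 0
    for i, name in enumerate(names):
        # The last split takes the remainder so every index is used exactly once.
        end = n_samples if i == len(names) - 1 else start + int(ratios[i] * n_samples)
        splits[name] = indices[start:end]
        start = end
    return splits


def shift_train(splits, labels, removed_labels):
    """Create a data shift by dropping the given labels from the train split only."""
    removed = set(removed_labels)
    shifted = dict(splits)  # the other three splits are left untouched
    shifted["train"] = [i for i in splits["train"] if labels[i] not in removed]
    return shifted


if __name__ == "__main__":
    labels = [i % 5 for i in range(100)]  # toy labels, 5 classes
    splits = split_indices(len(labels), (0.4, 0.2, 0.2, 0.2), seed=0)
    shifted = shift_train(splits, labels, removed_labels=[3, 4])
    print({name: len(idx) for name, idx in shifted.items()})
```

Saving only indices, as the text describes, keeps the stored splits small and lets the same raw dataset back several different shifts.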

Preliminary processing of Core-50 can be found in basic_process.

core.ipynb: 'core50_imgs.npz' -> resize to 32x32 and transform labels -> 'core_data.pkl'
- An auxiliary file is used to extract labels from "core50_imgs.npz".
meta.py: samples frames from the indicated categories of Core-50 ('core_data.pkl' -> sample frames -> 'core.pkl').
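
Per-group frame sampling of the kind described above can be sketched as follows. This is an illustration under assumptions, not the code in meta.py: the function name, the (category, object, session) record layout, and the seed are hypothetical.

```python
import random
from collections import defaultdict


def sample_frames(records, categories, frames_per_group, seed=0):
    """Sample up to frames_per_group frame indices per (category, object, session).

    records: list of (category, object_id, session_id) tuples, one per frame.
    Only frames whose category is listed in `categories` are considered.
    """
    wanted = set(categories)
    groups = defaultdict(list)
    for idx, (category, obj, session) in enumerate(records):
        if category in wanted:
            groups[(category, obj, session)].append(idx)
    rng = random.Random(seed)
    sampled = []
    for key in sorted(groups):  # sorted keys keep the result deterministic
        frames = groups[key]
        k = min(frames_per_group, len(frames))  # small groups are kept whole
        sampled.extend(sorted(rng.sample(frames, k)))
    return sampled


if __name__ == "__main__":
    # Toy metadata: 2 objects x 3 sessions x 10 frames of one category.
    records = [("cup", o, s) for o in range(2) for s in range(3) for _ in range(10)]
    print(len(sample_frames(records, ["cup"], frames_per_group=4)))
```

The returned indices can then be used to select the matching images and labels before writing the sampled dataset to disk.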

Visualization

All the figures reported in the paper are generated by visualization/plot.ipynb.

t07902301/data-acquisition

Folders and files

Latest commit

History

Repository files navigation

Data Acquisition for Generalizing Closed-Box Models

Set up

Set up Source Models

Data Acquisition + New Model Generation

Test New Model

Acquisition Statistics

Invalid Acquired Data

Dataset Generation

Visualization

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages