In Tutorial 2, we introduced how to encapsulate a vision task with the template provided in PixelSSL. The datasets inherited from the class pixelssl/task_template/data.py/TaskFunc
are fully labeled. However, the SSL algorithms in PixelSSL accept only a single semi-supervised dataloader. Therefore, we provide some dataset wrappers in the file pixelssl/nn/data.py
to preprocess the fully labeled datasets for semi-supervised learning. Currently, there are two common options:
- Split a fully labeled dataset into a labeled subset and an unlabeled subset.
  Given a fully labeled dataset, we can remove the labels of some samples and treat them as unlabeled samples. To this end, we implement a SplitUnlabeledWrapper. It requires an additional file that lists the prefixes of the labeled samples. This operation is widely used in semi-supervised learning research.
  When using this dataset wrapper, the argument sublabeled_path should be set in the script.
- Combine multiple datasets into a semi-supervised dataset.
  In practice, we may need to combine multiple datasets (both labeled and unlabeled) for semi-supervised learning. We provide a JointDatasetsWrapper, which (1) takes a list of labeled datasets and a list of unlabeled datasets as input, and (2) combines all given datasets into one large dataset. The new dataset consists of a labeled subset and an unlabeled subset.
  When using this dataset wrapper, the argument unlabeledset should be set in the script. In this case, the argument trainset contains all labeled datasets while unlabeledset contains all unlabeled datasets.
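The index bookkeeping behind the two options can be illustrated with a small, self-contained sketch. It does not use PixelSSL's own classes: the helpers split_by_prefix and join_datasets are hypothetical, and the assumption that a sample is matched by its file name without the extension is made only for illustration.

```python
import os


def split_by_prefix(sample_names, labeled_prefixes):
    """Partition sample indices into labeled and unlabeled lists.

    Assumption: a sample counts as labeled when its file name, without
    the extension, appears in `labeled_prefixes`.
    """
    labeled = set(labeled_prefixes)
    labeled_idxs, unlabeled_idxs = [], []
    for idx, name in enumerate(sample_names):
        prefix = os.path.splitext(os.path.basename(name))[0]
        if prefix in labeled:
            labeled_idxs.append(idx)
        else:
            unlabeled_idxs.append(idx)
    return labeled_idxs, unlabeled_idxs


def join_datasets(labeled_sets, unlabeled_sets):
    """Concatenate datasets and record which global indices are labeled."""
    samples, labeled_idxs, unlabeled_idxs = [], [], []
    for dataset in labeled_sets:
        for sample in dataset:
            labeled_idxs.append(len(samples))
            samples.append(sample)
    for dataset in unlabeled_sets:
        for sample in dataset:
            unlabeled_idxs.append(len(samples))
            samples.append(sample)
    return samples, labeled_idxs, unlabeled_idxs


# Option 1: split one fully labeled dataset using a list of prefixes.
names = ["img/0001.png", "img/0002.png", "img/0003.png", "img/0004.png"]
split_labeled, split_unlabeled = split_by_prefix(names, ["0001", "0003"])

# Option 2: join several labeled and unlabeled datasets into one.
joint_samples, joint_labeled, joint_unlabeled = join_datasets(
    labeled_sets=[["a0", "a1"], ["b0"]],
    unlabeled_sets=[["u0", "u1"]],
)
```

In both cases the result is a single dataset plus two disjoint index lists, which is the form a semi-supervised dataloader can consume.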
To implement a new dataset wrapper for semi-supervised learning, you should (assuming you are currently at the root path of the project):
- Create a new class inherited from the class _SSLDatasetWrapper in the file pixelssl/nn/data.py.
- Implement the dataset wrapper by referring to the existing SplitUnlabeledWrapper and JointDatasetsWrapper. The key is to divide the index list into two parts, labeled (self.labeled_idxs) and unlabeled (self.unlabeled_idxs).
- Implement the logic of calling the dataset wrapper in the function pixelssl/task_template/proxy.py/_create_dataloader. Typically, you can use the TwoStreamBatchSampler in the file pixelssl/nn/data.py to read the semi-supervised dataset.
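The steps above might look roughly like the following sketch. It is written in plain Python so that it runs on its own: SSLWrapperSketch is a hypothetical stand-in for _SSLDatasetWrapper, FirstKLabeledWrapper is an invented example wrapper, and two_stream_batches is only a simplified analogue of a two-stream batch sampler. The actual PixelSSL classes may have different constructors and behavior.

```python
import random


class SSLWrapperSketch:
    """Hypothetical stand-in for a wrapper base class (not PixelSSL's)."""

    def __init__(self, dataset):
        self.dataset = dataset
        self.labeled_idxs = []
        self.unlabeled_idxs = []

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        return self.dataset[idx]


class FirstKLabeledWrapper(SSLWrapperSketch):
    """Treat the first `num_labeled` samples as labeled, the rest as unlabeled."""

    def __init__(self, dataset, num_labeled):
        super().__init__(dataset)
        total = len(dataset)
        if not 0 < num_labeled < total:
            raise ValueError("num_labeled must be between 1 and len(dataset) - 1")
        # The key step: divide the index list into two disjoint parts.
        self.labeled_idxs = list(range(num_labeled))
        self.unlabeled_idxs = list(range(num_labeled, total))


def two_stream_batches(labeled_idxs, unlabeled_idxs, labeled_bs, unlabeled_bs, seed=0):
    """Yield index batches with a fixed number of labeled and unlabeled samples.

    One pass ends when the unlabeled indices run out; labeled indices are
    reshuffled and reused as often as needed.
    """
    rng = random.Random(seed)
    unlabeled = list(unlabeled_idxs)
    rng.shuffle(unlabeled)
    labeled_pool = []
    for start in range(0, len(unlabeled) - unlabeled_bs + 1, unlabeled_bs):
        while len(labeled_pool) < labeled_bs:
            refill = list(labeled_idxs)
            rng.shuffle(refill)
            labeled_pool.extend(refill)
        batch_labeled = labeled_pool[:labeled_bs]
        labeled_pool = labeled_pool[labeled_bs:]
        yield batch_labeled + unlabeled[start:start + unlabeled_bs]


wrapper = FirstKLabeledWrapper(list(range(10)), num_labeled=4)
batches = list(
    two_stream_batches(wrapper.labeled_idxs, wrapper.unlabeled_idxs, 2, 3)
)
```

Each batch here starts with labeled indices and ends with unlabeled ones, so a training loop can slice the batch at a fixed position to separate the two streams.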
Now you can use the new dataset wrapper in the script to create a semi-supervised dataset!