about hdf5 #11

Open

shawnthu opened this issue May 13, 2019 · 3 comments

Comments

@shawnthu

I have several hundred GB of wav files on my disk (about 1,000 hours of audio). I found that reading the wav files directly is too slow for training, so I am considering LMDB and HDF5 as alternatives. However, I found that
HDF5 does not support concurrent reads, i.e. num_workers in DataLoader cannot be more than 1. How do you solve this problem? thx
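
A common workaround (not confirmed in this thread) is to open the HDF5 file lazily inside each worker, so that every worker process gets its own file handle instead of sharing one across the fork. A minimal sketch, assuming a hypothetical file features.h5 whose keys are utterance IDs:

import h5py
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    def __init__(self, h5_path, keys):
        self.h5_path = h5_path
        self.keys = keys
        self.h5 = None  # opened lazily, once per worker process

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        # Each worker opens its own read-only handle on first access,
        # so no handle is ever shared across fork boundaries.
        if self.h5 is None:
            self.h5 = h5py.File(self.h5_path, 'r')
        return self.h5[self.keys[idx]][()]

# used as: torch.utils.data.DataLoader(H5Dataset('features.h5', keys), num_workers=4)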

@shawnthu
Author

https://pytorch.org/audio/datasets.html#yesno
torchaudio lists two example datasets, but they are very small, so they can be loaded into memory directly. That approach does not work for large datasets that cannot fit into memory!
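
For reference, the torchaudio pattern being referred to looks roughly like this (a sketch; the exact return tuple varies across torchaudio versions):

import torchaudio

# YESNO is a tiny corpus (~60 short recordings), so torchaudio can
# afford to download and decode it eagerly; this does not scale to
# 1,000 hours of audio.
yesno = torchaudio.datasets.YESNO('.', download=True)
waveform, sample_rate, labels = yesno[0]  # return tuple may vary by version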

@shawnthu
Author

shawnthu commented May 13, 2019

Besides, I found an absurd phenomenon. All my wav files are under the /wav folder, and I have a read-wav function like this:

from scipy.io import wavfile
from torch.utils.data import Dataset, DataLoader

def read_wav(wav_path):
    # Decode one wav file from disk; returns the raw sample array.
    rate, data = wavfile.read(wav_path)
    return data

class Dst(Dataset):
    def __init__(self, wav_path_list):
        self.wav_path_list = wav_path_list

    def __len__(self):
        return len(self.wav_path_list)

    def __getitem__(self, idx):
        return read_wav(self.wav_path_list[idx])

dst = Dst(wav_path_list)  # wav_path_list: paths of the files under /wav
loader = DataLoader(dst, batch_size=batch_size, shuffle=True,
                    num_workers=num_workers)

In fact, when I increase num_workers from 0 to 4 (my workstation has 8 CPUs), the speed does not
change! It looks like the read_wav function already occupies all the CPU cores, so adding workers brings no speedup.
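
One way to verify whether the workers help at all is to time one pass over the dataset at several num_workers settings. A rough sketch under the same setup (collate_fn=list sidesteps default batching, since the wavs have variable length):

import time
from torch.utils.data import DataLoader

# Time one epoch for each num_workers value; if decoding is the
# bottleneck and the workers are effective, wall time should drop
# as workers increase.
for n in (0, 1, 2, 4):
    loader = DataLoader(dst, batch_size=8, shuffle=False,
                        num_workers=n, collate_fn=list)
    start = time.time()
    for batch in loader:
        pass
    print('num_workers=%d: %.1fs' % (n, time.time() - start))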
