Hi,

For torch model training, as is standard, I'd like to split my dataset into batches of equal size, with each batch containing rows sampled uniformly from the dataset without replacement. I'd also like the data loading to run concurrently with model training. How do I accomplish this?

The `LanceDataset` with the `ShardedBatchSampler` does not accomplish this. While it is concurrent (via `batch_readahead`), it samples contiguous batches of rows and only randomizes the order of the batches. On the other hand, the model training loop in the LLM example is fully random, but not concurrent. Making the torch `DataLoader` concurrent via `num_workers` breaks Lance, as discussed here.
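To make the goal concrete, here is a rough sketch of the kind of loader I'm after: shuffle all row indices once per epoch, read each batch with `LanceDataset.take`, and hide the read latency behind a single producer thread instead of `num_workers`. The URI, column names, and prefetch depth below are placeholders, and I'm not assuming this is the recommended Lance pattern, just illustrating the behavior I want.

```python
import queue
import threading

import lance
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset


class ShuffledLanceBatches(IterableDataset):
    """Yield fully shuffled, equal-size batches from a Lance dataset,
    prefetched by a single background thread."""

    def __init__(self, uri, columns, batch_size, prefetch=4, seed=0):
        self.uri = uri
        self.columns = columns
        self.batch_size = batch_size
        self.prefetch = prefetch
        self.rng = np.random.default_rng(seed)

    def _producer(self, q):
        ds = lance.dataset(self.uri)  # open inside the producer thread
        order = self.rng.permutation(ds.count_rows())  # fresh permutation each epoch, no replacement
        for start in range(0, len(order), self.batch_size):
            idx = order[start : start + self.batch_size]
            if len(idx) < self.batch_size:  # drop the ragged final batch
                break
            # Random access by row index; returns a pyarrow Table.
            q.put(ds.take(idx.tolist(), columns=self.columns))
        q.put(None)  # sentinel: epoch finished

    def __iter__(self):
        q = queue.Queue(maxsize=self.prefetch)
        threading.Thread(target=self._producer, args=(q,), daemon=True).start()
        while (table := q.get()) is not None:
            # Assumes numeric columns; adapt the conversion to your schema.
            yield {c: torch.from_numpy(table[c].to_numpy()) for c in self.columns}


# The dataset already yields whole batches, so keep batch_size=None and
# num_workers=0; the overlap with training comes from the producer thread.
loader = DataLoader(
    ShuffledLanceBatches("data/train.lance", ["x", "y"], batch_size=256),
    batch_size=None,
    num_workers=0,
)
```

Everything stays in one process, which sidesteps the `num_workers` problem, but I don't know whether `take`-based random access is fast enough to keep a GPU fed on a large dataset, hence the question.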
Replies: 1 comment 2 replies

- I now see that an `async_dataset` may be what I need?