Hi,

For torch model training, as is standard, I'd like to split my dataset into batches of equal size, with each batch containing rows sampled uniformly from the dataset without replacement. I'd also like the data loading to run concurrently with model training. How do I accomplish this?

The `LanceDataset` with the `ShardedBatchSampler` does not accomplish this. While it is concurrent (via `batch_readahead`), it samples contiguous batches of rows and only randomizes the order of the batches. On the other hand, the model training loop in the LLM example is fully random, but not concurrent. Making the torch `DataLoader` concurrent via `num_workers` breaks Lance, as discussed here.
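To make the goal concrete, here is a rough sketch of the kind of loader I'm after: shuffle all row indices once per epoch, read each batch with `LanceDataset.take`, and hide the read latency behind a single producer thread instead of `num_workers`. The URI, column names, and prefetch depth below are placeholders, and I'm not assuming this is the recommended Lance pattern, just illustrating the behavior I want.

```python
import queue
import threading

import lance
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset


class ShuffledLanceBatches(IterableDataset):
    """Yield fully shuffled, equal-size batches from a Lance dataset,
    prefetched by a single background thread."""

    def __init__(self, uri, columns, batch_size, prefetch=4, seed=0):
        self.uri = uri
        self.columns = columns
        self.batch_size = batch_size
        self.prefetch = prefetch
        self.rng = np.random.default_rng(seed)

    def _producer(self, q):
        ds = lance.dataset(self.uri)  # open inside the producer thread
        order = self.rng.permutation(ds.count_rows())  # fresh permutation each epoch, no replacement
        for start in range(0, len(order), self.batch_size):
            idx = order[start : start + self.batch_size]
            if len(idx) < self.batch_size:  # drop the ragged final batch
                break
            # Random access by row index; returns a pyarrow Table.
            q.put(ds.take(idx.tolist(), columns=self.columns))
        q.put(None)  # sentinel: epoch finished

    def __iter__(self):
        q = queue.Queue(maxsize=self.prefetch)
        threading.Thread(target=self._producer, args=(q,), daemon=True).start()
        while (table := q.get()) is not None:
            # Assumes numeric columns; adapt the conversion to your schema.
            yield {c: torch.from_numpy(table[c].to_numpy()) for c in self.columns}


# The dataset already yields whole batches, so keep batch_size=None and
# num_workers=0; the overlap with training comes from the producer thread.
loader = DataLoader(
    ShuffledLanceBatches("data/train.lance", ["x", "y"], batch_size=256),
    batch_size=None,
    num_workers=0,
)
```

Everything stays in one process, which sidesteps the `num_workers` problem, but I don't know whether `take`-based random access is fast enough to keep a GPU fed on a large dataset, hence the question.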
Replies: 1 comment 2 replies

- I now see that an `async_dataset` may be what I need?