A faster and more memory-efficient implementation of zero_to_fp32
#6658
base: master
Conversation
deepspeed/utils/zero_to_fp32.py
Outdated
```
@@ -305,6 +303,7 @@ def _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero
    if debug:
        print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
    state_dict[name] = full_single_fp32_vector.narrow(0, offset, unpartitioned_numel).view(shape)
    # tensor = GatheredTensor(fp32_flat_groups, offset, partitioned_numel, shape)
```
Remove commented code
done
deepspeed/utils/zero_to_fp32.py
Outdated
```
tensor_slice.append(flat_tensor[start_offset:end_offset])

pad_flat_param_chunks.append(torch.concat(tensor_slice, 0))
pad_flat_param = torch.cat(pad_flat_param_chunks, dim=0).to(torch.float16)
```
Why is this cast to fp16? The checkpoint could be a different dtype like bf16.
The cast to fp16 has been removed.
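For illustration, a minimal sketch of the dtype-preserving behavior after the fix (the chunk values here are made up; only the `torch.cat` dtype behavior matters):

```python
import torch

# Before (from the diff above): the merged flat parameter was always cast
# to fp16, which would down-convert fp32 or bf16 checkpoints.
# pad_flat_param = torch.cat(pad_flat_param_chunks, dim=0).to(torch.float16)

# After removing the cast, torch.cat simply preserves the checkpoint's
# own dtype, whatever it is.
pad_flat_param_chunks = [
    torch.ones(4, dtype=torch.bfloat16),
    torch.zeros(4, dtype=torch.bfloat16),
]
pad_flat_param = torch.cat(pad_flat_param_chunks, dim=0)
assert pad_flat_param.dtype == torch.bfloat16  # bf16 in, bf16 out
```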
deepspeed/utils/zero_to_fp32.py
Outdated
```
return param

# The following part makes it compatible with `huggingface_hub.split_torch_state_dict_into_shards`
# https://github.com/huggingface/huggingface_hub/blob/src/huggingface_hub/serialization/_torch.py
```
Can you clarify why these HF_Hub APIs are needed, since the pseudo tensor is never exported into the output checkpoint file?
DeepSpeed/deepspeed/utils/zero_to_fp32.py, lines 565 to 566 in 6e6563d:

```python
state_dict_split = split_torch_state_dict_into_shards(state_dict,
                                                      filename_pattern=filename_pattern,
```
The HF_Hub API is used to split the weights into shards. Our pseudo tensor should be compatible with `get_torch_storage_id` and `get_torch_storage_size` in
https://github.com/huggingface/huggingface_hub/blob/v0.26.1/src/huggingface_hub/serialization/_torch.py#L335
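For context, a hedged sketch of how the splitter is typically called (the filename pattern and shard size here are illustrative, not the PR's exact values). The splitter only needs each tensor's storage id and storage size to plan the shards; it never reads the values, which is why a pseudo tensor that mimics those two properties is enough:

```python
import torch
from huggingface_hub import split_torch_state_dict_into_shards

state_dict = {"a.weight": torch.zeros(1024, 1024), "b.weight": torch.zeros(2048)}

# Internally this walks the state_dict and calls get_torch_storage_id /
# get_torch_storage_size on each tensor to group them into shards that
# stay under max_shard_size; tensor values are never touched here.
split = split_torch_state_dict_into_shards(
    state_dict,
    filename_pattern="pytorch_model{suffix}.bin",  # illustrative pattern
    max_shard_size="5GB",
)
for filename, tensor_names in split.filename_to_tensors.items():
    print(filename, tensor_names)
```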
My question is: why is this compatibility a requirement? To my understanding, the output checkpoint file will contain torch.Tensor, not the pseudo tensor. Am I wrong?
@xylian86, FYI
@xu-song, just to clarify, we greatly appreciate this PR. The memory and speed benefits are very useful. My only concern is the HF_Hub-related changes, so hopefully those can be clarified. Can you please add the observed speed and memory benefits of this optimization? Such details are generally useful for readers to better appreciate the value. Thanks!
@tjruwase Is there any alternative approach to sharding a torch state_dict? If there is, the compatibility feature for huggingface_hub could be dropped.
Sorry, but I am a bit confused about the objective of this PR. The goal of zero_to_fp32 is to create a consolidated checkpoint state from the sharded checkpoints of ZeRO-* training, so I don't understand why state_dict sharding is a consideration here. It seems that there are two parts to this PR.

Am I correct?
DeepSpeed/deepspeed/utils/zero_to_fp32.py, lines 565 to 567 in 54903e0

To save memory, the tensors in state_dict are pseudo tensors instead of torch tensors.
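A hedged sketch of the save loop this enables (the function name and loop structure are illustrative, not the PR's exact code): only the tensors assigned to the current shard are materialized before writing.

```python
import torch

def save_shards(state_dict, filename_to_tensors):
    """Write one output shard at a time. `state_dict` maps names to
    pseudo tensors; .contiguous() is what actually reads the mmapped
    weights and returns a real torch.Tensor, so at most one shard's
    worth of real tensors exists in memory at any moment."""
    for shard_file, tensor_names in filename_to_tensors.items():
        shard = {name: state_dict[name].contiguous() for name in tensor_names}
        torch.save(shard, shard_file)
        del shard  # release the materialized tensors before the next shard
```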
It is a faster and more memory-efficient implementation of zero_to_fp32. The previous version doubles the memory usage, which causes CPU OOM for very large models (e.g. llama-405B).
DeepSpeed/deepspeed/utils/zero_to_fp32.py
Lines 438 to 441 in b647fb2
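To see where the doubling comes from, a simplified illustration (`full_single_fp32_vector` is the name used in the diff earlier; the sizes are toy values):

```python
import torch

# The flat groups together hold the full weights: M bytes.
fp32_flat_groups = [torch.zeros(1_000_000) for _ in range(4)]

# torch.cat allocates a second full-size buffer while the sources are
# still alive, so peak memory is roughly 2M before anything can be freed.
full_single_fp32_vector = torch.cat(fp32_flat_groups, dim=0)
```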
How does it work?

- Checkpoint shards are loaded with mmap=True, thus the weights are mmapped rather than loading all the storages into memory.
- GatheredTensor contains the mmapped weights and the tensor offset. It is a memory-efficient pseudo tensor. Only when .contiguous() is called does it load the related weights into memory and merge them into a single tensor (a sketch follows below).
- Throughout the process, only one shard of tensors is kept in memory.
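A minimal sketch of such a pseudo tensor, assuming `flat_groups` is a list of per-rank mmapped flat tensors (the real GatheredTensor in this PR also handles padding and the different ZeRO stages):

```python
import math
import torch

class GatheredTensor:
    """Records where a parameter lives inside the mmapped flat groups;
    nothing is read from disk until .contiguous() is called."""

    def __init__(self, flat_groups, offset, partitioned_numel, shape):
        self.flat_groups = flat_groups      # per-rank mmapped flat tensors
        self.offset = offset                # start of this param in each partition
        self.partitioned_numel = partitioned_numel
        self.shape = tuple(shape)
        self.dtype = flat_groups[0].dtype   # inspected by the HF_Hub splitter

    def contiguous(self):
        # Only now are the relevant slices pulled from the mmapped files,
        # concatenated, trimmed of ZeRO padding, and reshaped.
        slices = [flat[self.offset:self.offset + self.partitioned_numel]
                  for flat in self.flat_groups]
        return torch.cat(slices, dim=0)[:math.prod(self.shape)].view(self.shape)
```

With shards opened via torch.load(path, map_location="cpu", mmap=True), the concatenation in .contiguous() only faults in the file pages that the requested slices cover, so the rest of the checkpoint never enters memory.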
How much benefit in speed and memory?

Memory benefit: from 2M to (1/n)M, where M is the memory cost of the full weights and n is num_shards; for llama3.1-405B, n=191 (a back-of-the-envelope check follows below). Previously, converting llama3.1-405B from zero3 to fp32 got OOM; after this optimization, the memory cost is about 200GB-300GB.

Speed benefit:
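A back-of-the-envelope check of the memory numbers above (assuming fp32 at 4 bytes per parameter; the observed 200GB-300GB presumably also includes buffers materialized while writing the output shards, not just the (1/n)M working set):

```python
params = 405e9                    # llama3.1-405B
M_tib = params * 4 / 2**40        # ~1.47 TiB for the full fp32 weights
old_peak = 2 * M_tib              # ~2.9 TiB peak for the previous implementation
n = 191                           # num_shards reported for llama3.1-405B
per_shard_gib = M_tib / n * 1024  # ~7.9 GiB of flat-group data held at a time
print(f"M = {M_tib:.2f} TiB, old peak = {old_peak:.2f} TiB, "
      f"per-shard working set = {per_shard_gib:.1f} GiB")
```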