validate use_remove_padding when applying sequence parallelism #153

Open · wants to merge 3 commits into main

Conversation

chujiezheng

This is because Ulysses sequence parallelism (ulysses_sp) is only activated when use_remove_padding is enabled:

if self.use_remove_padding:
    input_ids_rmpad, indices, *_ = unpad_input(input_ids.unsqueeze(-1),
                                               attention_mask)  # input_ids_rmpad (total_nnz, ...)
    input_ids_rmpad = input_ids_rmpad.transpose(0, 1)  # (1, total_nnz)

    # unpad the position_ids to align the rotary
    position_ids_rmpad = index_first_axis(rearrange(position_ids.unsqueeze(-1), "b s ... -> (b s) ..."),
                                          indices).transpose(0, 1)

    # for compute the log_prob
    input_ids_rmpad_rolled = torch.roll(input_ids_rmpad, shifts=-1, dims=1)  # (1, total_nnz)

    # pad and slice the inputs if sp > 1
    if self.use_ulysses_sp:
        input_ids_rmpad, position_ids_rmpad, pad_size = ulysses_pad_and_slice_inputs(input_ids_rmpad, \
                                                                                     position_ids_rmpad, \
                                                                                     sp_size=self.ulysses_sequence_parallel_size)
        input_ids_rmpad_rolled, _, _ = ulysses_pad_and_slice_inputs(input_ids_rmpad_rolled, None,
                                                                    self.ulysses_sequence_parallel_size)

    input_ids_rmpad_rolled = input_ids_rmpad_rolled.squeeze(0)  # ((total_nnz / sp) + pad)

    # only pass input_ids and position_ids to enable flash_attn_varlen
    output = self.actor_module(input_ids=input_ids_rmpad,
                               attention_mask=None,
                               position_ids=position_ids_rmpad,
                               use_cache=False)  # prevent model thinks we are generating
    logits_rmpad = output.logits.squeeze(0)  # (total_nnz, vocab_size)

    logits_rmpad.div_(temperature)

    # compute entropy
    entropy_rmpad = self.compute_entropy_from_logits(logits_rmpad)  # ((total_nnz / sp) + pad)

    # if use_sp: ((total_nnz / sp) + pad) ; if not use_sp: (batch, seqlen)
    log_probs = logprobs_from_logits(logits=logits_rmpad, labels=input_ids_rmpad_rolled)

    # gather log_prob if sp > 1
    if self.use_ulysses_sp:
        # gather and unpad for the ulysses sp
        log_probs = gather_outpus_and_unpad(log_probs, gather_dim=0, unpad_dim=0, padding_size=pad_size)
        entropy_rmpad = gather_outpus_and_unpad(entropy_rmpad,
                                                gather_dim=0,
                                                unpad_dim=0,
                                                padding_size=pad_size)

    # pad back to (bsz, seqlen)
    full_entropy = pad_input(hidden_states=entropy_rmpad.unsqueeze(-1),
                             indices=indices,
                             batch=batch_size,
                             seqlen=seqlen)
    full_log_probs = pad_input(hidden_states=log_probs.unsqueeze(-1),
                               indices=indices,
                               batch=batch_size,
                               seqlen=seqlen)

    # only return response part:
    entropy = full_entropy.squeeze(-1)[:, -response_length - 1:-1]  # (bsz, response_length)
    log_probs = full_log_probs.squeeze(-1)[:, -response_length - 1:-1]  # (bsz, response_length)

Without this check, users may encounter OOM issues when they set sp_size > 1 but use_remove_padding is mistakenly disabled: the Ulysses padding and slicing above only happens inside the remove-padding branch, so every rank still processes the full padded sequence.
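To make the memory argument concrete, here is a rough back-of-envelope sketch (hypothetical sizes, not taken from the PR) of how many tokens each rank ends up holding in the two configurations:

# Hypothetical sizes for illustration only.
batch_size, seqlen, sp_size = 8, 4096, 4
valid_ratio = 0.5                                   # assumed fraction of non-pad tokens
total_nnz = int(batch_size * seqlen * valid_ratio)  # tokens that survive unpad_input

# With use_remove_padding + Ulysses SP: tokens are split across sp ranks.
tokens_per_rank_rmpad = total_nnz // sp_size        # 4096 tokens per rank

# Without use_remove_padding: every rank keeps the full padded batch,
# so sp_size > 1 gives no memory saving.
tokens_per_rank_padded = batch_size * seqlen        # 32768 tokens per rank

print(tokens_per_rank_rmpad, tokens_per_rank_padded)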

@vermouth1992
Collaborator

Could you format the code by running bash script/format.sh?

@chujiezheng
Author

Done!

@@ -86,4 +86,13 @@ def check_mutually_exclusive(mbs, mbs_per_gpu, name: str):
    assert config.critic.ppo_mini_batch_size % config.critic.ppo_micro_batch_size == 0
    assert config.critic.ppo_micro_batch_size * sp_size >= n_gpus

    # Check if use_remove_padding is enabled when using sequence parallelism
    if config.actor_rollout_ref.actor.ulysses_sequence_parallel_size > 1:

Collaborator

It seems that the correct keys are config.actor_rollout_ref.model.use_remove_padding, critic.model.use_remove_padding and reward_model.model.use_remove_padding

Author

Fixed now!
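For reference, a minimal sketch of the kind of guard being discussed, using the keys pointed out in the review above (an illustration only, not necessarily the exact code merged in this PR; analogous checks would apply to the critic and reward model, assuming they expose a similar ulysses_sequence_parallel_size field):

# Sketch of the intended validation inside the config check; names follow the
# diff and review comments above, exact messages and placement may differ.
if config.actor_rollout_ref.actor.ulysses_sequence_parallel_size > 1:
    assert config.actor_rollout_ref.model.use_remove_padding, \
        "use_remove_padding must be enabled when ulysses_sequence_parallel_size > 1"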
