Finetuning input / target formatting #70

AetherPrior opened this issue Nov 8, 2024 · 1 comment
Hi, I have a question about data collation for finetuning. I have some input questions and some targets, and I'd like to know whether I need to include the inputs as part of my labels during causal finetuning. Specifically, I've defined my collation function as follows:

def chameleon_collate_fn(batch):
    # Extract the images and questions
    images = [ex['image'] for ex in batch]
    questions = ["<image>"+ex['question'] for ex in batch]
    labels = ["<image>"+ex['question'] + " " + ex['answer'] for ex in batch]
    
    # Process the batch using the processor
    batch_inputs = processor(images=images, text=questions, return_tensors="pt", padding=True)
    
    labels = processor(images=images, text=labels, return_tensors="pt", padding=True).input_ids # feels like labels should be the inputs + answer themselves? 
    
    # mask out pad tokens
    labels = labels.masked_fill(labels == processor.tokenizer.pad_token_id, -100)
    # mask the input from the labels
    labels[:, :len(batch_inputs["input_ids"])] = -100
    
    batch_inputs["labels"] = labels
    
    # Move inputs and labels to the appropriate device
    batch_inputs = {key: val.to('cuda') for key, val in batch_inputs.items()}

    return batch_inputs

However, when I pass these to the model call in the training loop:


for epoch in range(num_epochs):
    step_counter = 0
    for batch in tqdm(train_loader):
        inputs = batch
        model.to('cuda')
        # Forward pass
        outputs = model(**inputs)
        loss = outputs.loss

        # Backward pass
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # wandb log steps
        wandb.log({"global_step": step_counter})

        # Log loss
        wandb.log({"loss": loss.item()})
        step_counter += args.batch_size

    wandb.log({"epoch": epoch + 1})

    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")

I get a ValueError:

ValueError: Expected input batch_size (2068) to match target batch_size (2070).

My batch size is 2, and my input shape is [2, 1035] with my targets [2, 1036] (one extra generation token for a numerical answer), so I'm not sure what the issue is here. Could someone help? Thanks!
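
For context on where the 2068 / 2070 numbers come from: assuming the model applies the usual Hugging Face causal-LM loss (drop the last logit, drop the first label, flatten both into cross-entropy), an input of [2, 1035] against labels of [2, 1036] yields 2 × 1034 = 2068 predictions versus 2 × 1035 = 2070 targets. A minimal shape-only sketch of that arithmetic (random tensors and a placeholder vocab size, not the model's actual internals):

import torch
import torch.nn.functional as F

batch, input_len, label_len, vocab = 2, 1035, 1036, 32  # vocab size is just a placeholder

logits = torch.randn(batch, input_len, vocab)         # one logit row per input token
labels = torch.randint(0, vocab, (batch, label_len))  # one token longer than the input

# Standard causal-LM shift: predict token t+1 from token t, then flatten.
shift_logits = logits[:, :-1, :].reshape(-1, vocab)  # (2 * 1034, vocab) -> 2068 rows
shift_labels = labels[:, 1:].reshape(-1)             # (2 * 1035,)       -> 2070 targets

try:
    F.cross_entropy(shift_logits, shift_labels)
except (ValueError, RuntimeError) as e:  # exact exception type varies by PyTorch version
    print(e)  # Expected input batch_size (2068) to match target batch_size (2070).

In other words, the labels tensor has to be the same length as input_ids; the answer is supervised by keeping its tokens in the labels, and the prompt is ignored by setting its positions to -100, not by making the labels longer.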

@AetherPrior (Author)

I figured this out by changing the data collation pipeline so that the inputs and the labels are built from the same text:

def chameleon_collate_fn(batch):
    # Extract the images and questions
    images = [ex['image'] for ex in batch]
    labels = ["<image>"+ex['question'] + " " + ex['answer'] for ex in batch]
    
    # Process the batch using the processor
    batch_inputs = processor(images=images, text=labels, return_tensors="pt", padding=True)
    
    labels = processor(images=images, text=labels, return_tensors="pt", padding=True).input_ids.clone() # feels like labels should be the inputs + answer themselves? 
    
    # mask out pad tokens
    labels = labels.masked_fill(labels == processor.tokenizer.pad_token_id, -100)
    # mask the input from the labels
    labels[:, :len(batch_inputs["input_ids"])] = -100
    
    batch_inputs["labels"] = labels
    
    # Move inputs and labels to the appropriate device
    batch_inputs = {key: val.to('cuda') for key, val in batch_inputs.items()}

    return batch_inputs

I'm still not entirely sure whether this is correct, so any comments on it would be highly appreciated!
Thank you so much.
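
One thing that might be worth double-checking in the collator above: len(batch_inputs["input_ids"]) is the first dimension of that tensor, i.e. the batch size (2), so labels[:, :len(batch_inputs["input_ids"])] = -100 masks only the first two token positions of every row rather than the whole question. A common pattern for prompt masking is to work out, per example, how many tokens belong to the answer and mask everything before them. Here is a rough sketch under the assumptions that the tokenizer right-pads and that each answer sits at the end of its row (mask_prompt_tokens is a made-up helper name, not part of the processor API):

def mask_prompt_tokens(batch_inputs, answers, tokenizer):
    """Set label positions belonging to the prompt (or padding) to -100.

    Assumes right padding, so the answer occupies the last `answer_len`
    non-pad positions of each row.
    """
    labels = batch_inputs["input_ids"].clone()

    # Never compute loss on padding tokens.
    labels[labels == tokenizer.pad_token_id] = -100

    for i, answer in enumerate(answers):
        # Length of the answer as tokenized in context (the leading space matches
        # the `question + " " + answer` concatenation in the collator).
        answer_len = len(tokenizer(" " + answer, add_special_tokens=False).input_ids)
        # Number of real (non-pad) tokens in this row.
        seq_len = int(batch_inputs["attention_mask"][i].sum())
        # Mask the question, image tokens, and any special tokens before the answer.
        labels[i, : seq_len - answer_len] = -100

    return labels

Because BPE merges can differ slightly at the question/answer boundary, it's worth verifying on one example that the unmasked positions decode back to just the answer, e.g. processor.tokenizer.decode(batch_inputs["input_ids"][0][labels[0] != -100]). If the tokenizer left-pads, or the answer is not at the end of the row, the slice bounds would need to be adjusted accordingly.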
