
Continual Pretraining: Unexpected Trainable Parameters in PEFT Model #1578

Open
kailas711 opened this issue Jan 25, 2025 · 5 comments

@kailas711

kailas711 commented Jan 25, 2025

Hi,
I encountered unusual behavior while using the Unsloth continual pretraining notebook (https://unsloth.ai/blog/contpretraining) with small language models (1B-2B parameters).

I used model.print_trainable_parameters() to get the number of trainable parameters for Gemma 2:
trainable params: 1,200,414,720 || all params: 3,814,756,608 || trainable%: 31.4677

After patching the model (e.g., gemma-2-2b) with PEFT adapters using FastLanguageModel.get_peft_model, the reported trainable parameter count remains very high (~1.2B trainable out of ~3.8B total) despite using a rank of 16. This behavior persists when changing lora_r (16, 32, 64), and it also occurs with other small models (llama-3.2-1B, Qwen-2.5-1.5B); it only affects small models.

However, patching larger models (e.g., Mistral-7B-v0.1) results in the expected number of trainable parameters; for Mistral-7B-v0.1, or any other model with more parameters, the count drops to the expected scale:
trainable params: 41,943,040 || all params: 7,283,675,136 || trainable%: 0.5758

The LoRA settings I used:

from unsloth import FastLanguageModel

# `model` is assumed to have been loaded earlier via FastLanguageModel.from_pretrained(...)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj",
                      "up_proj", "down_proj",
                      "embed_tokens", "lm_head",],
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
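
For reference, here's a minimal sketch (not from the notebook; it assumes the PEFT-wrapped model returned by the call above) that breaks the trainable parameters down by module-name prefix, which makes it easier to see where the count comes from:

from collections import defaultdict

counts = defaultdict(int)
for name, param in model.named_parameters():
    if param.requires_grad:
        # group by a short name prefix, e.g. "base_model.model.model.embed_tokens"
        counts[".".join(name.split(".")[:4])] += param.numel()

# print the groups that contribute the most trainable parameters first
for prefix, numel in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{prefix:60s} {numel:>15,}")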

I'm not sure what causes this behaviour; any advice would be helpful.
Thank you.

@danielhanchen added the currently fixing label on Jan 28, 2025
@danielhanchen
Contributor

@Erland366 Could you take a look at the Gemma 2 finetuning notebook and see if it works fine? Thanks :)

@Erland366
Contributor

Wait, I'll check it out.

@Erland366
Contributor

I think there's no problem in the code.

Gemma 2, Llama 3.2, and Qwen have very large vocabularies, so the embedding and lm_head layers are huge.

When doing CPT, we unfreeze the embedding and lm_head layers and train them directly, but we don't apply LoRA on top of them, so all of their parameters become trainable.

Smaller and larger models of the same family usually have a similar number of parameters in the embedding and lm_head, since the vocabulary size is the same. So the trainable parameter count for a smaller model looks much larger as a percentage, but it's actually fairly similar in absolute terms. Here's an example below of 2B Gemma vs 9B Gemma:

[Screenshot: trainable parameter counts for Gemma-2 2B]

[Screenshot: trainable parameter counts for Gemma-2 9B]

It doesn't look as bad if you use a small model with a smaller vocab size, such as TinyLlama, as shown below:

[Screenshot: trainable parameter counts for TinyLlama]
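
As a rough back-of-envelope check (assuming gemma-2-2b's published config of vocab_size = 256,000 and hidden_size = 2,304; treat those values as assumptions), the fully-trained embedding and lm_head alone account for almost all of the 1.2B trainable parameters reported above:

vocab_size, hidden_size = 256_000, 2_304            # assumed gemma-2-2b config values
embed_plus_lm_head = 2 * vocab_size * hidden_size   # both are fully trained for CPT
print(f"{embed_plus_lm_head:,}")                    # 1,179,648,000

That covers roughly 1.18B of the reported 1,200,414,720 trainable params; the remaining ~20M come from the r=16 LoRA matrices on the attention/MLP projections.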

@kailas711
Author

kailas711 commented Jan 29, 2025

Hi @Erland366,
Why does the total number of parameters exceed 2B in the case of Gemma?

That is also reflected in their memory usage.

And have you tried changing the lora_r parameter to see if the number of trainable parameters increases or decreases?

When I changed lora_r in bigger models, I could see the trainable parameter percentage dropping to 5% and 1%, but in smaller models it seems to be stuck at ~30%.

@Erland366
Contributor

Erland366 commented Jan 29, 2025

I'm not sure exactly why we need to keep both the original_module and the modules_to_save copy. I guess it's because when you're doing LoRA, you can't just push gradients into the exact same tensor during training, since the rest of the model is already handled separately through LoRA, so the two copies are kept separate.

But the original_module copy actually resides on the CPU, and (I think) the function still counts it in the parameter total. For TinyLlama the number also increases, by the way, but it just doesn't pass the 1B mark.

print(model.base_model.model.model.embed_tokens.original_module.weight.device) # cpu
print(model.base_model.model.model.embed_tokens.modules_to_save.default.weight.device) # cuda:0
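
Here's a minimal sketch (assuming the same PEFT-wrapped model) that recounts the parameters while skipping the frozen original_module copies, to compare against what print_trainable_parameters() reports:

trainable = sum(p.numel() for n, p in model.named_parameters()
                if p.requires_grad and "original_module" not in n)
total = sum(p.numel() for n, p in model.named_parameters()
            if "original_module" not in n)
print(f"trainable: {trainable:,} || all (excluding original_module copies): {total:,} "
      f"|| {100 * trainable / total:.4f}%")

If the ~3.8B "all params" figure for gemma-2-2b counts both the frozen original_module copies and the new trainable copies of embed_tokens and lm_head, that would also explain why it exceeds the base model's roughly 2.6B parameters.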

Yeah, increasing r increases the trainable params, but only by a small amount. The embedding and lm_head aren't affected by r since they aren't LoRA-ed.
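
As rough numbers (again assuming gemma-2-2b shapes), the LoRA matrices scale linearly with r while the fully-trained embed_tokens and lm_head stay fixed at roughly 1.18B parameters, which is why the percentage barely moves for the small models:

embed_plus_lm_head = 2 * 256_000 * 2_304             # ~1.18B, independent of r
lora_at_r16 = 1_200_414_720 - embed_plus_lm_head     # ~20.8M, from the report above
for r in (16, 32, 64):
    lora = lora_at_r16 * r // 16                      # LoRA params grow roughly linearly in r
    print(f"r={r:3d}: LoRA ≈ {lora:>12,}  vs  embed_tokens + lm_head = {embed_plus_lm_head:,}")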
