
Continual Pretraining: Unexpected Trainable Parameters in PEFT Model #1578

Open
kailas711 opened this issue Jan 25, 2025 · 5 comments

@kailas711

kailas711 commented Jan 25, 2025

Hi,
I encountered unusual behavior while using the Unsloth continual pretraining notebook (https://unsloth.ai/blog/contpretraining) with small language models (1B-2B parameters).

I used model.print_trainable_parameters() to get the number of trainable parameters for Gemma 2:
trainable params: 1,200,414,720 || all params: 3,814,756,608 || trainable%: 31.4677

After patching the model (e.g., gemma-2-2b) with PEFT adapters using FastLanguageModel.get_peft_model, the reported trainable parameter count remains very high (~1.2B trainable out of ~3.8B total) despite using a rank of 16. This behavior persists when changing lora_r (16, 32, 64), and it also occurs with other small models (llama-3.2-1B, Qwen-2.5-1.5B); it only affects small models.

However, patching larger models (e.g., Mistral-7B-v0.1) results in the expected number of trainable parameters; for Mistral-7B-v0.1, or any other model with more parameters, the count drops to the expected scale:
trainable params: 41,943,040 || all params: 7,283,675,136 || trainable%: 0.5758

The LoRA settings I used:

from unsloth import FastLanguageModel

# `model` is assumed to have been loaded earlier via FastLanguageModel.from_pretrained(...)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj",
                      "up_proj", "down_proj",
                      "embed_tokens", "lm_head",],
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
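
For reference, here's a minimal sketch (not from the notebook; it assumes the PEFT-wrapped model returned by the call above) that breaks the trainable parameters down by module-name prefix, which makes it easier to see where the count comes from:

from collections import defaultdict

counts = defaultdict(int)
for name, param in model.named_parameters():
    if param.requires_grad:
        # group by a short name prefix, e.g. "base_model.model.model.embed_tokens"
        counts[".".join(name.split(".")[:4])] += param.numel()

# print the groups that contribute the most trainable parameters first
for prefix, numel in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{prefix:60s} {numel:>15,}")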

I'm not sure what causes this behaviour; any advice would be helpful.
Thank you.

@danielhanchen added the currently fixing label on Jan 28, 2025
@danielhanchen
Contributor

@Erland366 Could you take a look at the Gemma 2 finetuning notebook and see if it works fine? Thanks :)

@Erland366
Contributor

Wait, I'll check it out.

@Erland366
Contributor

I think there's no problem in the code.

Gemma 2, Llama 3.2, and Qwen have very large vocabularies, so the embedding and lm_head layers are huge.

When doing CPT, we unfreeze the embedding and lm_head layers and train them directly, but we don't apply LoRA on top of them, so all of their parameters become trainable.

Smaller and larger models of the same family usually have a similar number of parameters in the embedding and lm_head, since the vocabulary size is the same. So the trainable parameter count for a smaller model looks much larger as a percentage, but it's actually fairly similar in absolute terms. Here's an example below of 2B Gemma vs 9B Gemma:

[Screenshot: trainable parameter counts for Gemma-2 2B]

[Screenshot: trainable parameter counts for Gemma-2 9B]

It doesn't look as bad if you use a small model with a smaller vocab size, such as TinyLlama, as shown below:

[Screenshot: trainable parameter counts for TinyLlama]
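
As a rough back-of-envelope check (assuming gemma-2-2b's published config of vocab_size = 256,000 and hidden_size = 2,304; treat those values as assumptions), the fully-trained embedding and lm_head alone account for almost all of the 1.2B trainable parameters reported above:

vocab_size, hidden_size = 256_000, 2_304            # assumed gemma-2-2b config values
embed_plus_lm_head = 2 * vocab_size * hidden_size   # both are fully trained for CPT
print(f"{embed_plus_lm_head:,}")                    # 1,179,648,000

That covers roughly 1.18B of the reported 1,200,414,720 trainable params; the remaining ~20M come from the r=16 LoRA matrices on the attention/MLP projections.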

@kailas711
Author

kailas711 commented Jan 29, 2025

Hi @Erland366,
Why does the total number of parameters exceed 2B in the case of Gemma?

That is also reflected in their memory usage.

And have you tried changing the lora_r parameter to see if the number of trainable parameters increases or decreases?

When I changed lora_r in bigger models, I could see the trainable parameter percentage dropping to 5% and 1%, but in smaller models it seems to be stuck at ~30%.

@Erland366
Contributor

Erland366 commented Jan 29, 2025

I'm not sure exactly why we need to keep both the original_module and the modules_to_save copy. I guess it's because when you're doing LoRA, you can't just push gradients into the exact same tensor during training, since the rest of the model is already handled separately through LoRA, so the two copies are kept separate.

But the original_module copy actually resides on the CPU, and (I think) the function still counts it in the parameter total. For TinyLlama the number also increases, by the way, but it just doesn't pass the 1B mark.

print(model.base_model.model.model.embed_tokens.original_module.weight.device) # cpu
print(model.base_model.model.model.embed_tokens.modules_to_save.default.weight.device) # cuda:0
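
Here's a minimal sketch (assuming the same PEFT-wrapped model) that recounts the parameters while skipping the frozen original_module copies, to compare against what print_trainable_parameters() reports:

trainable = sum(p.numel() for n, p in model.named_parameters()
                if p.requires_grad and "original_module" not in n)
total = sum(p.numel() for n, p in model.named_parameters()
            if "original_module" not in n)
print(f"trainable: {trainable:,} || all (excluding original_module copies): {total:,} "
      f"|| {100 * trainable / total:.4f}%")

If the ~3.8B "all params" figure for gemma-2-2b counts both the frozen original_module copies and the new trainable copies of embed_tokens and lm_head, that would also explain why it exceeds the base model's roughly 2.6B parameters.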

Yeah, increasing r increases the trainable params, but only by a small amount. The embedding and lm_head aren't affected by r since they aren't LoRA-ed.
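
As rough numbers (again assuming gemma-2-2b shapes), the LoRA matrices scale linearly with r while the fully-trained embed_tokens and lm_head stay fixed at roughly 1.18B parameters, which is why the percentage barely moves for the small models:

embed_plus_lm_head = 2 * 256_000 * 2_304             # ~1.18B, independent of r
lora_at_r16 = 1_200_414_720 - embed_plus_lm_head     # ~20.8M, from the report above
for r in (16, 32, 64):
    lora = lora_at_r16 * r // 16                      # LoRA params grow roughly linearly in r
    print(f"r={r:3d}: LoRA ≈ {lora:>12,}  vs  embed_tokens + lm_head = {embed_plus_lm_head:,}")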
