-
It's in there, just try it out. What it does is compute the norm over all parameter gradients and, if it exceeds the threshold, scale them down. You can think of it as a kind of dynamic learning rate: whenever the model gets excited and wants to change something, the learning rate is scaled down. This prevents the model from navigating through an unstable loss landscape. A max grad norm of 0.01 sounds extremely low to me (as the norm is, as far as I understand, applied BEFORE the learning rate). So if the learning rate is, say, 1e-4 and max_grad_norm is 1e-2, then the maximum a parameter could change per step is 1e-6 (realistically, it's more like 1e-8). To me it sounds like nothing would happen anymore (in particular if the parameters are low-precision bf16). On the other hand: I absolutely agree that Flux training is extremely unstable. So maybe it helps, who knows. If you try it, please share your results/experiences!
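For anyone wondering what this looks like mechanically, here is a minimal, hypothetical PyTorch training-loop sketch (toy model, optimizer, and data, not the trainer's actual code) showing where gradient-norm clipping sits relative to the optimizer step:

```python
import torch

# Toy stand-ins so the snippet runs; the real trainer's model/optimizer differ.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
max_grad_norm = 0.01  # the value discussed above

for _ in range(10):
    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()
    loss.backward()

    # clip_grad_norm_ computes the global L2 norm over all gradients and, if it
    # exceeds max_grad_norm, rescales every gradient by max_grad_norm / total_norm.
    # This happens BEFORE optimizer.step(), i.e. before the learning rate is applied.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

    optimizer.step()
    optimizer.zero_grad()
```

The key point is that clipping rescales gradients before the learning rate is applied, which is why a very small `max_grad_norm` combined with a small learning rate can shrink per-step updates to nearly nothing.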
-
I changed the default loss style for the flow-matching models to match Huawei's method, which overlaps with the minRF and x-flux implementations. If you could, I would suggest redoing a run with the same settings, plus ^ this change.
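For reference, here is a rough, generic sketch of a rectified-flow / flow-matching style loss of the kind used by minRF-style implementations (velocity prediction on a linear interpolation between data and noise). The model signature and the toy usage are assumptions for illustration, not the trainer's exact code:

```python
import torch

def flow_matching_loss(model, x0, noise=None):
    """Rectified-flow style loss sketch: predict the velocity (noise - data) at a
    point sampled linearly between data and noise. Generic illustration only."""
    if noise is None:
        noise = torch.randn_like(x0)
    # Sample timesteps uniformly in [0, 1], broadcastable over the data dims.
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, *([1] * (x0.dim() - 1)))
    # Linear interpolation between data (t=0) and noise (t=1).
    x_t = (1 - t) * x0 + t * noise
    target = noise - x0          # the velocity field the model should predict
    pred = model(x_t, t)         # hypothetical model signature
    return torch.nn.functional.mse_loss(pred, target)

# Toy usage with a stand-in model so the sketch runs end to end.
toy = lambda x, t: x * 0.0
print(flow_matching_loss(toy, torch.randn(4, 3, 8, 8)))
```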
-
The run with … I'll share my config just for funsies.

We might want to mark this thread as NSFW if you want to see any validation images, or I can post them somewhere else. I've been documenting some of the progress on Civitai (NSFW), but the images are pure nightmare fuel.

I will say that it got to basically the exact same validation images with rank-8 at 9k steps as with rank-4 at 3k steps, so I suppose I will continue with the lower rank. It's just that it makes the shape of the appendage with no further detail being added, but the added flag does seem to avoid the ball of limbs, from what I can see in my limited high-step-count tests.

I need to slow down though; I've spent a few hundred so far on running all these tests, so I might just wait for someone else to bankroll a good LoRA. Godspeed.
-
Hello, I have been training Flux LoRAs for a few days and I came across a discussion about using the flag `--max_grad_norm=0.01` for stabilizing training with some models. Is there a layman's explanation of what it does?

For some context, I'm trying to train new anatomical features in Flux (i.e. extra limbs/appendages), and while they are appearing in the validation images to a certain extent, they are always small and deformed. I was curious whether this stabilization could be applied to Flux as well, or whether it would simply prevent the model from learning. At rank 4, 8, and 16, up to 10k steps without this arg, the results are simply not very close, and past that I just get a ball of limbs. So perhaps this is just not possible. I would imagine gradient clipping (?) would make the training take longer, but I am not very well versed in the technical details.
Any help or guidance is appreciated :)