After fine-tune, how to correctly save the text encoder for use with StableDiffusionXLPipeline.from_pretrained? #15
Comments
Yeah, the naming of the keys and the way they are converted to / expected by HuggingFace (diffusers/transformers) is pretty "delicate"; fortunately, I recently discovered that the HF team updated their conversion script, so it now works with recent versions of "transformers"! Then, to extract only the text encoder and to ensure I include the correct keys, I used a trick:
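(A sketch of what such a key-filtering trick can look like - this is an assumption about the approach, not necessarily the exact code; it assumes the fine-tune has already been converted to an HF-format CLIPModel, from which only the text_model.* weights are kept for a standalone CLIPTextModel. The model path is a placeholder.)

```python
# Hedged sketch: pull just the text encoder out of a converted HF CLIPModel
# checkpoint. The model path / output directory are placeholders.
from transformers import CLIPModel, CLIPTextModel

full = CLIPModel.from_pretrained("path/to/converted-finetune")  # converted full CLIP model

# CLIPTextModel uses the same "text_model." key prefix as CLIPModel,
# so the text-encoder weights can be copied over by prefix:
text_encoder = CLIPTextModel(full.config.text_config)
text_sd = {k: v for k, v in full.state_dict().items() if k.startswith("text_model.")}
missing, unexpected = text_encoder.load_state_dict(text_sd, strict=False)
print("missing:", missing, "unexpected:", unexpected)  # should both be (near) empty

text_encoder.save_pretrained("ft-checkpoints/text_encoder")
```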
PS: I'm curious about your results with multi-GPU training for CLIP alone; technically, a larger batch_size (larger than 24 GB VRAM allows) should be great for CLIP (although Geometric Parametrization manages to offset the otherwise catastrophic effects of tiny batch sizes, 'more' should still be better, in theory - I never tried, I only have 1 GPU). Wishing you much success! :)
Thanks for the response! I tried "convert_clip_original_pytorch_to_hf.py", but kept getting EOF errors, even when I trained with "ft-B-train-OpenAI-CLIP-ViT-L-14_test_0.py". So I gave up and instead modified "ft-B-train-OpenAI-CLIP-ViT-L-14_test_0.py" to use transformers. The training loss looks similar, and I can save with save_pretrained. I just tested it with the SDXL pipeline and it works easily. That's enough to start testing CLIP training; if needed later, I'll try to figure out what's going on with converting to HuggingFace. Also, accelerate FSDP cpu_offload works with 1 GPU, and with little impact on training speed.
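Roughly what that round trip looks like with the transformers/diffusers APIs (a sketch - the paths and the SDXL base repo id are placeholders, not the exact setup used here):

```python
# Hedged sketch of save_pretrained -> SDXL pipeline; paths / repo id are placeholders.
import torch
from transformers import CLIPTextModel
from diffusers import StableDiffusionXLPipeline

# After training, the fine-tuned text encoder is assumed to have been saved with
# text_encoder.save_pretrained("ft-checkpoints/text_encoder")

text_encoder = CLIPTextModel.from_pretrained("ft-checkpoints/text_encoder")
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    text_encoder=text_encoder,   # swap in the fine-tuned CLIP-L text encoder
    torch_dtype=torch.float16,
)
```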
When I get some more time to sit down and test out CLIP training, I'll try to add accelerate FSDP. If you try to add it yourself before then and run into any problems, just let me know.
Thank you for the tip, too! Modifying the script to use accelerate was the easy part - but how on earth did you get to configure it at all? :-)
Are you not simply using "accelerator = Accelerator()" (can't pass a config there either, haha - I tried!) and "model, optimizer, train_dataloader, val_dataloader, scheduler = accelerator.prepare(model, optimizer, train_dataloader, val_dataloader, scheduler)", all in the Python script, maybe? What's this sorcery? 🙃 Thanks a lot for your help at this point! 😀
I got it to work; on a single RTX 4090 I ran a bsz of 220 for 1 epoch.

To set up your accelerate config file: in the terminal, activate the venv if needed (source venv/bin/activate), then type "accelerate config" and follow the instructions; it'll put a config yaml in ~/.cache/huggingface/accelerate/. Here's my accelerate config "default_config.yaml" that I used - it starts with compute_environment: LOCAL_MACHINE, and everything from "fsdp_activation_checkpointing" through "fsdp_use_orig_params" is indented (I don't know how to make comments keep indents).

Yes, I use "accelerator = Accelerator()", but you put things like mixed_precision or the gradient_accumulation stuff there. As for "model, optimizer, train_dataloader, val_dataloader, scheduler = accelerator.prepare(model, optimizer, train_dataloader, val_dataloader, scheduler)": when using FSDP, prepare the model first.

Best of luck!
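In script terms, that ordering looks roughly like this (a sketch with a toy model and placeholder hyperparameters; the actual FSDP settings come from the yaml that "accelerate config" writes, not from the code):

```python
# Minimal sketch of the Accelerator setup described above, not the exact script.
# The FSDP details come from ~/.cache/huggingface/accelerate/default_config.yaml;
# the toy model/data here are placeholders just to show the prepare() order.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16", gradient_accumulation_steps=1)

model = torch.nn.Linear(512, 512)                                       # stand-in for the CLIP model
train_dataloader = DataLoader(TensorDataset(torch.randn(64, 512)), batch_size=8)

# When using FSDP, prepare the model first, then build the optimizer from the
# prepared (sharded) parameters before preparing it:
model = accelerator.prepare(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
optimizer, train_dataloader = accelerator.prepare(optimizer, train_dataloader)
```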
Thank you very much, I will try that! I actually had the indentation etc. for (after) it. PS: I also just committed Convert-for-HuggingFace-Spaces-etc.
Thanks for Convert-for-HuggingFace-Spaces-etc. When I finish these two projects, I'll go back and try again.

As I said before, my friend pointed me at this to see how multi-GPU was used to create huge batch sizes when training CLIP. Initially, I was thinking I could emulate multi-GPU by storing the data from each batch and collecting it together to calculate the loss. But then I found a pull request which does exactly that (assuming I understand it correctly). I think it may be a better alternative to FSDP, since it allows for an effectively infinite batch size, assuming everything fits in memory. Accelerate's distributed state could then be used instead for multi-GPU, which is much simpler.

I copied some of the pull request's code changes, and after some very, very, very long discussions with ChatGPT, I think I got it working. I tested it with --accum-freq of 2 and bsz 40, for a few epochs, and things look correct. I'm using the huggingface version of CLIP; the core of the code is the --accum-freq accumulation loop, which begins with accum_image_features = [].
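For reference, a rough sketch of what that kind of --accum-freq feature accumulation can look like with the HF CLIPModel - this is a hedged reconstruction of the idea, not the exact code, and the names apart from accum_image_features are assumptions:

```python
# Hedged sketch of --accum-freq style feature accumulation (open_clip-like idea)
# with the HF CLIPModel; not the exact code from this thread.
import torch
import torch.nn.functional as F
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-7)

def accum_step(sub_batches):
    """sub_batches: list of dicts holding 'pixel_values' and 'input_ids' tensors."""
    # 1) First pass, no gradients: cache normalized features of every sub-batch.
    accum_image_features, accum_text_features = [], []
    with torch.no_grad():
        for b in sub_batches:
            accum_image_features.append(F.normalize(model.get_image_features(pixel_values=b["pixel_values"]), dim=-1))
            accum_text_features.append(F.normalize(model.get_text_features(input_ids=b["input_ids"]), dim=-1))

    # 2) Second pass: recompute one sub-batch WITH gradients, splice it into the
    #    cached features, and take the contrastive loss over the whole big batch.
    optimizer.zero_grad()
    for i, b in enumerate(sub_batches):
        img = F.normalize(model.get_image_features(pixel_values=b["pixel_values"]), dim=-1)
        txt = F.normalize(model.get_text_features(input_ids=b["input_ids"]), dim=-1)
        all_img = torch.cat(accum_image_features[:i] + [img] + accum_image_features[i + 1:])
        all_txt = torch.cat(accum_text_features[:i] + [txt] + accum_text_features[i + 1:])
        logits = model.logit_scale.exp() * all_img @ all_txt.t()
        labels = torch.arange(all_img.shape[0], device=logits.device)
        loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
        loss.backward()  # gradients accumulate across sub-batches
    optimizer.step()
```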
I've put this code, the launch script, and the stuff I toyed with for accelerate FSDP at:

I modified all the dataset-related stuff so I could use my already-prepared SDXL training dataset, so you'll need to put yours back in. Hope you find something useful. I'm going to re-check this accum_freq code to make sure it's working correctly, then go try to understand that GmP stuff you're doing. Then I'll go back and test that Convert-for-HuggingFace-Spaces-etc you put up. Again, thanks for this awesome CLIP training repo!
It's definitely interesting; with Flux.1, there's now a thing called block swapping (in reference to the diffusion transformer), and that's how you pull off fine-tuning 12 billion parameters on 24 GB of VRAM: https://github.com/bmaltais/kohya_ss/tree/sd3-flux.1. I really wish I could clone myself into 10 AI agents that could go and explore everything that's interesting in AI, then come back with an implementation for CLIP (because fine-tuning Big-G is another of the open issues on this repo, and I'd very much be interested in doing that, too!).

Meanwhile, I am hand-compiling PyTorch to even get this to work without downgrading - there's no libuv on Windows in PyTorch. I compiled that, and the good GPT-4o fixed the C++ includes that did not apply to Windows. Now it compiles without gloo (but with MPI), so AI & I just need to fix gloo! (Which has its own value - being able to compile PyTorch and its dependencies - so I'm still gonna pursue this. But thank you for sharing the code & info & new approach, I'll be sure to try that as well! No worries about it being a 'hot mess' - I am very used to that, haha!)

Re: the GmP stuff, I also mentioned that in the now-quite-lengthy readme.md - but just in case, here's the link to the paper that inspired CLIP-GmP: https://arxiv.org/abs/2305.15912v4. It mentions ImageNet and ReLU, but why wouldn't it work for CLIP + GELU?! -> It does, I found out - after, just like you did, a lengthy discussion with GPT-4*. :-)
I tested accum_freq 1 through 32, with bsz 40-45 & lr 5e-7, for 10 epochs. The f1 & logits charts are missing values due to using the wrong scale or being added later, but loss/val_loss/val_acc tell a good enough story. I tested bsz 45 * accum_freq 32, which gave a CUDA OOM error, so it's still memory bound.

After I'm done with CLIP-L, I'll look at CLIP-G. Maybe a PEFT of some kind with FSDP cpu_offload to stretch VRAM? You mentioned the VRAM requirements were crazy huge.

Best of luck getting it to work on Windows. My coding skills aren't good enough to troubleshoot library issues on Windows. After I realized I was going to be training models and learning deep learning for the foreseeable future, I slowly made the transition to a standalone Ubuntu training box.
Thanks for sharing these results!

About exploding gradients: as long as it's not constantly 'inf', and especially if it occurs in the earlier layers or the middle (0 and, if I remember right, 14 (or 15?) are candidate "delicate" layers for exploding gradients), it should be fine; CLIP's GELU seems to be very robust at handling very large or vanishing gradients. If it keeps happening and / or is inf, you could try lowering the learning rate of just that layer specifically.

And yeah, I considered PEFT, i.e. LoRA for CLIP - but I think CLIP is best off with all-weights-require-gradient. Although if you take a look at my CLIP-entropy repo, it seems that the later layers make the largest difference - at least when it comes to CLIP's attention weights. Input layers often have large gradients, though, so I am not sure I should discount them as "doesn't matter"; there are also the MLP features, after all (not just Attn), and they change quite dramatically even in layers 7-10 and such.

Also, check out this new paper and code - Inf-CLIP! Very interesting plots about batch_size vs. accuracy there, too! I just started poking around (doing 5 different things at the same time on a Saturday again, haha), but I am investigating this INFINITE CLIP!
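If you do end up lowering the learning rate of just one layer, optimizer parameter groups are the simplest way; a minimal sketch (the layer index and learning rates here are placeholders, not recommendations):

```python
# Minimal sketch of a per-layer learning rate, assuming an HF CLIPModel; the
# "delicate" layer name and both learning rates are placeholders.
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")

delicate = "text_model.encoder.layers.14."   # layer whose gradients explode (assumed index)
delicate_params = [p for n, p in model.named_parameters() if n.startswith(delicate)]
other_params = [p for n, p in model.named_parameters() if not n.startswith(delicate)]

optimizer = torch.optim.AdamW([
    {"params": other_params, "lr": 5e-7},      # base learning rate
    {"params": delicate_params, "lr": 1e-7},   # reduced LR for the delicate layer
])
```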
Here's an initial implementation of Inf-CLIP with GmP, if you're interested: github.com/zer0int/Inf-CLIP
Thanks for the heads up.
First, thanks for all your work on this repo, it's great stuff!
After fine-tuning, how do I correctly save the text encoder for use with CLIPTextModel.from_pretrained & StableDiffusionXLPipeline.from_pretrained?

After converting back, I tried:
text_encoder = original_model.transformer
text_encoder_state_dict = text_encoder.state_dict()
torch.save(text_encoder_state_dict, 'ft-checkpoints/text_encoder_state_dict.pth')
but when I loaded the state_dict onto the text_encoder from the SDXL pipeline, I got:
[rank0]: RuntimeError: Error(s) in loading state_dict for CLIPTextModel:
[rank0]: Missing key(s) in state_dict: ...
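A quick way to compare the keys (a sketch; the SDXL base repo id here is an assumption):

```python
# Hedged sketch: compare the saved state_dict keys with what the pipeline's
# CLIPTextModel expects, to see the naming/prefix mismatch behind the error.
import torch
from transformers import CLIPTextModel

text_encoder = CLIPTextModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder"
)
saved = torch.load("ft-checkpoints/text_encoder_state_dict.pth", map_location="cpu")

expected, got = set(text_encoder.state_dict()), set(saved)
print("missing:", sorted(expected - got)[:10])     # keys the model expects but the file lacks
print("unexpected:", sorted(got - expected)[:10])  # keys in the file the model doesn't expect
```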
Any additional information you want to provide would be appreciated.
My goal is to use the fine-tuned CLIP ViT-L when I train the SDXL UNet & maybe CLIP-G, then save the final fine-tuned model in diffusers/safetensors format. I'm using a custom accelerate FSDP script I wrote to train SDXL.
Thanks for the great repo!
Also, have you thought about using accelerate FSDP cpu_offload to increase the batch size?
I ran some quick tests on my SDXL trainer, and AdaBelief works fine with FSDP cpu_offload and sharding the UNet.
It should only take some easy changes to your script to add cpu_offload for an increased batch size & sharding for multi-GPU training.
Once I can get the fine-tuned CLIP ViT-L working in my SDXL training script, I'll test out adding FSDP to your CLIP training script.