MeZo Forward Pass Implementation #601
This implementation is very interesting IMO and fits the goals of the PEFT library. Does anyone want to give it a try? |
I would say it's better as a third-party showcase. I don't think it makes sense to add it as a core feature of the library yet. Maybe, based on adoption and community response, we could revisit this. |
Hi @younesbelkada @sayakpaul! I am also interested in this change. From my perspective, there are two ways to implement it: 1) it would be nice to add a showcase example of how to combine peft with their trainer, e.g. in the |
Hello everyone, I have read the paper today. I concur with Sayak that before investing effort into a custom Trainer for it, we need to gauge the interest in and performance of MeZO. I like approach 1 suggested by @Bearnardd as a starting point. |
I completely agree with your thoughts, @pacman100. Nowadays, numerous new ideas and papers emerge daily, making it impractical to dedicate time to including each one without first assessing the community's interest and the actual performance of a particular solution. A simple example is a pretty cheap alternative :) |
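For context, the core of MeZO is a zeroth-order update built from two forward passes: perturb the trainable parameters along a random direction, measure the loss difference, and step along that direction. Below is a minimal sketch of that step; `mezo_step`, `zo_perturb`, and the hyperparameter values are illustrative, not an existing PEFT or MeZO API, and the reference implementation lives in the linked `trainer.py`.

```python
# Minimal sketch of a MeZO-style zeroth-order step (two forward passes, no backward).
# Names and hyperparameters are illustrative; see princeton-nlp/MeZO for the real trainer.
import torch

def zo_perturb(params, seed, eps, scale):
    # Regenerate the same Gaussian direction z from the seed so z is never stored.
    torch.manual_seed(seed)
    for p in params:
        z = torch.randn(p.shape, device=p.device, dtype=p.dtype)
        p.data.add_(scale * eps * z)

def mezo_step(model, inputs, loss_fn, lr=1e-6, eps=1e-3):
    params = [p for p in model.parameters() if p.requires_grad]
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    zo_perturb(params, seed, eps, +1.0)              # theta + eps * z
    with torch.no_grad():
        loss_plus = loss_fn(model, inputs)
    zo_perturb(params, seed, eps, -2.0)              # theta - eps * z
    with torch.no_grad():
        loss_minus = loss_fn(model, inputs)
    zo_perturb(params, seed, eps, +1.0)              # restore theta

    grad_est = (loss_plus - loss_minus) / (2 * eps)  # projected gradient estimate
    torch.manual_seed(seed)                          # regenerate the same z for the update
    for p in params:
        z = torch.randn(p.shape, device=p.device, dtype=p.dtype)
        p.data.add_(-lr * grad_est * z)              # SGD step along the noise direction
    return loss_plus
```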
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
Aw, please implement MeZO, guys. This could allow near-lossless, compressed, fully fine-tuned adapters at true inference cost. Poor man's training would no longer have an inferior connotation. Please @sayakpaul @pacman100 @Bearnardd ? |
Yes please |
I got this to work with bnb: training a 3B model with 5 GB of RAM using forward passes (effectively speeding up my training, since there's no need to do backward passes). I think it's worthwhile to integrate this, especially since it works alongside bnb. |
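For anyone wanting to reproduce a setup like the one described above, here is a hedged sketch of combining a bitsandbytes 4-bit base model with a PEFT LoRA adapter, so that only the non-quantized adapter weights are trainable; those are the parameters a MeZO-style step would perturb. The model id and LoRA hyperparameters are placeholders.

```python
# Sketch: 4-bit bitsandbytes base model + LoRA adapter from PEFT.
# Only the LoRA weights require grad, so a zeroth-order step touches just those tensors.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "openlm-research/open_llama_3b"  # placeholder ~3B causal LM
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable

trainable = [p for p in model.parameters() if p.requires_grad]  # what MeZO would perturb
```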
From 90 it/s to 60 it/s with MeZO |
Nice! @thistleknot When you say "bnb", do you mean 4-bit? And when you say "from 90 it/s to 60 it/s with MeZO", is 90 referring to single-forward-pass speeds without MeZO? |
Yes and yes
|
@thistleknot would the finetuned model export as nf4? Would MeZo's predictions be made in nf4 too? How does that work? |
Yes and yes. The MeZO function is simply a feed-forward (I'm paraphrasing) compute_loss function, but it's plug-and-play with the LoRA setup, with no integration qualms that I can see. I was trying to get it to work with GPTQ but was unable to get a working setup, and I'm reading that even if I did, it would still be limited to a LoRA adapter. So the best you can get with this without peft/LoRA is simply faster training. Shame ReLoRA isn't extended to anything other than llama. |
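The "plug and play" point can be illustrated as a custom Trainer whose training step runs only forward passes. This is a rough sketch using the `mezo_step` helper sketched earlier (not a Transformers or PEFT API), in the spirit of the custom trainer in the MeZO repo; the real MeZO trainer also bypasses the regular optimizer/backward machinery, which this sketch glosses over.

```python
# Rough sketch: override Trainer.training_step to do a forward-only MeZO-style update.
# `mezo_step` is the illustrative helper above; the training_step signature may differ
# slightly across transformers versions, and the real MeZO trainer also disables the
# standard optimizer/backward path.
from transformers import Trainer

class MeZOTrainer(Trainer):
    def training_step(self, model, inputs):
        model.eval()  # no autograd graph needed; we only run forward passes
        inputs = self._prepare_inputs(inputs)

        def loss_fn(m, batch):
            return m(**batch).loss

        loss = mezo_step(model, inputs, loss_fn)  # two forward passes + in-place update
        return loss.detach()
```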
Great work, man, @thistleknot. Shame indeed, ReLoRA has so much potential in distributing dense weights. Just wondering, do you think MeZO could fully fine-tune in fp8? I'm not even sure the HF format allows fp8 safetensors. If so, 2x less memory would be nice with little to no loss. I've already got Flash Attention 2 working :) So close to fully fine-tuning larger models on consumer hardware. |
You can fine-tune with a lower fp, but... you can't save the model! Hence, I'm stuck with LoRA. I tried auto-gptq but couldn't get it to work, and I don't see any instructions on how to save a GPTQ model; Transformers doesn't support it yet. If you know the process, I'm all ears, because I'd much rather not use a LoRA adapter, but it is what it is. I'm using layers of training to accommodate the fact that I have to use LoRA.
|
I think it would be quite tedious to change and save GPTQ models on the fly. |
bigscience-workshop/petals#273 I was reminded of this PR, which managed to save 8-bit models :) and learned that PyTorch is working on e5m2 and e4m3 FP8 implementations, which should be better suited to transformer models. |
Got to wait for it to be merged. For now, I'm using a fancy layering setup. I was thinking I would make LoRA optional (controlled by a parameter) once I figure out the 'best' solution (I would prefer to quantize weights, use MeZO, and save quantized weights, but as you know that isn't fully implemented yet). In the meantime, for the sake of my sanity and testing, I'm simply using 1 document + 1 use case (SQuAD) with LoRA (16GB P5200 in an m7730, best $600 I ever spent). Then maybe one day a magically better method will come along, like QLoRA or saving GPTQ models. |
Feature request
https://github.com/princeton-nlp/MeZO/blob/main/large_models/trainer.py
Motivation
Memory efficiency while training
Your contribution
Willing to test, train, and write up bug reports.