ML Flow Checkpointing #1938
Comments
Hey, thanks for reporting this. I noticed that our callback doesn't inherit At least, do you notice that the yaml is saved to mlflow?
I think @awhazell added support for mlflow and might be able to help.
I can confirm the config yaml is saved as an artifact, but the checkpoints and the end model are not. A fix could be as simple as inheriting the callback @NanoCode012 linked (or adding it to the callbacks list separately), but we would need to make sure it doesn't conflict with any of the setup from
Hey @awhazell, what potential conflicts were you thinking of? My only concern may be duplicate logs due to Regarding the change needed, I believe we can just import the callback and append to the
I was thinking about whether setting mlflow options in both the trainer kwargs and env variables could cause issues, but I think you're right and it shouldn't be a problem; they should always be consistent anyway. Opened a PR here: #1976
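The fix discussed in the comments (append the callback to the trainer's callback list without registering it twice, to avoid duplicate logs) could be sketched roughly as below. This is only an illustration under assumptions, not axolotl's actual code: `ArtifactLoggingCallback` and `append_callback_once` are hypothetical names, and the stand-in class takes the place of whichever MLflow callback was linked in the thread.

```python
from typing import Any, List, Type


class ArtifactLoggingCallback:
    """Hypothetical stand-in for the MLflow callback linked in the thread."""


def append_callback_once(callbacks: List[Any], callback_cls: Type[Any]) -> List[Any]:
    """Append an instance of callback_cls unless one is already registered.

    Skipping the append when an instance is present is one way to avoid
    the duplicate-logging concern raised in the comments.
    """
    if any(isinstance(cb, callback_cls) for cb in callbacks):
        return callbacks
    callbacks.append(callback_cls())
    return callbacks


# Usage: the second call is a no-op because an instance already exists.
callbacks: List[Any] = []
append_callback_once(callbacks, ArtifactLoggingCallback)
append_callback_once(callbacks, ArtifactLoggingCallback)
```

Checking by type rather than by identity means a callback added elsewhere in the setup is also detected, which matters if several code paths can register it.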
Expected Behavior
With the config below, I was hoping the model checkpoints would be saved as artifacts, but only the metrics and the config are saved.
Is this expected to work or am I missing something?
Thanks for looking over!
Current behaviour
No model checkpoints.
Steps to reproduce
I use the docker image:
"winglian/axolotl:main-latest"
hf login
then
accelerate launch -m axolotl.cli.train theconfig.yml
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
winglian/axolotl:main-latest
Acknowledgements