Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make training on the MPS device work #131

Merged
merged 20 commits into from
Jan 24, 2025
Merged

Make training on the MPS device work #131

merged 20 commits into from
Jan 24, 2025

Conversation

dirkgr
Copy link
Member

@dirkgr dirkgr commented Jan 2, 2025

No description provided.

@dirkgr
Copy link
Member Author

dirkgr commented Jan 17, 2025

@epwalsh , some thoughts? This now has a train_single command that does almost everything for you. Most of the time you should be able to take whatever command you have failing on the big cluster, replace launch with train_single, and repro the error on your Mac. You'll have to do some stuff to make it fit into memory, but that can't be automated.

However, another view could be that "your local machine" is a "cluster" like any other. Then we don't need another command. But it's a little weird that the special automatic munging of configs we do only happens when your cluster is set to "local". Also hard to write the code for that.

@epwalsh
Copy link
Member

epwalsh commented Jan 17, 2025

I think having the separate command is fine

@dirkgr dirkgr marked this pull request as ready for review January 18, 2025 00:09
@dirkgr dirkgr requested a review from epwalsh January 18, 2025 00:09
@@ -76,48 +73,6 @@ def build_trainer_config(common: CommonComponents) -> TrainerConfig:
cancel_check_interval=10,
),
)
.with_callback(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That callback doesn't initialize properly, because some of those datasets don't exist, and none of the other sizes have it.

Copy link
Member

@epwalsh epwalsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, other than the changes to the 1B config

@dirkgr dirkgr enabled auto-merge (squash) January 24, 2025 21:21
@dirkgr dirkgr merged commit 075a36a into main Jan 24, 2025
14 checks passed
@dirkgr dirkgr deleted the dirkg/MPS branch January 24, 2025 21:23
Copy link

This PR has been released in v1.8.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants