Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update maxtext checkpoitin test to add async #571

Merged
merged 7 commits into from
Jan 27, 2025

Conversation

anfals
Copy link
Collaborator

@anfals anfals commented Jan 18, 2025

Description

Updated maxtext_checkpointing test to test both async and sync

Tests

Minor tweaks to calling maxtext e2e test. Ran the test locally with same config

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run one-shot tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

Copy link
Collaborator

@RissyRan RissyRan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Have you tested it on airflow yet? sometimes the cmds will parse differently. Would be nice to have one run in the environment.

@anfals
Copy link
Collaborator Author

anfals commented Jan 24, 2025

LGTM! Have you tested it on airflow yet? sometimes the cmds will parse differently. Would be nice to have one run in the environment.

By this do you mean testing in my own dev airflow environment?

@RissyRan
Copy link
Collaborator

LGTM! Have you tested it on airflow yet? sometimes the cmds will parse differently. Would be nice to have one run in the environment.

By this do you mean testing in my own dev airflow environment?

Yes. That's right.

@anfals
Copy link
Collaborator Author

anfals commented Jan 26, 2025

LGTM! Have you tested it on airflow yet? sometimes the cmds will parse differently. Would be nice to have one run in the environment.

By this do you mean testing in my own dev airflow environment?

Yes. That's right.

@RissyRan I'm running into many issues trying to get my XLML dev environment stable. But, I can see that the the currently running tests are running with async checkpointing successfully: https://efdb2a1d6c2c435b8b7de77690c286a9-dot-us-central1.composer.googleusercontent.com/dags/maxtext_checkpointing/grid?task_id=maxtext-checkpointing-stable-v4-8.run_model.wait_for_workload_completion&dag_run_id=scheduled__2025-01-25T10%3A00%3A00%2B00%3A00&tab=logs

From the dev tests I did get running before my environment started crashing, I did see that the arg passing in the xlml test I updated had some issues, so I've instead explicitly passed args for everything. https://3e3ff8d06f3a490aa2f9778c110f8adf-dot-us-central1.composer.googleusercontent.com/dags/maxtext_checkpointing/grid?execution_date=2025-01-26+17%3A55%3A17.636853%2B00%3A00&task_id=maxtext-checkpointing-nightly-sync-2xv4-16&tab=details&dag_run_id=manual__2025-01-26T17%3A55%3A17.636853%2B00%3A00 shows that all the test cases I expect are being created (for both sync and async). And the arg parsing is less unusual now :D

I'm thinking it might be easier to merge this and monitor the next run on XLML, rather than spend more time trying to fix my dev environment. If there are issues with the test, I can update.

@RissyRan
Copy link
Collaborator

Thank you for the test!

@anfals anfals merged commit 5dbc596 into master Jan 27, 2025
6 checks passed
@anfals anfals deleted the anfals/maxtext_checkpointing_async branch January 27, 2025 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants