TFX component never completes even though Vertex AI custom job succeeds / fails #6630
Comments
Following up here. I was able to copy the |
@clee421, per your previous comment, if this issue is resolved for you, please close it. Thank you!
@singhniraj08 Well, the bug is still there. I'm copying your TFX file over as a workaround, but I would still need a fix.
Hi, @clee421. Thanks for investigating and giving the details. It helped a lot to understand your problem. While looking at your example code, I found out that there is a version incompatibility between From my own experiment, I got the desired result like Additionally, tfx.extensions.google_cloud_ai_platform.Trainer which internally uses
I don't have a clear idea how you wrapped If this doesn't work, please let me know. |
This poses a problem for us then. We specifically monkey patch |
I see. Please try with v1beta.types.job_state, and let us know if it works. |
@briron Thanks for the suggestion. I was able to test in an isolated environment with |
System information
Environment (Interactive Notebook, Google Cloud, etc): GCP GKE Pod
pip freeze output: N/A (I can provide the dependencies if it's deemed applicable)
Describe the current behavior
I have a pipeline that wraps runner.start_cloud_training and runs a custom job on Vertex AI, which either succeeds or fails. The TFX component hangs and never completes, regardless of whether the custom job has finished.
Describe the expected behavior
I would expect the TFX component to complete when the vertex custom job completes.
Standalone code to reproduce the issue
I've debugged this as best I could, and here are my findings.
I believe this line here:
Doesn't ever complete because the return value of
client.get_job_state(response)
is not an enum.
Here is the script I used to test and validate my hypothesis:
When running the snippet above I have the output
Converting the number to an enum by
JobState(custom_job.state)
fixes the problem.
I hope this helps. I would be more than happy to provide more information!
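To illustrate the hypothesis, here is a minimal, self-contained sketch. It does not use the real Vertex AI client: `JobState` below is a hypothetical stand-in built on plain `enum.Enum`, its member values are illustrative, and `is_terminal` and `TERMINAL_STATES` are names made up for this example. The real library's enum may compare differently with integers, so treat this only as a demonstration of how a raw integer state can fail a membership check that an enum member would pass.

```python
import enum


class JobState(enum.Enum):
    """Hypothetical stand-in for the job state enum (values are illustrative)."""
    JOB_STATE_RUNNING = 3
    JOB_STATE_SUCCEEDED = 4
    JOB_STATE_FAILED = 5


# States a polling loop would treat as "the job is done".
TERMINAL_STATES = {JobState.JOB_STATE_SUCCEEDED, JobState.JOB_STATE_FAILED}


def is_terminal(state):
    """Accept an enum member or a raw integer and report whether the job is done."""
    if not isinstance(state, JobState):
        # Coerce the raw number to an enum member, as in JobState(custom_job.state).
        state = JobState(state)
    return state in TERMINAL_STATES


raw_state = 4  # what the client appears to return instead of an enum member

# A plain Enum member never equals an int, so this check fails and a loop
# waiting on it would spin forever.
print(raw_state in TERMINAL_STATES)

# After coercion the check behaves as expected.
print(is_terminal(raw_state))
```

With this stand-in, the direct membership check on the raw integer is false while the coerced check is true, which matches the behavior described above.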