Previously working Dataflow jobs started crashing when sharding #10971
Comments
The validation/test split is only 52,880 examples and also 512 shards, so maybe the training shards get really big and take down a Dataflow worker somehow?
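For reference, a quick sketch for comparing per-split example counts, shard counts, and sizes from the dataset info of a build that did complete (the builder name is a placeholder):

```python
import tensorflow_datasets as tfds

# Placeholder builder name; point this at the data_dir of a build that completed.
builder = tfds.builder("my_audio_dataset")
for name, split in builder.info.splits.items():
    print(
        f"{name}: examples={split.num_examples}, "
        f"shards={split.num_shards}, bytes={split.num_bytes}"
    )
```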
@carlthome could you provide more details so that I could look into it?
The last known time the DatasetBuilder worked on Dataflow was Dec 6, 2023, with the following requirements.txt.
Running the same today results in
Perhaps there have been unintended changes on the Dataflow side, rather than in the Beam pipeline. This is in the Dataflow job logs:
Maybe related: https://issuetracker.google.com/issues/368255186
Thanks for providing this information! I think you're right that this could be an issue with Dataflow. I'm particularly suspicious of … I suggest that you try:
We get the same error with 2.60.0 and 4.9.7, unfortunately.
Have you also updated …? At this point I recommend creating a minimum reproducible example that fails locally; otherwise, it seems you need to resolve the dependency conflicts in your environment and make sure they are compatible with Dataflow.
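For what it's worth, a minimal sketch of what such a local repro could look like: a tiny Beam-based `GeneratorBasedBuilder` run with Beam's DirectRunner, so the same write/shard path is exercised without Dataflow. The builder name, features, and example counts below are placeholders, not taken from this project.

```python
import numpy as np
import tensorflow_datasets as tfds


class TinyAudioLike(tfds.core.GeneratorBasedBuilder):
  """Hypothetical stand-in for the real audio DatasetBuilder."""

  VERSION = tfds.core.Version("1.0.0")

  def _info(self) -> tfds.core.DatasetInfo:
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "audio": tfds.features.Tensor(shape=(16_000,), dtype=np.float32),
            "label": tfds.features.ClassLabel(names=["a", "b"]),
        }),
    )

  def _split_generators(self, dl_manager):
    return {
        "train": self._generate_examples(num_examples=1_000),
        "validation": self._generate_examples(num_examples=100),
    }

  def _generate_examples(self, num_examples):
    import apache_beam as beam  # Local import, as recommended for Beam datasets.

    def _make_example(i):
      # Synthetic audio so the repro is self-contained.
      return i, {
          "audio": np.random.rand(16_000).astype(np.float32),
          "label": "a" if i % 2 == 0 else "b",
      }

    return beam.Create(range(num_examples)) | beam.Map(_make_example)


if __name__ == "__main__":
  builder = TinyAudioLike(data_dir="/tmp/tfds_repro")
  # With no Beam options given, TFDS falls back to the DirectRunner.
  builder.download_and_prepare()
```

Scaling `num_examples` up (or making the synthetic audio longer) should show whether the failure is tied to shard size rather than to Dataflow itself.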
Good point! I've now tried with the latest PyPI versions and Beam 2.61.0, yet I experience the same crash. Here's the Dataflow image SBOM:
I have a DatasetBuilder for terabytes of audio that used to work and ran to completion on Dataflow, but it has stopped working. The code is unchanged and the data shouldn't have changed. We've been unable to debug this, so I'm looking into whether there are unexpected changes in how tensorflow-datasets relies on Beam. We're using the Apache Beam Python 3.10 SDK 2.60.0 on Dataflow V2, and strangely the test and validation splits complete but training does not. Is there some size limitation in the sharding logic (433,890 serialized_examples, 1,024 NumberOfShards, 512 written_shards)?