Cmip6 wrf wus #247

Open
wants to merge 10 commits into
base: master
Conversation

thenaomig

This is a test with one of many simulations from CMIP6 downscaled with WRF at UCLA.

@andersy005
Member

@thenaomig and I have been working on this recipe at the post-AMS Pangeo meeting :)

@andersy005
Member

pre-commit.ci autofix

@andersy005
Member

pre-commit.ci autofix

@andersy005
Member

/run cmip6-wrf-wus

@cisaacstern
Member

@andersy005 unfortunately, the backend service is currently broken, following my failed attempt to upgrade the pangeo-forge-recipes version used there. I am working on getting it fixed and will ping you here once that's the case. (Currently, jobs will submit, but they will fail because of version mismatching between the backend service client and Dataflow workers.)

@andersy005
Member

thank you for the heads up, @cisaacstern! yeah, we can definitely wait until the issue is resolved. i just wanted to make sure @thenaomig was able to submit the recipe for the end of the workshop.

@cisaacstern
Member

Thanks for the contribution, @thenaomig! We'll have this all working again shortly. 🙏

@andersy005
Member

pre-commit.ci autofix

@andersy005
Member

andersy005 commented Jan 13, 2023

@cisaacstern, we are getting

TypeError: object of type FilePattern not serializable

do you happen to know why this is happening now (as far as i can tell, this issue wasn't there until we switched to a dict of recipes)

@andersy005
Member

@cisaacstern, we are getting

TypeError: object of type FilePattern not serializable

do you happen to know why this is happening now (as far as i can tell, this issue wasn't there until we switched to a dict of recipes)

never mind.. @thenaomig found the problem.

@andersy005
Member

/run cesm2_r11i1p1f1_ssp370

@cisaacstern
Member

@andersy005 unfortunately job submission failed.

I've opened pangeo-forge/pangeo-forge-orchestrator#220 to track down why.

@andersy005
Member

thank you for the update, @cisaacstern! i'll take a look at the issue you linked to see if i can help diagnose it.

@pangeo-forge
Contributor

pangeo-forge bot commented Jan 18, 2023

The test failed, but I'm sure we can find out why!

Pangeo Forge maintainers are working diligently to provide public logs for contributors.
That feature is not quite ready yet, however, so please reach out on this thread to a
maintainer, and they'll help you diagnose the problem.

@cisaacstern
Member

So there are now two concurrent issues going on here:

  1. The production deployment still appears to have a bug related to job submission, as discussed in pangeo-forge/pangeo-forge-orchestrator#220 ("Is dataflow job submission still broken?").

  2. The last message from the pangeo-forge app reporting a test run failure is the result of my manually deploying this job (as part of debugging problem 1). The backend logs show this error:

    RuntimeError: botocore.exceptions.NoCredentialsError: Unable to locate credentials [while running 'Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/scan_file/Execute-ptransform-56']

    which I believe is related to the fact that the data for this recipe are being pulled from an s3:// url.

Regarding problem 2, is there an HTTP endpoint for this data?
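For context on the question above: when a bucket allows public reads, its objects are generally also reachable over HTTPS through S3's virtual-hosted-style URL form, which would avoid needing AWS credentials on the workers. A minimal sketch of that URL rewrite, assuming a publicly readable bucket; the function name, bucket, and key below are hypothetical placeholders, not the actual paths used by this recipe:

```python
from urllib.parse import urlparse


def s3_to_https(s3_url: str) -> str:
    """Rewrite s3://bucket/key as a virtual-hosted-style HTTPS URL.

    Assumption: the bucket permits anonymous reads; otherwise the
    HTTPS request will be rejected just as the s3:// one was.
    """
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    key = parsed.path.lstrip("/")
    return f"https://{parsed.netloc}.s3.amazonaws.com/{key}"


# Hypothetical bucket and key, for illustration only.
print(s3_to_https("s3://example-bucket/some/path/file.nc"))
```

Whether this applies here depends on how the bucket is configured, which this thread does not establish.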

@thenaomig
Author

Regarding problem 2, is there an HTTP endpoint for this data?

Hmm, I don't suppose this helps?

@cisaacstern
Member

Apparently our logs hosting service (Papertrail) is currently down? Can't catch a break! 🤦

I'll check why the CI synchronize task is hanging as soon as it becomes available.

@andersy005
Member

Apparently our logs hosting service (Papertrail) is currently down? Can't catch a break! 🤦

sorry @cisaacstern :(. thank you for the updates

If necessary, I would be happy to assist in debugging this tomorrow

@cisaacstern
Member

Wow looks like pangeo-forge/pangeo-forge-orchestrator#221 did solve the hanging CI.

I'll try re-triggering the test run of the recipe now. 🤞

@cisaacstern
Member

/run cesm2_r11i1p1f1_ssp370

@cisaacstern
Member

The last test run submission failed as well. I've done a bit of digging on this, and discovered that this recipe appears to be generating an unusually large Beam pipeline artifact.

Brief background: when a job is submitted to Dataflow, the recipe module is compiled to an Apache Beam pipeline object, which is then serialized (pickled) and uploaded (cached) to Google Cloud Storage (GCS). When Dataflow starts up, it grabs this serialized artifact from GCS, de-serializes (un-pickles) it, and uses it to start the pipeline.
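The serialize, upload, de-serialize cycle described above can be sketched with the standard library alone. This is an illustration, not the actual submission code: a plain dict stands in for the Beam pipeline object, an in-memory dict stands in for GCS, and the names `fake_gcs` and `artifact_size_mb` are hypothetical.

```python
import pickle

# Stand-in for a compiled pipeline: in reality this would be an
# Apache Beam pipeline object built from the recipe module.
pipeline = {"steps": ["scan_file", "finalize"], "inputs": list(range(1000))}

# Stand-in for a GCS bucket (hypothetical; a real job uploads to GCS).
fake_gcs = {}


def artifact_size_mb(blob: bytes) -> float:
    """Size of a serialized artifact in MB (1 MB = 10**6 bytes)."""
    return len(blob) / 1e6


# Submission side: serialize (pickle) and cache the artifact.
blob = pickle.dumps(pipeline)
fake_gcs["pipeline.pickle"] = blob

# Worker side: fetch the artifact and de-serialize (un-pickle) it.
restored = pickle.loads(fake_gcs["pipeline.pickle"])

print(f"artifact size: {artifact_size_mb(blob):.4f} MB")
print("round trip ok:", restored == pipeline)
```

Because pickle stores everything reachable from the object being serialized, the artifact grows with whatever data the pipeline object holds references to.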

Currently, we have around 150 serialized pipeline artifacts stored in GCS from recent Pangeo Forge recipe runs. The majority of these artifacts are in the range of 0.15-0.30 MB (150-300 KB).

The one job which has been run from this PR is the job which I mentioned having manually deployed during the course of debugging. This was the job associated with recipe run 1486. (That link doesn't make this fact too obvious, but you'll note that the Git SHA there is 34a4f3f, which is part of this PR.)

The pipeline artifact for (the manually deployed) recipe run 1486 is 4.39 MB (I've removed the other x tick labels for clarity):

[Screenshot, 2023-01-20: plot of serialized pipeline artifact sizes, with only the x tick for recipe run 1486 labeled]
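To put these figures in proportion, here is a quick calculation using only the sizes quoted in this comment (the variable names are just for illustration):

```python
# Sizes quoted in this comment, in MB.
typical_low_mb, typical_high_mb = 0.15, 0.30
run_1486_mb = 4.39

# How many times larger is the run 1486 artifact than a typical one?
ratio_vs_high = run_1486_mb / typical_high_mb
ratio_vs_low = run_1486_mb / typical_low_mb

print(f"{ratio_vs_high:.1f}x to {ratio_vs_low:.1f}x the typical size")
```

That works out to roughly 15 to 29 times the size of a typical artifact.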

Though I can't say why the pipeline artifact is so large for this recipe, its size may be a clue as to why this particular recipe is causing worker timeout / OOM conditions, which I've also just documented a bit further in pangeo-forge/pangeo-forge-orchestrator#220 (comment).
