Syncing and NetCDF conversion for ESM1.5 #463
**Postscript?**

An option could be to run the conversion as a postscript, e.g.:

```yaml
postscript: my_script.sh
```

The command is run in payu here:

Lines 834 to 837 in e9bd1f4
Passing specific environment vars, such as the current output dir, could be done using …
**Sync Userscript?**

So the sync job has a …

**Adding Dependency Logic to Payu?**

Using PBS Job Dependency with … Getting the … Passing it to the sync call would require adding an argument to:

payu/payu/subcommands/sync_cmd.py

Line 61 in e9bd1f4

Or add it somewhere in:

Line 26 in e9bd1f4
A thing I am not too sure on is what dependency type to use. I think a big issue could be that if a dependency is not met, the job stays in the queue indefinitely. So using … There are also caveats in the NCI docs for Job Dependencies, say if job2 was submitted with … I think it'll be more straightforward to implement dependency logic with … Hope this makes sense — I have only really learnt about configuring PBS dependencies today, so I could fully be wrong.
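For illustration only, here is a minimal sketch (not payu code) of what capturing a job ID from qsub and submitting a dependent job could look like. The helper name, the script names, and the choice of the afterok dependency type are all assumptions for this sketch:

```python
import shlex
import subprocess as sp


def submit_with_dependency(script, depend_job_id=None):
    """Submit a PBS script, optionally held until another job succeeds.

    Hypothetical helper for illustration only; not part of payu.
    """
    if depend_job_id is not None:
        # afterok: run only once the named job finishes with exit status 0
        cmd = f'qsub -W depend=afterok:{depend_job_id} {script}'
    else:
        cmd = f'qsub {script}'
    # qsub prints the new job's ID on stdout, e.g. "12345678.gadi-pbs"
    return sp.check_output(shlex.split(cmd), text=True).strip()


# e.g. hold the sync job until the conversion job has succeeded:
# conversion_id = submit_with_dependency('UM_conversion_job.sh')
# sync_id = submit_with_dependency('sync.sh', depend_job_id=conversion_id)
```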
Thanks for writing these explanations up, these look like great ideas.
I think this sounds like a good option, especially if it makes it viable down the line to add in dependency logic.
We'll want the ESM1.5 NetCDF conversion to run both when syncing is enabled and when it isn't, and so I agree that this option wouldn't work in every case.
I found the online documentation about the different dependency types a bit unclear about how they actually behave. I ran a quick test on gadi for the … dependency type, with a first job that fails,

and then a dependent job which just prints out a message.

The dependent job was held in the queue, but it looks like it got deleted once the first job failed,
and so I'm wondering if this condition could work for holding the sync jobs? We'd probably need some way of reporting to the user if the sync job didn't run though, but I'm not sure of any good ways to do that.
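For reference, a rough sketch of the kind of two-job test described above, assuming the afterok dependency type and placeholder script names (neither is confirmed by the thread):

```python
import shlex
import subprocess as sp

# Submit a job that is expected to fail (script name is a placeholder).
failing_job = sp.check_output(
    shlex.split('qsub failing_job.sh'), text=True
).strip()

# Submit a second job that only prints a message, held behind the first job.
# With afterok, PBS holds it while the first job runs and deletes it if the
# first job exits with a non-zero status.
dependent_job = sp.check_output(
    shlex.split(f'qsub -W depend=afterok:{failing_job} print_message.sh'),
    text=True,
).strip()

print(f'Submitted {failing_job}, with {dependent_job} held behind it')
```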
This makes a lot of sense. The postscript option sounds a lot easier, so I'll test out running the conversion as a postscript.
I've just tried running the conversion as a postscript. It looks like it works well when the … E.g. adding the following to payu's postprocess:

```python
def postprocess(self):
    """Submit any postprocessing scripts or remote syncing if enabled"""
    self.set_userscript_env_vars()

    # First submit postprocessing script
    if self.postscript:
        envmod.setup()
        envmod.module('load', 'pbs')
        cmd = 'qsub {script}'.format(script=self.postscript)

        if needs_subprocess_shell(cmd):
            sp.check_call(cmd, shell=True)
        else:
            sp.check_call(shlex.split(cmd))
    ...
```

lets the following run ok:

```yaml
postscript: -v PAYU_CURRENT_OUTPUT_DIR -P ${PROJECT} -lstorage=${PBS_NCI_STORAGE}+gdata/access+gdata/hh5 ./scripts/NetCDF-conversion/UM_conversion_job.sh
```

Would these changes be ok to add? There could be a better way of passing the …
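As a side note, a minimal sketch of how a job script could pick up the directory forwarded by the `-v PAYU_CURRENT_OUTPUT_DIR` option above; the environment variable name matches the config line, but everything else below is illustrative only:

```python
import os
import sys
from pathlib import Path

# PAYU_CURRENT_OUTPUT_DIR is forwarded into the job's environment by the
# "-v PAYU_CURRENT_OUTPUT_DIR" option in the postscript line above.
output_dir = os.environ.get('PAYU_CURRENT_OUTPUT_DIR')
if output_dir is None:
    sys.exit('PAYU_CURRENT_OUTPUT_DIR is not set; was this submitted by payu?')

# Hypothetical conversion loop: the glob and the print stand in for whatever
# the real UM-to-NetCDF conversion job does with the output directory.
for fields_file in sorted(Path(output_dir).glob('*')):
    print(f'would convert {fields_file} to NetCDF')
```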
Oh that's awesome! There are log files that are created if …
Yay, that postscript almost works! I reckon those changes will be good to add.
Great discussion of the issues! It's good …

```yaml
postscript: -v PAYU_CURRENT_OUTPUT_DIR -P ${PROJECT} -lstorage=${PBS_NCI_STORAGE}+gdata/access+gdata/hh5 ./scripts/NetCDF-conversion/UM_conversion_job.sh
```

That is some code spaghetti. An option that has not been suggested so far is to add a … We should definitely discuss before trying to implement it, but thought I'd chuck it in as a possibility.
That is a good point. Another similar option to having it run in the collate job is to add a …
After discussing, we think it can be reduced to this:
Maybe …
I'm actually starting to think we should turn the delete option on by default, because most users wouldn't really have the ability to do detailed checks, and then the instructions to turn it off are clearer because it is commenting out an option that is already there. Otherwise we risk users naively running and filling the disk with a large amount of useless data.

If we didn't delete them, would the UM fields files be synced if syncing was turned on? I think they probably would, because we're not specifying any wildcards to ignore (@jo-basevi can advise on how to do this).
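As an aside on "wildcards to ignore", a generic illustration of the idea (not payu's actual sync configuration or file naming, which would need confirming), where files matching an ignore pattern are skipped when building the list to sync:

```python
import fnmatch

# Hypothetical file names: a raw UM fields file plus converted NetCDF output.
files_in_output = ['jan_fields.um', 'jan_fields.nc', 'ocean_daily.nc']

# Hypothetical ignore pattern that would skip the un-converted fields file.
ignore_patterns = ['*.um']

to_sync = [
    name for name in files_in_output
    if not any(fnmatch.fnmatch(name, pat) for pat in ignore_patterns)
]
print(to_sync)  # ['jan_fields.nc', 'ocean_daily.nc']
```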
Sorry, I think I made a mistake yesterday: it would not have access to the conda environment. During the run PBS job, payu is run using the path to the python executable and the path to where payu is installed:

As it never actually loads the payu environment on the … We could add access to the python executable to the environment of the postscript call, similar to this PR - #491? In a way it could be useful to load the payu module in the PBS payu-run job to set the environment variables, but I am unsure of a clean way to do it without specifying a payu modulepath and filename in …
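For illustration, a sketch of that kind of change (loosely inspired by the idea in #491, not its actual contents): copy the current environment, put the running interpreter's bin directory on PATH, and hand that environment to the qsub call. The function name and structure here are assumptions, not payu's real postprocess code.

```python
import os
import shlex
import subprocess as sp
import sys


def submit_postscript(postscript):
    """Illustrative only: submit a postscript with payu's python on PATH."""
    env = os.environ.copy()
    # Prepend the directory containing the python executable running payu,
    # so the postscript side can find it without a module load.
    env['PATH'] = os.path.dirname(sys.executable) + os.pathsep + env.get('PATH', '')

    # Note: this only changes qsub's own environment; the job would still
    # need something like qsub's -V or -v options to see this PATH.
    sp.check_call(shlex.split(f'qsub {postscript}'), env=env)
```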
Some related updates on using the …

Outside of the PBS job:

Inside the PBS job:

When the …

STDERR files in each case:

The only one that worked was to omit the …
I thought one way to simplify the script call would be to move the …

and then in …

which would have the benefit of controlling the … This gives a qsub usage error:

Not sure if I'm interpreting this correctly, but does it look like command line arguments are only allowed when submitting a command instead of a script? If we had the payu module loaded into the payu-run job, do you think it would be possible to omit the …

With the above, we might have to add the … I might be incorrect about quite a bit of this though!
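To make that constraint concrete, a small sketch of the workaround this thread already uses: if, as the usage error above suggests, qsub won't take positional arguments after a script path, a value can instead be handed over through the job's environment with -v. The surrounding code is illustrative; the variable and script names match the earlier postscript line.

```python
import shlex
import subprocess as sp

script = './scripts/NetCDF-conversion/UM_conversion_job.sh'

# "qsub script.sh some_argument" is rejected, so export the value into the
# job's environment instead and let the script read it from there.
cmd = f'qsub -v PAYU_CURRENT_OUTPUT_DIR {script}'
sp.check_call(shlex.split(cmd))
```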
That's really interesting; maybe module loading payu as part of payu-run may not be the best idea.

I guess one way to test it would be to add payu to the modules loaded in …

Though I don't think having to specify a payu version is a great idea. Another way to avoid loading the payu module in the postscript could be to pass …
The NetCDF conversion script for ESM1.5 will be run as an archive stage userscript, where it is submitted to the queue as a PBS job to avoid holding up the simulation. If syncing is enabled in config.yaml, there's no guarantee that the conversion job will have finished by the time the syncing job is run. E.g. when testing several month-long simulations with syncing enabled, some months' NetCDF files were not all copied over to the syncing directory.

From what I can understand, the collation step gets around a similar issue by calling the sync step from within the collation job:

payu/payu/subcommands/collate_cmd.py

Lines 110 to 112 in 2391485

but a similar approach might not be possible with the external conversion userscript. Just flagging this now so that we can discuss ideas to handle this.
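For context, a rough, self-contained illustration of the ordering described above (placeholder functions, not the actual payu code at the lines referenced): because the sync submission happens at the tail end of the collation job itself, syncing can never start before collation has finished.

```python
def collate_outputs(output_dir):
    # Placeholder for the real collation work
    print(f'collating {output_dir}')


def submit_sync_job(output_dir):
    # Placeholder for submitting the sync PBS job
    print(f'submitting sync job for {output_dir}')


def run_collation_job(output_dir, sync_enabled):
    """Sync is only triggered once collation has completed."""
    collate_outputs(output_dir)
    if sync_enabled:
        submit_sync_job(output_dir)


run_collation_job('archive/output000', sync_enabled=True)
```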