Does jobflow-remote support the pilot job model? #86
Comments
@Andrew-S-Rosen our HPC allows a lot of jobs, just not in parallel (🙈).
Hi @Andrew-S-Rosen, if I understand correctly, you are looking for the equivalent of I called the equivalent of I would say that if this implementation proves to be stable, it can probably be extended to mimic the
Thanks for the reply, @gpetretto! It sounds like it may be similar to In any case, it sounds like you have put together the necessary backbone for a feature at least somewhat similar to this. Good to know. Thanks for the update!
Thanks for explaining your use case. Maybe I am being optimistic, but I would say that the most annoying part was having something that handles multiple If not having this feature is a major limitation in adopting jobflow-remote, I can prioritize its implementation. However, I would need someone to test it.
I agree that it seems quite doable! This isn't a blocker for me at the moment --- the blocker for me is that I have no students yet. 😅 But it will eventually be something needed by MP when they decide to switch their production campaign from FireWorks to jobflow-remote (which I imagine is not something that will happen immediately). Tagging @esoteric-ephemera and @munrojm for context. Regardless, I would be happy to test if/when the time comes.
I still haven't used jobflow-remote yet (my apologies), but I had a question I wanted to ask.
The conventional way people used FireWorks was to have each `@job` be a Slurm job. This can be incredibly inefficient in practice due to queuing policies on the HPC cluster. Lately, many workflow tools built for HPC have adopted a "pilot job" model. In this framework, the daemon will request a Slurm allocation that will run many `@job`s concurrently, continually pulling in new `@job`s as old ones finish (until the walltime is reached). For example, if you are running VASP with 4 nodes per `@job`, you might have the daemon request one or more 20-node Slurm jobs, each of which would run at most 5 concurrent VASP `@job`s (and when one finishes, a new one is fetched from the database). In Materials Project land, we have called this "jobpacking" in FireWorks, although it was always very hacky.
Does jobflow-remote support such a model for workflow execution? I saw @JaGeo's issue about 10s of thousands of calculations, so I figure it might be the case since no HPC center would be happy with you launching 24,000 Slurm jobs, but I wanted to inquire.
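To make the arithmetic in the 20-node example concrete, here is a toy simulation of a single pilot allocation. It is only a sketch of the general idea, not jobflow-remote's or FireWorks' actual API: the function name `simulate_pilot`, its parameters, and the in-memory queue standing in for the job database are all hypothetical, and the sketch assumes each job's runtime is known in advance, which a real scheduler would not know.

```python
import heapq
from collections import deque


def simulate_pilot(job_durations, nodes_per_job, allocation_nodes, walltime):
    """Simulate one pilot allocation packing jobs until the walltime.

    Hypothetical helper for illustration only. Returns a tuple
    (slots, completed, leftover), where the last two are lists of job indices.
    """
    # Number of jobs that fit side by side, e.g. 20 nodes // 4 nodes = 5.
    slots = allocation_nodes // nodes_per_job

    # FIFO queue standing in for the job database.
    pending = deque(enumerate(job_durations))
    running = []  # min-heap of (finish_time, job_index)
    completed = []
    now = 0.0

    while pending or running:
        # Fill every free slot, as long as the next job can finish in time.
        while pending and len(running) < slots:
            idx, duration = pending[0]
            if now + duration > walltime:
                break  # strict FIFO: do not look past a job that will not fit
            pending.popleft()
            heapq.heappush(running, (now + duration, idx))

        if not running:
            break  # nothing running and nothing else can be started

        # Advance time to the next job completion, freeing one slot.
        now, idx = heapq.heappop(running)
        completed.append(idx)

    leftover = [idx for idx, _ in pending]
    return slots, completed, leftover


if __name__ == "__main__":
    # Seven 3-hour jobs, 4 nodes each, inside one 20-node, 5-hour allocation.
    slots, done, left = simulate_pilot([3.0] * 7, 4, 20, 5.0)
    print(f"concurrent slots: {slots}, completed: {done}, left in queue: {left}")
```

In this sketch, jobs that cannot finish before the walltime stay in the queue for the next allocation; a real implementation would also have to handle jobs that overrun their estimate.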