Does jobflow-remote support the pilot job model? #86

Closed · Fixed by #172

Andrew-S-Rosen opened this issue Feb 29, 2024 · 5 comments

Comments

Andrew-S-Rosen (Collaborator) commented Feb 29, 2024

I haven't used jobflow-remote yet (my apologies), but I have a question I wanted to ask.

The conventional way people used FireWorks was to have each @job be a Slurm job. This can be incredibly inefficient in practice due to queuing policies on the HPC cluster. Lately, many workflow tools built for HPC have adopted a "pilot job" model. In this framework, the daemon will request a Slurm allocation that will run many @jobs concurrently, continually pulling in new @jobs as old ones finish (until the walltime is reached). For example, if you are running VASP with 4 nodes per @job, you might have the daemon request one or more 20-node Slurm jobs, each of which would run at most 5 concurrent VASP @jobs (and when one finishes, a new one is fetched from the database). In Materials Project land, we have called this "jobpacking" in FireWorks, although it was always very hacky.
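In rough pseudocode, the pilot loop inside a single Slurm allocation would look something like the sketch below. This is purely illustrative: the `fetch_ready_job` and `run_job` helpers are hypothetical placeholders, not any actual jobflow-remote or FireWorks API.

```python
import time
from concurrent.futures import ProcessPoolExecutor

MAX_CONCURRENT = 5       # e.g. a 20-node allocation running 4-node VASP @jobs
WALLTIME_S = 48 * 3600   # stop pulling new work once the walltime is near


def fetch_ready_job():
    """Placeholder: pull the next READY @job from the database, or return None."""
    ...


def run_job(job):
    """Placeholder: launch the @job (e.g. a 4-node VASP run) and wait for it."""
    ...


def pilot_loop():
    """Keep MAX_CONCURRENT @jobs running, backfilling slots as @jobs finish."""
    start = time.monotonic()
    running = set()
    with ProcessPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        while time.monotonic() - start < WALLTIME_S:
            # drop finished futures, then refill the free slots with new @jobs
            running = {f for f in running if not f.done()}
            while len(running) < MAX_CONCURRENT:
                job = fetch_ready_job()
                if job is None:
                    break
                running.add(pool.submit(run_job, job))
            time.sleep(30)
```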

Does jobflow-remote support such a model for workflow execution? I saw @JaGeo's issue about tens of thousands of calculations, so I figure it might already be supported, since no HPC center would be happy with you launching 24,000 Slurm jobs, but I wanted to inquire.

JaGeo (Collaborator) commented Feb 29, 2024

@Andrew-S-Rosen, our HPC allows a lot of jobs, just not in parallel. (🙈)

gpetretto (Contributor) commented

Hi @Andrew-S-Rosen,

If I understand correctly, you are looking for the equivalent of mlaunch (or rlaunch multi) in FireWorks. At the moment I have implemented a version that works more or less like rlaunch multi 1, meaning it can run multiple jobflow Jobs in a single Slurm job, but only sequentially.
I have often used this functionality with FireWorks in the past, though I was always a bit skeptical of the advantages of rlaunch multi N with N > 1. But I suppose that depends on the limitations imposed by the computing center.

I called the equivalent of rlaunch multi 1 a "batch" job submission (open to renaming it if that does not convey the meaning). It is already implemented in the current version; I have just written the documentation and will push it in the coming days. I admit it is quite hackish, since I wanted to preserve the basic principle of jobflow-remote of not connecting to the DB from the worker, while keeping the advantage of FireWorks of not having to decide beforehand which Jobs will be executed inside a single Slurm job.
Also, I have only tried it on simple examples. I am still looking for testers, if anyone is interested. 😄

I would say that if this implementation proves to be stable, it can probably be extended to mimic rlaunch multi N.
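To make the idea concrete, here is a minimal, purely illustrative sketch of how a batch worker could claim and run Jobs sequentially from a shared staging directory without ever connecting to the DB. The directory layout and the `run_jobflow_job` helper are hypothetical assumptions for the sketch, not jobflow-remote's actual implementation.

```python
import shutil
import time
from pathlib import Path

# Hypothetical directories where the runner stages serialized Jobs for this worker.
HANDLE_DIR = Path("/scratch/jfr_batch/ready")
RUNNING_DIR = Path("/scratch/jfr_batch/running")
DONE_DIR = Path("/scratch/jfr_batch/done")
WALLTIME_S = 24 * 3600


def run_jobflow_job(job_dir: Path) -> None:
    """Placeholder: execute the serialized Job found in job_dir."""
    ...


def batch_loop() -> None:
    """Sequentially claim and run staged Jobs until none remain or the walltime nears."""
    for d in (HANDLE_DIR, RUNNING_DIR, DONE_DIR):
        d.mkdir(parents=True, exist_ok=True)
    start = time.monotonic()
    while time.monotonic() - start < WALLTIME_S:
        staged = sorted(HANDLE_DIR.glob("job_*"))
        if not staged:
            break  # nothing left to run inside this batch Slurm job
        job_dir = staged[0]
        claimed = RUNNING_DIR / job_dir.name
        try:
            job_dir.rename(claimed)  # atomic rename acts as a simple claim/lock
        except OSError:
            continue  # another worker process claimed it first
        run_jobflow_job(claimed)
        shutil.move(str(claimed), str(DONE_DIR / claimed.name))
```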

Andrew-S-Rosen (Collaborator, Author) commented

Thanks for the reply, @gpetretto!

It sounds like it may be similar to mlaunch. On MP, we would request (say) a 1024-node Slurm job and run (for instance) 4-node VASP calculations within that allocation, continually bringing in new jobs until the 48 hr walltime was hit. This is important because NERSC policies only allow two Slurm jobs to age at a time, plus there are discounts for large jobs.

In any case, it sounds like you have put together the necessary backbone for a feature at least somewhat similar to this. Good to know. Thanks for the update!

gpetretto (Contributor) commented

Thanks for explaining your use case.

Maybe I am being optimistic, but I would say that the most annoying part was getting something that handles multiple Jobs in a single Slurm job. I would expect that extending this to run multiple Jobs in parallel would boil down to replicating this part of the code from FireWorks: https://github.com/materialsproject/fireworks/blob/e9f2004d3d861e2de288f0023ecdb5ccd9114939/fireworks/features/multi_launcher.py, which I suppose could stay relatively similar.
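As a purely illustrative sketch (not jobflow-remote code), an rlaunch multi N-style mode could amount to spawning N processes that each run the same claim-and-execute loop, similar in spirit to what FireWorks' multi_launcher does. This reuses the hypothetical `batch_loop` from the sketch above; the atomic-rename claim there is what keeps the parallel copies from picking up the same Job.

```python
import multiprocessing as mp


def launch_multiprocess(num_parallel: int) -> None:
    """Run num_parallel copies of the sequential batch loop, one per process."""
    procs = [mp.Process(target=batch_loop) for _ in range(num_parallel)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()


# e.g. inside one large Slurm allocation:
# launch_multiprocess(5)   # at most 5 Jobs running concurrently
```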

If not having this feature is a major limitation in adopting jobflow-remote, I can prioritize its implementation. However, I would need someone to test it.

Andrew-S-Rosen (Collaborator, Author) commented Mar 1, 2024

I agree that it seems quite doable!

This isn't a blocker for me at the moment; the blocker for me is that I have no students yet. 😅 But it will eventually be needed by MP when they decide to switch their production campaign from FireWorks to jobflow-remote (which I imagine will not happen immediately). Tagging @esoteric-ephemera and @munrojm for context.

Regardless, I would be happy to test if/when the time comes.
