Does jobflow-remote support the pilot job model? #86

Closed · Fixed by #172

Andrew-S-Rosen opened this issue Feb 29, 2024 · 5 comments

Comments

Andrew-S-Rosen (Collaborator) commented Feb 29, 2024

I haven't used jobflow-remote yet (my apologies), but I have a question I wanted to ask.

The conventional way people used FireWorks was to have each @job be a Slurm job. This can be incredibly inefficient in practice due to queuing policies on the HPC cluster. Lately, many workflow tools built for HPC have adopted a "pilot job" model. In this framework, the daemon will request a Slurm allocation that will run many @jobs concurrently, continually pulling in new @jobs as old ones finish (until the walltime is reached). For example, if you are running VASP with 4 nodes per @job, you might have the daemon request one or more 20-node Slurm jobs, each of which would run at most 5 concurrent VASP @jobs (and when one finishes, a new one is fetched from the database). In Materials Project land, we have called this "jobpacking" in FireWorks, although it was always very hacky.
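In rough pseudocode, the pilot loop inside a single Slurm allocation would look something like the sketch below. This is purely illustrative: the `fetch_ready_job` and `run_job` helpers are hypothetical placeholders, not any actual jobflow-remote or FireWorks API.

```python
import time
from concurrent.futures import ProcessPoolExecutor

MAX_CONCURRENT = 5       # e.g. a 20-node allocation running 4-node VASP @jobs
WALLTIME_S = 48 * 3600   # stop pulling new work once the walltime is near


def fetch_ready_job():
    """Placeholder: pull the next READY @job from the database, or return None."""
    ...


def run_job(job):
    """Placeholder: launch the @job (e.g. a 4-node VASP run) and wait for it."""
    ...


def pilot_loop():
    """Keep MAX_CONCURRENT @jobs running, backfilling slots as @jobs finish."""
    start = time.monotonic()
    running = set()
    with ProcessPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        while time.monotonic() - start < WALLTIME_S:
            # drop finished futures, then refill the free slots with new @jobs
            running = {f for f in running if not f.done()}
            while len(running) < MAX_CONCURRENT:
                job = fetch_ready_job()
                if job is None:
                    break
                running.add(pool.submit(run_job, job))
            time.sleep(30)
```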

Does jobflow-remote support such a model for workflow execution? I saw @JaGeo's issue about tens of thousands of calculations, so I figure it might already be supported, since no HPC center would be happy with you launching 24,000 Slurm jobs, but I wanted to inquire.

JaGeo (Collaborator) commented Feb 29, 2024

@Andrew-S-Rosen, our HPC allows a lot of jobs, just not in parallel. (🙈)

gpetretto (Contributor) commented

Hi @Andrew-S-Rosen,

If I understand correctly, you are looking for the equivalent of mlaunch (or rlaunch multi) in FireWorks. At the moment I have implemented a version that works more or less like rlaunch multi 1, meaning it can run multiple jobflow Jobs in a single Slurm job, but only sequentially.
I have often used this functionality with FireWorks in the past, though I was always a bit skeptical of the advantages of rlaunch multi N with N > 1. But I suppose that depends on the limitations imposed by the computing center.

I called the equivalent of rlaunch multi 1 a "batch" job submission (open to renaming it if that does not convey the meaning). It is already implemented in the current version; I have just written the documentation and will push it in the coming days. I admit it is quite hackish, since I wanted to preserve the basic principle of jobflow-remote of not connecting to the DB from the worker, while keeping the advantage of FireWorks of not having to decide beforehand which Jobs will be executed inside a single Slurm job.
Also, I have only tried it on simple examples. I am still looking for testers, if anyone is interested. 😄

I would say that if this implementation proves to be stable, it can probably be extended to mimic rlaunch multi N.
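To make the idea concrete, here is a minimal, purely illustrative sketch of how a batch worker could claim and run Jobs sequentially from a shared staging directory without ever connecting to the DB. The directory layout and the `run_jobflow_job` helper are hypothetical assumptions for the sketch, not jobflow-remote's actual implementation.

```python
import shutil
import time
from pathlib import Path

# Hypothetical directories where the runner stages serialized Jobs for this worker.
HANDLE_DIR = Path("/scratch/jfr_batch/ready")
RUNNING_DIR = Path("/scratch/jfr_batch/running")
DONE_DIR = Path("/scratch/jfr_batch/done")
WALLTIME_S = 24 * 3600


def run_jobflow_job(job_dir: Path) -> None:
    """Placeholder: execute the serialized Job found in job_dir."""
    ...


def batch_loop() -> None:
    """Sequentially claim and run staged Jobs until none remain or the walltime nears."""
    for d in (HANDLE_DIR, RUNNING_DIR, DONE_DIR):
        d.mkdir(parents=True, exist_ok=True)
    start = time.monotonic()
    while time.monotonic() - start < WALLTIME_S:
        staged = sorted(HANDLE_DIR.glob("job_*"))
        if not staged:
            break  # nothing left to run inside this batch Slurm job
        job_dir = staged[0]
        claimed = RUNNING_DIR / job_dir.name
        try:
            job_dir.rename(claimed)  # atomic rename acts as a simple claim/lock
        except OSError:
            continue  # another worker process claimed it first
        run_jobflow_job(claimed)
        shutil.move(str(claimed), str(DONE_DIR / claimed.name))
```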

Andrew-S-Rosen (Collaborator, Author) commented

Thanks for the reply, @gpetretto!

It sounds like it may be similar to mlaunch. On MP, we would request (say) a 1024-node Slurm job and run (for instance) 4-node VASP calculations within that allocation, continually bringing in new jobs until the 48 hr walltime was hit. This is important because NERSC policies only allow two Slurm jobs to age at a time, plus there are discounts for large jobs.

In any case, it sounds like you have put together the necessary backbone for a feature at least somewhat similar to this. Good to know. Thanks for the update!

gpetretto (Contributor) commented

Thanks for explaining your use case.

Maybe I am being optimistic, but I would say that the most annoying part was getting something that handles multiple Jobs in a single Slurm job. I would expect that extending this to run multiple Jobs in parallel would boil down to replicating this part of the code from FireWorks: https://github.com/materialsproject/fireworks/blob/e9f2004d3d861e2de288f0023ecdb5ccd9114939/fireworks/features/multi_launcher.py, which I suppose could stay relatively similar.
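As a purely illustrative sketch (not jobflow-remote code), an rlaunch multi N-style mode could amount to spawning N processes that each run the same claim-and-execute loop, similar in spirit to what FireWorks' multi_launcher does. This reuses the hypothetical `batch_loop` from the sketch above; the atomic-rename claim there is what keeps the parallel copies from picking up the same Job.

```python
import multiprocessing as mp


def launch_multiprocess(num_parallel: int) -> None:
    """Run num_parallel copies of the sequential batch loop, one per process."""
    procs = [mp.Process(target=batch_loop) for _ in range(num_parallel)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()


# e.g. inside one large Slurm allocation:
# launch_multiprocess(5)   # at most 5 Jobs running concurrently
```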

If not having this feature is a major limitation in adopting jobflow-remote, I can prioritize its implementation. However, I would need someone to test it.

Andrew-S-Rosen (Collaborator, Author) commented Mar 1, 2024

I agree that it seems quite doable!

This isn't a blocker for me at the moment; the blocker for me is that I have no students yet. 😅 But it will eventually be needed by MP when they decide to switch their production campaign from FireWorks to jobflow-remote (which I imagine will not happen immediately). Tagging @esoteric-ephemera and @munrojm for context.

Regardless, I would be happy to test if/when the time comes.
