
Using files from other jobs #94

Open · ml-evs opened this issue Mar 15, 2024 · 1 comment

ml-evs (Member) commented Mar 15, 2024

It is not uncommon that workflows (in atomate2 or otherwise) need to make use of files created by other jobs, without the data from those files being explicitly stored/serialized. In cases where the file does not map to a pymatgen object (e.g., an arbitrary force field model that has been trained), this is especially awkward.

I would like to handle this in a framework-agnostic way by allowing the user to indicate that a given job depends on the context of another job. At the moment the workflow creator has to explicitly write a file to an external directory and then access it via an absolute path in a later job, but with jobflow-remote those directories may not even be on the same machine (and it is good hygiene for a jobflow-remote job to write only to its own work_dir).

I propose a system where a job can be started directly from a copy of the directory of another job.

For example,

import numpy as np

# `@job` is assumed to be the (proposed) extended decorator that understands
# the `context` keyword; `Model` stands in for any trainable model class.


@job
def train_job(X, model_path="model.pth"):
    # Train a model and dump its weights to a file in this job's workdir
    model = Model()
    model.train(X)
    model.dump(model_path)


@job
def inference_job(x, model_path="model.pth"):
    # Load the weights written by the parent job (copied in via `context`)
    model = Model.load(model_path)
    return model.evaluate(x)


train = np.loadtxt("X.csv")
test = np.loadtxt("x.csv")

parent = train_job(train)
child = inference_job(test, context=parent)

jobs = [parent, child]

This would trigger jobflow-remote to copy the entire contents of parent's workdir into the new workdir of child before submitting the job. When the two jobs are executed on the same worker, this transfer is done directly on the remote; otherwise the copy needs two stages, from remote to runner and then from runner to remote. Of course this creates some duplication of data, but it also allows child to potentially (in this case) refit the model and save it again, and I think this flexibility might be useful in many cases.
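To make the intended behaviour concrete, here is a minimal sketch of the staging step for the simple case where parent and child share a filesystem; stage_context is a hypothetical helper of mine, not existing jobflow-remote API, and the cross-worker case would replace the plain copy with the two-stage transfer described above.

import shutil
from pathlib import Path


def stage_context(parent_workdir: Path, child_workdir: Path) -> None:
    # Copy everything the parent produced into the child's freshly created
    # workdir before the child job is submitted. On a shared worker this can
    # be a plain filesystem copy; across workers jobflow-remote would instead
    # need the two-stage remote -> runner -> remote transfer.
    shutil.copytree(parent_workdir, child_workdir, dirs_exist_ok=True)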

This could then be specialized in the future to explicitly enumerate the files that are required by the child job, e.g.,

child = inference_job(test, context=(parent, ["model.pth"]))

with additional checks that model.pth was successfully created by parent (and not deleted on cleanup) during its execution.
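A sketch of what that check could look like (the helper name and signature are mine, not an existing API):

from pathlib import Path


def check_context_files(parent_workdir: Path, required: list[str]) -> None:
    # Hypothetical post-run check: verify that every file listed in the
    # child's `context` was actually produced by the parent, before any
    # cleanup of the parent's workdir happens.
    missing = [name for name in required if not (parent_workdir / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"Files required by a child job were not produced by the parent: {missing}"
        )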

I think this would need to be implemented as an extension of the @job decorator, which hopefully we could then add upstream to jobflow. It could then be used directly in atomate2 workflows wherever this kind of file sharing is required. Many cases could benefit from this approach, for example natively restarting DFT codes from checkpoints (I actually don't know how this is handled at the moment).
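As a rough illustration of the decorator extension, here is a sketch only: it assumes the `context` reference is recorded in the job's metadata (as the parent's uuid) so the manager can act on it; the actual mechanism is an open design question.

from jobflow import job as jobflow_job


def job(method=None, **job_kwargs):
    # Hypothetical drop-in replacement for jobflow's @job: it accepts a
    # `context` keyword when the job is instantiated and records the parent's
    # uuid so that the manager (e.g. jobflow-remote) knows which workdir to
    # copy before running this job.
    def decorator(func):
        wrapped = jobflow_job(func, **job_kwargs)

        def make_job(*args, context=None, **kwargs):
            j = wrapped(*args, **kwargs)
            if context is not None:
                j.metadata = {**(j.metadata or {}), "context_uuid": context.uuid}
            return j

        return make_job

    return decorator(method) if method is not None else decorator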

An alternative approach would be to extend the jobflow Response object to have a special section for files, with an associated store, but I am not sure maggma is flexible enough for this (and not sure if it would be in scope for them, so it would require significant new implementation in jobflow itself).
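For comparison, a very rough sketch of what such an extended Response could carry; this is not the real jobflow Response, and the field names and the pairing of each file with a target store are my assumptions.

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ResponseWithFiles:
    # Hypothetical extension of jobflow's Response: alongside the usual
    # output, list the files the manager should push to an associated file
    # store, as (filename in the workdir, name of the target store) pairs.
    output: Any = None
    files: list[tuple[str, str]] = field(default_factory=list)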

I would be interested to hear people's thoughts! I will play around with an implementation of this and can try to make a simple demo.

gpetretto (Contributor) commented

Thanks for raising this issue specifically for jobflow-remote. I do believe that file handling is a big limitation in jobflow and all related packages, and I agree that we need to find a solution.

The main issue with the proposed solution is that, among the currently available managers for jobflow, jobflow-remote would be the only one able to handle the context argument. jobflow-remote has access to the information about the worker where the previous job was executed, while jobs executed with Fireworks have no obvious way of knowing about and connecting to the other workers. Please correct me if I am misunderstanding how this would work.
It is probably not common to run two Fireworks on different machines, but it should be possible. Given that, will this be accepted into the jobflow repository and actively used in packages like atomate2?
My view of jobflow-remote is that it should remain an addition to jobflow. So I would consider it reasonable to temporarily have a jobflow-remote version of the @job decorator to test it, but only if it is already agreed that it will be moved to jobflow as soon as it is ready. This will thus require approval from @utf. I then wonder whether it would not be easier to implement it directly in jobflow.

The concept of the implementation is nice. I don't think I have much to add about it, but I will keep thinking.

A set of notes in random order:

I will reply specifically to the last part about changing the Response to pass files. We have been advocating for the implementation of a dedicated Store for files, and I believe such a Store would be crucial for dealing with large outputs (see e.g. materialsproject/atomate2#515 (comment)); a rough sketch of what its interface could look like follows these notes. Even though they could probably be used, maggma stores are not really suitable for this purpose. I am not sure whether such a store would fit better in maggma or directly in jobflow. I need to open an issue on maggma to discuss this.
That said, I think there is only a partial overlap between the general case of copying files from one job to another and storing output files in a Store. For example, if the wavefunctions of a DFT calculation need to be passed to another job, storing several gigabytes of data in a store is likely not a good solution. So I think it would be better to implement both options.
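Returning to the file Store mentioned in the first note: for illustration, here is a minimal sketch of the kind of interface such a store might expose (entirely hypothetical, not maggma or jobflow API).

from abc import ABC, abstractmethod
from pathlib import Path


class FileStore(ABC):
    # Hypothetical file store: files are addressed by (job uuid, filename)
    # and streamed to/from a backend (filesystem, S3, ...) rather than being
    # serialized as documents, so multi-gigabyte outputs such as wavefunctions
    # never pass through the JSON/BSON layer of a regular maggma store.

    @abstractmethod
    def put(self, src: Path, job_uuid: str, name: str) -> str:
        """Upload a file and return a reference that can be stored in outputs."""

    @abstractmethod
    def get(self, reference: str, dest: Path) -> Path:
        """Retrieve a previously stored file into `dest` and return its path."""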
