
Using files from other jobs #94

Open · ml-evs opened this issue Mar 15, 2024 · 1 comment

ml-evs (Member) commented Mar 15, 2024

It is not uncommon that workflows (in atomate2 or otherwise) need to make use of files created by other jobs, without the data from those files being explicitly stored/serialized. In cases where the file does not map to a pymatgen object (e.g., an arbitrary force field model that has been trained), this is especially awkward.

I would like to handle this in a framework-agnostic way by allowing the user to indicate that a given job depends on the context of another job. At the moment the workflow creator has to explicitly write a file to an external directory and then access it via an absolute path in a later job, but with jobflow-remote those directories may not even be on the same machine (and it is good hygiene for a jobflow-remote job to write only to its own work_dir).

I propose a system where a job can be started directly from a copy of the directory of another job.

For example,

import numpy as np

# `@job` is assumed to be the (proposed) extended decorator that understands
# the `context` keyword; `Model` stands in for any trainable model class.


@job
def train_job(X, model_path="model.pth"):
    # Train a model and dump its weights to a file in this job's workdir
    model = Model()
    model.train(X)
    model.dump(model_path)


@job
def inference_job(x, model_path="model.pth"):
    # Load the weights written by the parent job (copied in via `context`)
    model = Model.load(model_path)
    return model.evaluate(x)


train = np.loadtxt("X.csv")
test = np.loadtxt("x.csv")

parent = train_job(train)
child = inference_job(test, context=parent)

jobs = [parent, child]

This would trigger jobflow-remote to copy the entire contents of parent's workdir into the new workdir of child before submitting the job. When the two jobs are executed on the same worker, this transfer is done directly on the remote; otherwise the copy needs two stages, from remote to runner and then from runner to remote. Of course this creates some duplication of data, but it also allows child to potentially (in this case) refit the model and save it again, and I think this flexibility might be useful in many cases.
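To make the intended behaviour concrete, here is a minimal sketch of the staging step for the simple case where parent and child share a filesystem; stage_context is a hypothetical helper of mine, not existing jobflow-remote API, and the cross-worker case would replace the plain copy with the two-stage transfer described above.

import shutil
from pathlib import Path


def stage_context(parent_workdir: Path, child_workdir: Path) -> None:
    # Copy everything the parent produced into the child's freshly created
    # workdir before the child job is submitted. On a shared worker this can
    # be a plain filesystem copy; across workers jobflow-remote would instead
    # need the two-stage remote -> runner -> remote transfer.
    shutil.copytree(parent_workdir, child_workdir, dirs_exist_ok=True)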

This could then be specialized in the future to explicitly enumerate the files that are required by the child job, e.g.,

child = inference_job(test, context=(parent, ["model.pth"]))

with additional checks that model.pth was successfully created by parent (and not deleted on cleanup) during its execution.
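A sketch of what that check could look like (the helper name and signature are mine, not an existing API):

from pathlib import Path


def check_context_files(parent_workdir: Path, required: list[str]) -> None:
    # Hypothetical post-run check: verify that every file listed in the
    # child's `context` was actually produced by the parent, before any
    # cleanup of the parent's workdir happens.
    missing = [name for name in required if not (parent_workdir / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"Files required by a child job were not produced by the parent: {missing}"
        )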

I think this would need to be implemented as an extension of the @job decorator, which hopefully we could then add upstream to jobflow. It could then be used directly in atomate2 workflows wherever this kind of file sharing is required. Many cases could benefit from this approach, for example natively restarting DFT codes from checkpoints (I actually don't know how this is handled at the moment).
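As a rough illustration of the decorator extension, here is a sketch only: it assumes the `context` reference is recorded in the job's metadata (as the parent's uuid) so the manager can act on it; the actual mechanism is an open design question.

from jobflow import job as jobflow_job


def job(method=None, **job_kwargs):
    # Hypothetical drop-in replacement for jobflow's @job: it accepts a
    # `context` keyword when the job is instantiated and records the parent's
    # uuid so that the manager (e.g. jobflow-remote) knows which workdir to
    # copy before running this job.
    def decorator(func):
        wrapped = jobflow_job(func, **job_kwargs)

        def make_job(*args, context=None, **kwargs):
            j = wrapped(*args, **kwargs)
            if context is not None:
                j.metadata = {**(j.metadata or {}), "context_uuid": context.uuid}
            return j

        return make_job

    return decorator(method) if method is not None else decorator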

An alternative approach would be to extend the jobflow Response object to have a special section for files, with an associated store, but I am not sure maggma is flexible enough for this (and not sure if it would be in scope for them, so it would require significant new implementation in jobflow itself).
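For comparison, a very rough sketch of what such an extended Response could carry; this is not the real jobflow Response, and the field names and the pairing of each file with a target store are my assumptions.

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ResponseWithFiles:
    # Hypothetical extension of jobflow's Response: alongside the usual
    # output, list the files the manager should push to an associated file
    # store, as (filename in the workdir, name of the target store) pairs.
    output: Any = None
    files: list[tuple[str, str]] = field(default_factory=list)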

I would be interested to hear people's thoughts! I will play around with an implementation of this and can try to make a simple demo.

gpetretto (Contributor) commented

Thanks for raising this issue specifically for jobflow-remote. I do believe that file handling is a big limitation in jobflow and all related packages, and I agree that we need to find a solution.

The main issue with the proposed solution is that, among the currently available managers for jobflow, jobflow-remote would be the only one able to handle the context argument. jobflow-remote has access to the information about the worker where the previous job was executed, while jobs executed with Fireworks have no obvious way of knowing about and connecting to the other workers. Please correct me if I am misunderstanding how this would work.
It is probably not common to run two Fireworks on different machines, but it should be possible. Given that, will this be accepted into the jobflow repository and actively used in packages like atomate2?
My view of jobflow-remote is that it should remain an addition to jobflow. So I would consider it reasonable to temporarily have a jobflow-remote version of the @job decorator to test it, but only if it is already agreed that it will be moved to jobflow as soon as it is ready. This will thus require approval from @utf. I then wonder whether it would not be easier to implement it directly in jobflow.

The concept of the implementation is nice. I don't think I have much to add about it, but I will keep thinking.

A set of notes in random order:

I will reply specifically to the last part about changing the Response to pass files. We have been advocating for the implementation of a dedicated Store for files, and I believe such a Store would be crucial for dealing with large outputs (see e.g. materialsproject/atomate2#515 (comment)); a rough sketch of what its interface could look like follows these notes. Even though they could probably be used, maggma stores are not really suitable for this purpose. I am not sure whether such a store would fit better in maggma or directly in jobflow. I need to open an issue on maggma to discuss this.
That said, I think there is only a partial overlap between the general case of copying files from one job to another and storing output files in a Store. For example, if the wavefunctions of a DFT calculation need to be passed to another job, storing several gigabytes of data in a store is likely not a good solution. So I think it would be better to implement both options.
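Returning to the file Store mentioned in the first note: for illustration, here is a minimal sketch of the kind of interface such a store might expose (entirely hypothetical, not maggma or jobflow API).

from abc import ABC, abstractmethod
from pathlib import Path


class FileStore(ABC):
    # Hypothetical file store: files are addressed by (job uuid, filename)
    # and streamed to/from a backend (filesystem, S3, ...) rather than being
    # serialized as documents, so multi-gigabyte outputs such as wavefunctions
    # never pass through the JSON/BSON layer of a regular maggma store.

    @abstractmethod
    def put(self, src: Path, job_uuid: str, name: str) -> str:
        """Upload a file and return a reference that can be stored in outputs."""

    @abstractmethod
    def get(self, reference: str, dest: Path) -> Path:
        """Retrieve a previously stored file into `dest` and return its path."""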
