Remote_error : How to resolve it ? #62

Open
naik-aakash opened this issue Feb 5, 2024 · 5 comments

Comments

@naik-aakash

naik-aakash commented Feb 5, 2024

Hi @gpetretto ,

I had set up jobflow-remote in the User-Workstation configuration, and it seemed to run fine.

Since yesterday, I am getting this REMOTE_ERROR for any kind of Flow I submit, be it a simple jobflow Flow or any atomate2 workflow.

The error is related to jfremote_out.json not being found. Any ideas on how to debug this error? I cannot figure out what is going wrong here.

The remote HPC uses SLURM, and the job seems to get submitted and then fails.

[screenshot: REMOTE_ERROR details for the failed job]

@gpetretto
Contributor

Hi @naik-aakash,

This typically means that the job is not actually being executed. The cause may be that the environment is not properly activated or that jobflow-remote is not properly installed. The best way to identify the source of the error in this case is to check the queue.out/queue.err files in the scratch folder.
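
A minimal way to inspect those files by hand, as a sketch: the host name, the run directory path, and the job id below are all placeholders, and it assumes the remote run directory can be read off the output of jf job info for the failed job.

```bash
# Look up the failed job and note its remote run directory (job id is a placeholder)
jf job info 42

# Then read the queue files directly on the worker; host and path are placeholders
ssh my_hpc_host "cat /path/to/run_dir/queue.out /path/to/run_dir/queue.err"
```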

@ml-evs
Member

ml-evs commented Feb 5, 2024

> This typically means that the job is not actually being executed. The cause may be that the environment is not properly activated or that jobflow-remote is not properly installed. The best way to identify the source of the error in this case is to check the queue.out/queue.err files in the scratch folder.

I was thinking about adding the queue.out/err directly as the error message (when available) here @gpetretto, do you think that would be useful?

@gpetretto
Contributor

Are you suggesting always fetching the content of the files, storing it in the DB and showing it in jf job info? Or fetching it on the fly? For the latter there is also the jf job queue-out JOB_ID command, which will fetch them from the remote folder and print their content to screen. Maybe this could be an option for jf job info instead? I would avoid always fetching, as that would slow down the command.
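
For reference, a quick usage sketch of that command (the job id is a placeholder; use the identifier of the failed job):

```bash
# Fetch queue.out/queue.err from the remote folder and print them to screen
jf job queue-out 42
```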

@naik-aakash
Author

> Hi @naik-aakash,
>
> This typically means that the job is not actually being executed. The cause may be that the environment is not properly activated or that jobflow-remote is not properly installed. The best way to identify the source of the error in this case is to check the queue.out/queue.err files in the scratch folder.

So it seems that, since yesterday, conda activate env_name / source activate env_name does not work in the login shell (conda: command not found),

and the queue.out/queue.err files were not generated.

But thanks for the suggestion, I found the root cause.
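
In case it helps anyone hitting the same thing: a minimal sketch of making conda available in non-interactive SSH/login shells, assuming a bash shell and a Miniconda install under ~/miniconda3 (the install path and the environment name are assumptions; adjust for your cluster):

```bash
# Put this in ~/.bashrc on the remote machine, before any early exit for
# non-interactive shells, so that SSH sessions opened by jobflow-remote
# can also find and activate conda.
source "$HOME/miniconda3/etc/profile.d/conda.sh"   # assumed install location
conda activate my_jfr_env                          # hypothetical environment name
```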

@ml-evs
Member

ml-evs commented Feb 5, 2024

> Are you suggesting always fetching the content of the files, storing it in the DB and showing it in jf job info? Or fetching it on the fly? For the latter there is also the jf job queue-out JOB_ID command, which will fetch them from the remote folder and print their content to screen. Maybe this could be an option for jf job info instead? I would avoid always fetching, as that would slow down the command.

I think defaulting to showing the jf job queue-out JOB_ID output when asking for -err would be helpful, yeah (so not on-the-fly).
