Remote_error : How to resolve it ? #62

Open
naik-aakash opened this issue Feb 5, 2024 · 5 comments

Comments

@naik-aakash

naik-aakash commented Feb 5, 2024

Hi @gpetretto ,

I had set up jobflow-remote in the User-Workstation configuration, and it seemed to run fine.

Since yesterday, I am getting this REMOTE_ERROR for any kind of Flow I submit, be it a simple jobflow Flow or any atomate2 workflow.

The error is related to jfremote_out.json not being found. Any ideas on how to debug this error? I cannot figure out what is going wrong here.

The remote HPC uses SLURM, and the job seems to get submitted and then fails.

[screenshot: REMOTE_ERROR details for the failed job]

@gpetretto
Contributor

Hi @naik-aakash,

This typically means that the job is not actually being executed. The cause may be that the environment is not properly activated or that jobflow-remote is not properly installed. The best way to identify the source of the error in this case is to check the queue.out/queue.err files in the scratch folder.
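
A minimal way to inspect those files by hand, as a sketch: the host name, the run directory path, and the job id below are all placeholders, and it assumes the remote run directory can be read off the output of jf job info for the failed job.

```bash
# Look up the failed job and note its remote run directory (job id is a placeholder)
jf job info 42

# Then read the queue files directly on the worker; host and path are placeholders
ssh my_hpc_host "cat /path/to/run_dir/queue.out /path/to/run_dir/queue.err"
```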

@ml-evs
Member

ml-evs commented Feb 5, 2024

> This typically means that the job is not actually being executed. The cause may be that the environment is not properly activated or that jobflow-remote is not properly installed. The best way to identify the source of the error in this case is to check the queue.out/queue.err files in the scratch folder.

I was thinking about adding the queue.out/err directly as the error message (when available) here @gpetretto, do you think that would be useful?

@gpetretto
Contributor

Are you suggesting always fetching the content of the files, storing it in the DB and showing it in jf job info? Or fetching it on the fly? For the latter there is also the jf job queue-out JOB_ID command, which will fetch them from the remote folder and print their content to screen. Maybe this could be an option for jf job info instead? I would avoid always fetching, as that would slow down the command.
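
For reference, a quick usage sketch of that command (the job id is a placeholder; use the identifier of the failed job):

```bash
# Fetch queue.out/queue.err from the remote folder and print them to screen
jf job queue-out 42
```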

@naik-aakash
Author

> Hi @naik-aakash,
>
> This typically means that the job is not actually being executed. The cause may be that the environment is not properly activated or that jobflow-remote is not properly installed. The best way to identify the source of the error in this case is to check the queue.out/queue.err files in the scratch folder.

So it seems that, since yesterday, conda activate env_name / source activate env_name does not work in the login shell (conda: command not found),

and the queue.out/queue.err files were not generated.

But thanks for the suggestion, I found the root cause.
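
In case it helps anyone hitting the same thing: a minimal sketch of making conda available in non-interactive SSH/login shells, assuming a bash shell and a Miniconda install under ~/miniconda3 (the install path and the environment name are assumptions; adjust for your cluster):

```bash
# Put this in ~/.bashrc on the remote machine, before any early exit for
# non-interactive shells, so that SSH sessions opened by jobflow-remote
# can also find and activate conda.
source "$HOME/miniconda3/etc/profile.d/conda.sh"   # assumed install location
conda activate my_jfr_env                          # hypothetical environment name
```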

@ml-evs
Member

ml-evs commented Feb 5, 2024

> Are you suggesting always fetching the content of the files, storing it in the DB and showing it in jf job info? Or fetching it on the fly? For the latter there is also the jf job queue-out JOB_ID command, which will fetch them from the remote folder and print their content to screen. Maybe this could be an option for jf job info instead? I would avoid always fetching, as that would slow down the command.

I think defaulting to showing the jf job queue-out JOB_ID output when asking for -err would be helpful, yeah (so not on-the-fly).
