
CalledProcessError: 9 #854

Open
raitis-b opened this issue May 17, 2024 · 4 comments · May be fixed by #943
Labels
bug Something isn't working

Comments

@raitis-b

Hi,

I tried to run the openFE tutorial on my laptop and everything worked just fine, but when I tried to run it on our cluster I ran into an issue. On a GPU node it gave an error that has been mentioned before (GPU in 'Exclusive_Process' mode (or Prohibited), one context is allowed per device. This may prevent some openmmtools features from working. GPU must be in 'Default' compute mode). While we fix that issue, I wanted to run it without the GPU, but this led to another error:

$ openfe quickrun transformations/easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent.json -o results/easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent_login_node.json -d results/easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent_login_node

Loading file...
Planning simulations for this edge...
Starting the simulations for this edge...
Done with all simulations! Analyzing the results....
Here is the result:
dG = None ± None

Error: The protocol unit 'lig_ejm_31 to lig_ejm_46 repeat 2 generation 0' failed with the error message:
CalledProcessError: 9

Details provided in output.

The only output is the .json file that is attached.

Cheers,
Raitis

easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent_no_gpu.json

@mikemhenry mikemhenry added the bug Something isn't working label May 17, 2024
@mikemhenry
Contributor

@raitis-b

Thank you for the bug report! Looking at the json file and cleaning it up a bit (I just used Firefox to view it; it does a decent job rendering these json files), it looks like

Traceback (most recent call last):
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/gufe/protocols/protocolunit.py", line 320, in execute
    outputs = self._execute(context, **inputs)
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/protocols/openmm_rfe/equil_rfe_methods.py", line 1127, in _execute
    log_system_probe(logging.INFO, paths=[ctx.scratch])
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/utils/system_probe.py", line 502, in log_system_probe
    sysinfo = _probe_system(pl_paths)['system information']
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/utils/system_probe.py", line 471, in _probe_system
    gpu_info = _get_gpu_info()
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/utils/system_probe.py", line 340, in _get_gpu_info
    nvidia_smi_output = subprocess.check_output(
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=gpu_uuid,gpu_name,compute_mode,pstate,temperature.gpu,utilization.memory,memory.total,driver_version,', '--format=csv']' returned non-zero exit status 9.

the nvidia-smi command failed. Could you run nvidia-smi on the same machine/node where you ran the simulation and report back what it does? Code 9 is SIGKILL, so I think that command got killed by some other process.

Regardless, we want to make sure this command doesn't prevent a simulation from running, so we need to enhance our error handling of it.
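For reference, a minimal way to reproduce the failing probe outside of openfe is to run the same nvidia-smi invocation shown in the traceback directly from Python; on a node where the driver is not reachable, subprocess.check_output raises CalledProcessError just as above (this snippet is only an illustration, not openfe code):

# Minimal reproduction of the probe call from the traceback above
# (illustration only, not openfe code).
import subprocess

cmd = [
    "nvidia-smi",
    "--query-gpu=gpu_uuid,gpu_name,compute_mode,pstate,temperature.gpu,"
    "utilization.memory,memory.total,driver_version,",
    "--format=csv",
]
try:
    print(subprocess.check_output(cmd, text=True))
except FileNotFoundError:
    print("nvidia-smi is not installed on this node")
except subprocess.CalledProcessError as err:
    # This is the failure mode reported in this issue, e.g. on a CPU-only
    # node where the driver cannot be reached.
    print(f"nvidia-smi exited with status {err.returncode}")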

@raitis-b
Author

When I am not asking for the GPU in the queuing script and want to run only on the CPU, the nvidia-smi output is:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

@PabloNA97

PabloNA97 commented Sep 27, 2024

Exact same issue here, thanks @raitis-b and @mikemhenry

@mikemhenry
Contributor

mikemhenry commented Sep 27, 2024

I would much rather fix this by expanding the errors we catch here:

except FileNotFoundError:

We could also just use a bare except, log the raised error as a warning that the GPU check failed, and continue on.
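A rough sketch of what that could look like, assuming _get_gpu_info() wraps the nvidia-smi call from the traceback (the function body below is hypothetical, not the actual system_probe.py code):

# Hypothetical sketch of the proposed fix: catch CalledProcessError in
# addition to FileNotFoundError so a failing GPU probe only warns instead
# of failing the whole protocol unit.
import logging
import subprocess

def _get_gpu_info() -> dict:
    cmd = ["nvidia-smi", "--query-gpu=gpu_uuid,gpu_name,compute_mode", "--format=csv"]
    try:
        nvidia_smi_output = subprocess.check_output(cmd, text=True)
    except (FileNotFoundError, subprocess.CalledProcessError) as error:
        logging.warning("GPU check failed, continuing without GPU info: %s", error)
        return {}
    # ... parse nvidia_smi_output into per-GPU fields here ...
    return {"nvidia-smi": nvidia_smi_output}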
