Report issue saving checkpoint #386

Open
acerdenno opened this issue Oct 1, 2024 · 4 comments

Comments

@acerdenno

When running CellBender under Slurm, two different errors appear:
1. While saving a checkpoint:
cellbender:remove-background: Could not save checkpoint
2. TypeError: cannot pickle 'weakref' object
Any clues on how to solve them? Thanks!

@JThomasWatson

JThomasWatson commented Oct 22, 2024

I'm encountering the same error as number 2 above. Below is the error message, in case it's helpful.

cellbender:remove-background: Saving a checkpoint...
cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
    torch.save(model_obj, filebase + '_model.torch')
  File "/projects/b1169/thomas/CellbenderEnv/env/Cellbender/lib/python3.9/site-packages/torch/serialization.py", line 652, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "/projects/b1169/thomas/CellbenderEnv/env/Cellbender/lib/python3.9/site-packages/torch/serialization.py", line 864, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'weakref' object

cellbender:remove-background: 2024-10-22 14:36:38
cellbender:remove-background: Inference procedure complete.
Traceback (most recent call last):
  File "/projects/b1169/thomas/CellbenderEnv/env/Cellbender/bin/cellbender", line 8, in <module>
    sys.exit(main())
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/base_cli.py", line 118, in main
    cli_dict[args.tool].run(args)
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/remove_background/cli.py", line 193, in run
    return main(args)
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/remove_background/cli.py", line 227, in main
    posterior = run_remove_background(args)
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/remove_background/run.py", line 123, in run_remove_background
    posterior = load_or_compute_posterior_and_save(
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/remove_background/posterior.py", line 59, in load_or_compute_posterior_and_save
    assert os.path.exists(args.input_checkpoint_tarball), \
AssertionError: Checkpoint file ckpt.tar.gz does not exist, presumably because saving of the checkpoint file has been manually interrupted. load_or_compute_posterior_and_save() will not work properly without an existing checkpoint file. Please re-run and allow a checkpoint file to be saved.

Could this be an issue with the torch version?
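
One way to narrow that down is to record the installed package versions in a working and a failing environment and compare them. The snippet below is only a sketch: the distribution names `cellbender`, `torch`, and `pyro-ppl` are assumed to be the names the packages were installed under, and `installed_versions` is a hypothetical helper, not part of CellBender.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a dict mapping each distribution name to its installed
    version string, or to 'not installed' if it cannot be found."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    # Assumed distribution names; adjust to match your environment.
    for name, ver in installed_versions(["cellbender", "torch", "pyro-ppl"]).items():
        print(f"{name}: {ver}")
```

Posting this output alongside the traceback makes it easier to see whether the failures line up with a particular torch or Python release.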

@mcsimenc

mcsimenc commented Nov 2, 2024

I ran cellbender for the first time, on CPU and not on a cluster, and got the same error. No output was produced, although the log ends with "Inference procedure complete.". The call and the log file output are below.

cellbender remove-background \
        --input raw_feature_bc_matrix.h5 \
        --output raw_feature_bc_matrix.nuclei.h5 \
        --cpu-threads 24 \
        >cb.out 2>cb.err
(base) [msimenc@KIWI outs]$ cat raw_feature_bc_matrix.nuclei.log 
cellbender:remove-background: Command:
cellbender remove-background --input raw_feature_bc_matrix.h5 --output raw_feature_bc_matrix.nuclei.h5 --cpu-threads 24
cellbender:remove-background: CellBender 0.3.0
cellbender:remove-background: (Workflow hash 8ebc86ffdb)
cellbender:remove-background: 2024-11-01 17:16:03
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from raw_feature_bc_matrix.h5
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Features in dataset: 30940 Gene Expression
cellbender:remove-background: Trimming features for inference.
cellbender:remove-background: 24319 features have nonzero counts.
cellbender:remove-background: Prior on counts for cells is 911
cellbender:remove-background: Prior on counts for empty droplets is 198
cellbender:remove-background: Excluding 1976 features that are estimated to have <= 0.1 background counts in cells.
cellbender:remove-background: Including 22343 features in the analysis.
cellbender:remove-background: Trimming barcodes for inference.
cellbender:remove-background: Excluding barcodes with counts below 99
cellbender:remove-background: Using 3155 probable cell barcodes, plus an additional 9078 barcodes, and 49577 empty droplets.
cellbender:remove-background: Largest surely-empty droplet has 343 UMI counts.
cellbender:remove-background: Attempting to unpack tarball "ckpt.tar.gz" to /tmp/tmphjs6xrze
cellbender:remove-background: No saved checkpoint.
cellbender:remove-background: No checkpoint loaded.
cellbender:remove-background: Running inference...
cellbender:remove-background: [epoch 001]  average training loss: 2895.4787
cellbender:remove-background: [epoch 002]  average training loss: 2773.4995  (100.7 seconds per epoch)
cellbender:remove-background: Will checkpoint every 5 epochs
cellbender:remove-background: [epoch 003]  average training loss: 2684.2793
cellbender:remove-background: [epoch 004]  average training loss: 2610.8373
cellbender:remove-background: [epoch 005]  average training loss: 2557.5633
cellbender:remove-background: [epoch 005] average test loss: 2566.8680
cellbender:remove-background: Saving a checkpoint...
cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
  File "/home/msimenc/software/miniforge3/envs/snake-cellranger/lib/python3.12/site-packages/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
    torch.save(model_obj, filebase + '_model.torch')
  File "/home/msimenc/software/miniforge3/envs/snake-cellranger/lib/python3.12/site-packages/torch/serialization.py", line 850, in save
    _save(
  File "/home/msimenc/software/miniforge3/envs/snake-cellranger/lib/python3.12/site-packages/torch/serialization.py", line 1088, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'weakref.ReferenceType' object
[... more epoch reports, each followed by the same error ...]
TypeError: cannot pickle 'weakref.ReferenceType' object

cellbender:remove-background: 2024-11-01 20:00:22
cellbender:remove-background: Inference procedure complete.

The /tmp dir is writable:

(base) [msimenc@KIWI outs]$ ls -l /
drwxrwxrwt.   16 root root    20480 Nov  1 22:13 tmp

I just installed cellbender using pip this afternoon. Any ideas?
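
For reference, the tracebacks above show `torch.save` failing inside `pickler.dump`. The sketch below reproduces the same `TypeError` with the standard `pickle` module alone, to illustrate that any object graph containing a weak reference fails this way. It does not use CellBender or torch and does not identify which object inside the saved model holds the weak reference; `Target` and `try_pickle` are hypothetical names introduced for the illustration.

```python
import pickle
import weakref
from typing import Optional


class Target:
    """Plain class whose instances can be weakly referenced."""


def try_pickle(obj) -> Optional[str]:
    """Attempt to pickle obj; return the error message on TypeError,
    or None if pickling succeeds."""
    try:
        pickle.dumps(obj)
    except TypeError as exc:
        return str(exc)
    return None


target = Target()

# A container that holds a weak reference somewhere inside it.
state_with_weakref = {"epoch": 5, "owner": weakref.ref(target)}
# The same container without the weak reference.
state_plain = {"epoch": 5}

print(try_pickle(state_with_weakref))  # a "cannot pickle ... weakref ..." message
print(try_pickle(state_plain))         # None
```

The exact wording of the message seems to depend on the Python version, which would be consistent with the Python 3.9 traceback reporting `'weakref'` and the Python 3.12 traceback reporting `'weakref.ReferenceType'`.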

@ezgiisenn

I've been running cellbender versions 0.3.0 and 0.3.2 on our LSF-based computing cluster without issues until recently. In the past month, however, I've started encountering the same error: TypeError: cannot pickle 'weakref.ReferenceType' object. Any suggestions for tackling the issue would be appreciated. Thank you in advance!

@GFrosi

GFrosi commented Nov 7, 2024

Hi,

I am getting the same error using cellbender 0.3.0. I installed it via pip (Python 3.11.5) on our HPC cluster. I have not run it on my own data yet; I am only using the example data from the GitHub repository, and the error already appears there.

Any updates about the issue? It would be super helpful.

AssertionError: Checkpoint file ckpt.tar.gz does not exist, presumably because saving of the checkpoint file has been manually interrupted. load_or_compute_posterior_and_save() will not work properly without an existing checkpoint file. Please re-run and allow a checkpoint file to be saved.

Thank you.
