
Valid model.pt for ckpt_path -- Is it an open-source model? #100

Open · 1 of 2 tasks
uni-manjunath-ke opened this issue Jun 6, 2024 · 21 comments

@uni-manjunath-ke
System Info

I am trying to run this: bash decode_wavlm_large_linear_vicuna_7b.sh

However, I am not sure what should be given for ckpt_path, since I do not have a model.pt. Where do I get this file? Is it an open-source model available on Hugging Face or elsewhere? Please let me know. Currently the run fails with the error below. Thanks @byrTony-Frankzyq

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/efs/manju/if/repos/prompt/slam/output/vicuna-7b-v1.5-librispeech-linear-steplrwarmupkeep1e-4-wavlm-large-20240426/asr_epoch_1_step_1000/model.pt'
Thanks

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

bash decode_wavlm_large_linear_vicuna_7b.sh

Error logs

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/efs/manju/if/repos/prompt/slam/output/vicuna-7b-v1.5-librispeech-linear-steplrwarmupkeep1e-4-wavlm-large-20240426/asr_epoch_1_step_1000/model.pt'

Expected behavior

The script is expected to produce decoding output.


@ddlBoJack (Collaborator)

You can find the projector (model.pt) checkpoint on the README page.
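For anyone hitting the same FileNotFoundError, a minimal sketch of wiring up the downloaded projector checkpoint, assuming the model.pt linked from the README has been fetched; the local directory layout below is only an example, and ++ckpt_path matches the Hydra override the decode script already passes (see the override list in the traceback further down):

```bash
# A minimal sketch, assuming the projector checkpoint from the README has
# already been downloaded; the local paths below are examples, not required.
mkdir -p ./ckpts/asr_epoch_1_step_1000
cp /path/to/downloaded/model.pt ./ckpts/asr_epoch_1_step_1000/model.pt

# Then point the decode script at it, e.g. so that it forwards
#   ++ckpt_path=./ckpts/asr_epoch_1_step_1000/model.pt
# to inference_asr_batch.py, and run it as before.
bash decode_wavlm_large_linear_vicuna_7b.sh
```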

@uni-manjunath-ke (Author)

Thanks @ddlBoJack, will try that.

@uni-manjunath-ke (Author) commented Jun 6, 2024

Thanks @ddlBoJack, this is solved. But I am facing another issue. I am running this on a 4-GPU machine. I got a CUDA OOM when using only a single GPU (via CUDA_VISIBLE_DEVICES="0"). I then tried using all 4 GPUs by setting CUDA_VISIBLE_DEVICES="0,1,2,3", but I still get the same CUDA OOM.

I monitored nvidia-smi in parallel and found that the script uses only a single GPU, in spite of setting CUDA_VISIBLE_DEVICES="0,1,2,3". Am I missing anything here? Please suggest. Thanks. (A note on the single-GPU behaviour follows the stack trace below.)

Stack trace:
[2024-06-06 06:57:42][slam_model_asr.py][INFO] - loading other parts from: /mnt/efs/manju/if/repos/prompt/slam/output/vicuna-7b-v1.5-librispeech-linear-steplrwarmupkeep1e-4-wavlm-large-20240426/asr_epoch_1_step_1000/model.pt
[2024-06-06 06:57:42][slam_llm.utils.train_utils][INFO] - --> Model asr
[2024-06-06 06:57:42][slam_llm.utils.train_utils][INFO] - --> asr has 18.880512 Million params

Error executing job with overrides: ['++model_config.llm_name=vicuna-7b-v1.5', '++model_config.llm_path=lmsys/vicuna-7b-v1.5', '++model_config.llm_dim=4096', '++model_config.encoder_name=wavlm', '++model_config.normalize=true', '++dataset_config.normalize=true', '++model_config.encoder_projector_ds_rate=5', '++model_config.encoder_path=/mnt/efs/manju/if/repos/prompt/slam/models/WavLM-Large.pt', '++model_config.encoder_dim=1024', '++model_config.encoder_projector=linear', '++dataset_config.dataset=speech_dataset', '++dataset_config.val_data_path=/mnt/efs/manju/if/repos/prompt/slam/data/librispeech_slam_test-clean_bidisha.jsonl', '++dataset_config.input_type=raw', '++dataset_config.inference_mode=true', '++train_config.model_name=asr', '++train_config.freeze_encoder=true', '++train_config.freeze_llm=true', '++train_config.batching_strategy=custom', '++train_config.num_epochs=1', '++train_config.val_batch_size=1', '++train_config.num_workers_dataloader=2', '++train_config.output_dir=/mnt/efs/manju/if/repos/prompt/slam/output/vicuna-7b-v1.5-librispeech-linear-steplrwarmupkeep1e-4-wavlm-large-20240426', '++decode_log=/mnt/efs/manju/if/repos/prompt/slam/output/vicuna-7b-v1.5-librispeech-linear-steplrwarmupkeep1e-4-wavlm-large-20240426/asr_epoch_1_step_1000/decode_librispeech_test_clean_beam4', '++ckpt_path=/mnt/efs/manju/if/repos/prompt/slam/output/vicuna-7b-v1.5-librispeech-linear-steplrwarmupkeep1e-4-wavlm-large-20240426/asr_epoch_1_step_1000/model.pt']
Traceback (most recent call last):
File "/mnt/efs/manju/if/repos/prompt/slam/SLAM-LLM/examples/asr_librispeech/inference_asr_batch.py", line 53, in
main_hydra()
File "/opt/conda/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in
lambda: hydra.run(
File "/opt/conda/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/opt/conda/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/opt/conda/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/mnt/efs/manju/if/repos/prompt/slam/SLAM-LLM/examples/asr_librispeech/inference_asr_batch.py", line 49, in main_hydra
inference(cfg)
File "/workspace/SLAM-LLM/src/slam_llm/pipeline/inference_batch.py", line 102, in main
model.to(device)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
return self._apply(convert)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
[Previous line repeated 3 more times]
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacty of 21.99 GiB of which 172.62 MiB is free. Process 37130 has 518.00 MiB memory in use. Process 64064 has 21.29 GiB memory in use. Of the allocated memory 21.03 GiB is allocated by PyTorch, and 19.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
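A note on the behaviour above, based only on the traceback: CUDA_VISIBLE_DEVICES controls which devices the process is allowed to see, but the inference pipeline calls model.to(device) on a single device, so additional GPUs are not used automatically; and the max_split_size_mb hint in the OOM message only mitigates fragmentation, not a model that simply does not fit on one card. A quick check (the 128 MiB value is just an example):

```bash
# Making four GPUs visible does not shard the model across them; the
# single-process inference script still places everything on one device.
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -c "import torch; print(torch.cuda.device_count())"   # prints 4, but only cuda:0 is used

# Optional, from the allocator hint in the OOM message above; this only
# reduces fragmentation, it does not shrink the model's footprint.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```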

@uni-manjunath-ke (Author)

Does this support multi-GPU inference? If not, how can I resolve this issue? Thanks @ddlBoJack @LauraGPT @chenxie95

@ddlBoJack (Collaborator)

Currently we use a single GPU for decoding. We plan to support multi-GPU decoding, and the script is on the way.

@uni-manjunath-ke (Author)

OK, thanks. Does this mean I should run on a GPU with more memory? Thanks

@ddlBoJack (Collaborator)

What GPU do you use? If GPU memory is limited, you can set the batch size to 1 and run inference in half precision.

@uni-manjunath-ke (Author)

We use an NVIDIA A10G GPU with 24 GB of memory. Please guide me on how to set these.

Currently the batch size is already set with ++train_config.val_batch_size=1.
Thanks
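A sketch of the overrides being discussed, assuming they are appended to the python call inside decode_wavlm_large_linear_vicuna_7b.sh in the same way as the existing ++train_config.* overrides; only the memory-related flags are shown, the flag names are taken from the overrides already quoted in this thread, and the paths are placeholders:

```bash
# Only memory-related overrides shown; keep the model/dataset/encoder
# overrides from the existing script. Paths below are placeholders.
python examples/asr_librispeech/inference_asr_batch.py \
    ++train_config.model_name=asr \
    ++train_config.val_batch_size=1 \
    ++train_config.use_fp16=true \
    ++ckpt_path=/path/to/asr_epoch_1_step_1000/model.pt
```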

@uni-manjunath-ke (Author) commented Jun 6, 2024

Hi @ddlBoJack,
I tried several combinations of the parameters below, but nothing works; everything gives a CUDA OOM error. What is the minimum GPU needed to run these experiments?
++train_config.use_fp16=true
++train_config.use_peft=true
++train_config.one_gpu=true
++train_config.batch_size_training=1

Also, please suggest if there is a better way to make this work on a GPU with 24 GB of memory. Thanks

@uni-manjunath-ke (Author)

Hi @ddlBoJack @byrTony-Frankzyq,
Could you please advise on this? Any recommendation on the preferred GPU to use? Thanks

@ddlBoJack (Collaborator)

The minimum GPU we have used is an A40 with 48 GB of memory.
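For context, a rough back-of-the-envelope estimate (the numbers are approximations): Vicuna-7B has about 7B parameters, so the LLM weights alone take roughly 7B × 4 bytes ≈ 28 GB in fp32 or 7B × 2 bytes ≈ 14 GB in fp16; WavLM-Large adds roughly 0.3B parameters, and beam-search decoding needs additional memory for activations and the KV cache. That is consistent with 24 GB cards running out of memory and a 48 GB A40 being the tested minimum.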

@manickavela29

It would be great if this were documented, either as a general requirement or as a minimum requirement per model combination (or the hardware the experiments were conducted on).

That would make for an easy start! 🙂

@uni-manjunath-ke (Author)

OK, thank you @ddlBoJack. When can we expect the multi-GPU scripts to be available? Any tentative ETA?
We have difficulty accessing the suggested high-end GPUs, so we would like to try a multi-GPU environment. Thanks

@uni-manjunath-ke (Author)

Any update on this? @ddlBoJack @byrTony-Frankzyq Thanks

@ddlBoJack (Collaborator)

Hi, we can hardly provide an ETA, since all our contributors work on the project part-time. However, we will try our best to fix existing bugs and deliver the features users have requested.

@jeeyung commented Jul 1, 2024

@uni-manjunath-ke I am using a GPU with the same specs. Could you let me know how you ran ./decode_wavlm_large_linear_vicuna_7b.sh successfully?

@uni-manjunath-ke (Author)

Hi,
I downloaded the relevant models (encoder, projector, LLM) linked from the SLAM-LLM repo, placed them in local folders, and updated those paths in ./decode_wavlm_large_linear_vicuna_7b.sh.

I built the slam_llm Docker image as described in the README and ran the script inside the container. It worked. Hope this is useful.

@jeeyung commented Jul 2, 2024

@uni-manjunath-ke thank you for the reply! What did you do differently when you got OOM? Did you get OOM when you didn't use Docker? I encounter the OOM issue even with ++train_config.use_fp16=true.

@uni-manjunath-ke (Author)

I used a single 40 GB GPU with Docker. With this, I didn't get OOM.

@Learneducn

I used four 46 GB GPUs. I still get an OOM error every time; how should I fix this?
