RuntimeError During Merging Process Possibly Due to Shared Memory Tensors #41
Comments
This also happens with Qwen/Qwen2.5-0.5B
Hi @SolshineCode, thanks for trying our work. I think the problem is with the modeling files. As you can see, we support Mistral and Llama 3 at the moment, but adding support for other models is straightforward. @thomasgauthier, could you please take a closer look at this?
@shamanez I really appreciate your quick reply. This is an awesome program and I'm excited to use it further. I've replicated the error with the meta-llama/Llama-3.2-1B-Instruct model, so I believe this issue also occurs with the Llama architecture. Error log snippet:
And the saved merged_model folder again only contains the two config JSON files.
Can you please try this command?
**Also, I wanted this merge operation to happen on CPUs. I think there's a little bug, but can you please remove the "cuda" call as well? This operation doesn't need a GPU.** Adding to this, the logic behind adding a new model is here: https://github.com/arcee-ai/DAM/blob/main/dam/merge.py#L21
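For reference, a minimal sketch of the CPU-only loading being suggested here, using the standard transformers API rather than the actual code in merge.py (the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

# Models load onto CPU by default, so simply skipping any .cuda() / .to("cuda")
# calls is enough; the merge is arithmetic over weights and never needs a GPU.
base_model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder id
base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.float32)
print(next(base.parameters()).device)  # -> cpu
```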
I tried that in this context, but it seems that's too large an operation for the Google Colab notebook I'm using, so I'll have to check it out when I'm back at a real PC. I was hoping to make and release a notebook that runs DAM for tiny models within the Colab free T4 allotment. It quits out at "pytorch_model-00001-of-00002.bin: 49% 4.83G/9.94G [03:18<03:28, 24.5MB/s]" of the base model download. This DAM project is incredibly cool and reinvigorates my faith in distributed polysemanticity interpretations for neural network interpretability. Great work! I'll take a look at the logic behind adding a new model and may make a PR for the newer Llama models if I can figure it out. It would be good to be able to use this on SOTA tiny LLM architectures. Thanks!
Confirming: it currently only works for Llama 3 (and Mistral), not Llama 3.2?
Adding new models is super easy, a two-minute thing. @thomasgauthier, maybe we can add a description to the README. @SolshineCode, in the meantime, feel free to open a PR :). Thanks again for your valuable feedback.
It would seem a two-minute thing, but I'm puzzled why it wouldn't just work like Llama 3, since the Llama 3.2 1B model_type is labelled as "llama" in the config on the HF Hub, and the merge.py line you linked already accounts for the llama model_type. Yet my notebook exhibits the same issue, failing to properly save the merged files with the Llama 3.2 1B architecture (noted and screenshotted above).
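For context, a quick way to confirm what is described above, assuming hub access to the gated repo; if the 1B config ties its word embeddings, that would also explain where the shared-memory tensors come from:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
print(cfg.model_type)           # "llama", the same family tag that merge.py dispatches on
print(cfg.architectures)        # ["LlamaForCausalLM"]
print(cfg.tie_word_embeddings)  # if True, lm_head shares storage with the input embeddings
```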
As can be seen here again with Llama 3.2 1B (this time three models listed instead of two):
Results shown in this picture are the same as quoted above.
I submitted a PR changing the save method used in merge.py to save_model.
*EDIT: I now believe what's covered in this comment is a separate issue I'll probably open separately later.* The method in my PR (save_model with safetensors) worked for that code chunk with the newer Llama 3.2 architecture but failed for the original Llama 8B, so I changed it to simply turning off safetensors (using save_pretrained), which works with both Llama versions for that code chunk (executing merge.py). However, I'm then running into another error when trying to train the merged model, and I'm not sure whether it's caused by this change or something else.
I will continue to explore this, and maybe others on the project can provide insights as well. Thank you!!
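For reference, a minimal sketch of the two save paths being compared in this comment, using the standard safetensors / transformers APIs (`merged_model` and the output path are placeholders):

```python
from safetensors.torch import save_model

# Path taken in the PR: safetensors' save_model deduplicates tensors that share
# memory before writing, instead of raising the RuntimeError that save_file does.
save_model(merged_model, "merged_model/model.safetensors")

# Fallback described above: pickle-based .bin shards tolerate shared storage,
# at the cost of losing the safetensors format downstream.
merged_model.save_pretrained("merged_model", safe_serialization=False)
```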
I now believe these are separate issues, both arising from differences between the Llama 3.2 architecture and the original Llama architecture.
The above-noted error from dam/train_dam.py stemmed from my switch away from safetensors to pickle files, so I closed my other PR, since it would have caused this downstream issue, and I am re-investigating how to make this repo compatible with SOTA models such as Llama 3.2. Any input on this issue is warmly welcomed. Thanks!
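One possible way to keep safetensors output, so train_dam.py still sees the checkpoint format it expects, is to break the aliasing before saving; a sketch assuming the shared pair is the tied input/output embeddings:

```python
import torch

# If the output head and the input embeddings share storage, clone one of them so
# the checkpoint no longer contains aliased tensors that safetensors refuses to write.
in_emb = merged_model.get_input_embeddings().weight
out_emb = merged_model.get_output_embeddings().weight
if in_emb.data_ptr() == out_emb.data_ptr():
    merged_model.get_output_embeddings().weight = torch.nn.Parameter(out_emb.detach().clone())
    merged_model.config.tie_word_embeddings = False  # record that the weights are now untied

merged_model.save_pretrained("merged_model")  # safetensors is the default in recent transformers
```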
Description:
I'm encountering an error while trying to merge models using the merge.py script. The process loads the models and processes the layers correctly, but when it attempts to save the merged model, a RuntimeError is raised because tensors share memory. Here's the detailed log. This issue occurs when running the following command in a notebook:
Log Output:
Reproduction Steps:
Run the merge.py script with the following parameters:
The error occurs in the save_pretrained() call when the merged model is being saved.

Expected Behavior:
The merged model should save correctly without errors.
Actual Behavior:
The process fails during the save step due to the model having tensors that share memory. The error suggests using save_model to handle shared tensors more appropriately.

Troubleshooting Attempts:
- Tried torch.save() instead of safetensors, which worked for saving but doesn't resolve the root issue with merge.py.
- lm_head.weights and transformer.wte.embeddings may be the shared tensors causing the problem (a quick check for this is sketched below).
Request for Help:
Any guidance or suggestions to resolve this issue would be greatly appreciated!
Thank you for your time and help!