While attempting to fine-tune Llama 3 70B I ran into an OOM error in the merge step. This left me in a situation where the LoRA adapter was trained and saved on the Modal volume but not yet merged, so I couldn't use the inference script. I don't want to rerun the training. A similar situation occurs when `no-merge-lora` is used during training.
How about a new argument to `launch` in `train.py`, called `merge_only`, that merges the adapter of a resumed run? This setting would only make sense if a `run_to_resume` is also passed.
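A rough sketch of what that flag could look like (the names, signature, and output path here are hypothetical; the real `launch` entrypoint in `train.py` takes more arguments, and `merge_remote` stands in for the actual `merge.remote` call):

```python
# Hypothetical sketch of a merge_only flag for train.py's launch entrypoint.

def merge_remote(run_folder, output_dir):
    # Stand-in for merge.remote(...) in train.py.
    return f"merged {run_folder} -> {output_dir}"

def launch(run_to_resume=None, merge_only=False):
    if merge_only:
        if run_to_resume is None:
            raise ValueError("merge_only requires run_to_resume")
        # Skip training entirely; just merge the adapter already
        # saved on the volume for the resumed run.
        return merge_remote(f"/runs/{run_to_resume}", "lora-out")
    # ... normal training path (elided) ...
    return "training"
```

With this in place, `launch(run_to_resume="my-run", merge_only=True)` would go straight to the merge step and skip training.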
Alternatively, an explanation in the README of how to call the `merge` function separately from `train.py` would help others in this situation.
This worked for me:
`merge.py`:

```python
from .common import app
from .train import merge


@app.local_entrypoint()
def run_merge(run_folder, output_dir):
    merge.remote(run_folder, output_dir)
```
and to run:

```shell
modal run src.merge --run-folder /runs/<run-name> --output-dir lora-out
```
On the merge OOM error: solved by setting `CUDA_VISIBLE_DEVICES=""` so the merge runs in system RAM instead of GPU memory.
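For anyone wanting to apply the same workaround, a minimal sketch of the idea: an empty `CUDA_VISIBLE_DEVICES` hides all GPUs from CUDA, so frameworks like PyTorch fall back to CPU and the merged weights are held in (usually much larger) system RAM. The variable must be set before the framework initializes CUDA:

```python
import os

# Hide all GPUs from CUDA so the merge falls back to CPU / system RAM.
# This must happen before torch (or peft/transformers) touches CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
```

Setting it in the shell before launching (`CUDA_VISIBLE_DEVICES="" python ...`) works just as well.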