Flash attention and multipack failing for qwen and mistral #1966

Open · 6 of 8 tasks
tiger241 opened this issue Oct 12, 2024 · 9 comments
Labels: bug (Something isn't working)

tiger241 commented Oct 12, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

I can run the example config YAMLs on main without significant bugs.

Current behaviour

The first issue I saw was with multipack. The change introduced for auto_batch_size makes the training code sit idle after a few steps (usually at the eval step after one training step).

When I switched back to the older version of multipack, it worked fine. Overall, the newer multipack works on a single GPU but not on multiple GPUs.

Steps to reproduce

I ran the code in a CUDA 12.4 environment with torch 2.4.1, flash-attn 2.6.3, and Triton 3.0.0.
(Older versions of axolotl from around August worked without issue; maybe something changed, but I am unable to find the cause.)
examples/mistral/qlora.yml [found the multipack bug here]

I used this dataset:

```yaml
datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca
```
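
For context, here is a rough sketch of how that dataset block sits alongside the packing and flash-attention settings involved in this report. The values are illustrative assumptions, not copied from examples/mistral/qlora.yml:

```yaml
# Illustrative sketch only -- not the actual examples/mistral/qlora.yml.
base_model: mistralai/Mistral-7B-v0.1   # assumed base model
adapter: qlora
load_in_4bit: true

datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca

sequence_len: 4096              # illustrative value
sample_packing: true            # enables the multipack sampler under discussion
eval_sample_packing: true       # the workaround discussed later sets this to false
flash_attention: true

micro_batch_size: 2             # illustrative value
gradient_accumulation_steps: 4  # illustrative value
```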

Config yaml

No response

Possible solution

The older version of multipack works. I am not sure about the compatibility of the new multipack with DeepSpeed; maybe the communication added in that step interferes with something in DeepSpeed (I tried the zero2 and zero3_bf16 configs).

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

python 3.11

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
tiger241 added the bug label Oct 12, 2024
bursteratom self-assigned this Oct 14, 2024
NanoCode012 (Collaborator) commented:

May I ask what auto_batch_size is?

bursteratom (Collaborator) commented:

@NanoCode012 something like this: https://pytorch-lightning.readthedocs.io/en/1.1.1/training_tricks.html#auto-scaling-of-batch-size , I believe

bursteratom (Collaborator) commented Oct 14, 2024

Just posting what I have so far. I tested on a 2-GPU setup, and after it sat stuck for about an hour I got the following error, which looks like the standard timeout message. That was before I even reached the first training step.

```
frame #6: clone + 0x44 (0x7f021b924bf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W1014 22:35:07.647000 138711449323328 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 618 closing signal SIGTERM
E1014 22:35:07.761000 138711449323328 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 617) of binary: /root/miniconda3/envs/py3.11/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.11/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

axolotl.cli.train FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2024-10-14_22:35:07
  host      : d2eeb895ba3d
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 617)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 617
```

bursteratom (Collaborator) commented:

@tiger241 did you get a similar error message or does it just stay idle without further output?

tiger241 (Author) commented Oct 14, 2024

I got a different one: an NCCL timeout in my case.

I am referring to the changes to the multipack file that were introduced here (according to the commit title, it is related to auto_batch_size):
4e5400c#diff-26bb7717d9a9a9a1ae328e55ef90344b64f2d768a9f072cfafe80a0912537515

The hang happens inside the reduce_and_broadcast function: the broadcast operation does not go through. I verified this by adding print statements and saw that nothing happened at the broadcast call; execution never moved past it.

I have also tried the NCCL settings suggested in the README, but the bug did not go away.
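
For illustration, here is a minimal, hypothetical sketch of that failure mode. It is not axolotl's reduce_and_broadcast, just the general pattern in which one rank enters a collective that another rank never reaches, so the call blocks until the NCCL watchdog aborts the job:

```python
# Hypothetical reproduction of the hang pattern, not axolotl code.
# Run with: torchrun --nproc_per_node=2 hang_demo.py
import torch
import torch.distributed as dist


def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    value = torch.tensor([rank], device="cuda")

    # If only some ranks reach the collective (e.g. one rank takes a
    # different code path around the eval step), the others block here
    # until the NCCL watchdog kills the job -- which looks like "idle"
    # training followed by a timeout/SIGABRT.
    if rank == 0:
        dist.broadcast(value, src=0)  # rank 0 waits here forever
    # rank 1 never calls broadcast, so the collective cannot complete

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```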

bursteratom (Collaborator) commented Oct 15, 2024

@tiger241 Just to clarify, you were able to get it to work on multi-GPU prior to the commit 4e5400c#diff-26bb7717d9a9a9a1ae328e55ef90344b64f2d768a9f072cfafe80a0912537515?

tiger241 (Author) commented:

Yep. Basically I reverted that multipack file to the previous version (before the commit), kept the rest of the code base at current main, and it worked fine.

The broadcast operation called in multipack is somehow causing the issue; the only symptoms I get are the idle hang and the NCCL timeout.

Before the commit, the code estimated the number of batches by assuming the data is distributed uniformly across the GPUs. I am guessing the new communication was added so that the DeepSpeed auto batch size feature could be used, since that would require a precise per-GPU batch count.

Concretely, I used the `def _len_est(self):` function rather than `self.gather_len_batches(len_batches)`, and also removed the gather operation inside `_len_est` so it behaves the way it did before that commit.

With that change there were no issues and training worked fine.
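
To make the difference concrete, here is a rough, hypothetical sketch of the two strategies; the function names are borrowed from the discussion above, but the implementation is an assumption and differs from axolotl's actual code. The estimate is purely local and can never hang, while the exact count requires every rank to enter a collective at the same point:

```python
# Hypothetical sketch of the two batch-count strategies discussed above.
# Not axolotl's actual implementation.
import math

import torch
import torch.distributed as dist


class PackedSamplerSketch:
    def __init__(self, total_token_len, batch_max_len, world_size):
        self.total_token_len = total_token_len  # tokens across the whole dataset
        self.batch_max_len = batch_max_len      # token budget of one packed batch
        self.world_size = world_size

    def _len_est(self):
        # Pre-commit style: a purely local estimate that assumes tokens are
        # spread uniformly across ranks. No communication, so it cannot hang,
        # but each rank's reported length is only approximate.
        tokens_per_rank = self.total_token_len / self.world_size
        return max(1, math.floor(tokens_per_rank / self.batch_max_len))

    def gather_len_batches(self, local_num_batches):
        # Post-commit style: every rank must reach this call at the same point.
        # If one rank takes a different path (e.g. around eval), the remaining
        # ranks block in the collective until the NCCL timeout fires.
        t = torch.tensor([local_num_batches], device="cuda")
        dist.all_reduce(t, op=dist.ReduceOp.MIN)  # agree on a common batch count
        return int(t.item())
```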

bursteratom (Collaborator) commented:

@tiger241 Thank you for the clarification! We were able to get around the idling by setting eval_sample_packing to false. Can you try this and let us know if that temporary fix works for you?

bursteratom (Collaborator) commented Oct 16, 2024

@tiger241 @NanoCode012 I've started a PR that implements a fix for this issue so that eval_sample_packing=True no longer gets stuck.
