[Issue]: RuntimeError: Expected dout_seq_stride == out_seq_stride to be true, but got false. #41

donglixp · 2024-02-03T02:44:35Z

Problem Description

File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fairseq/optim/fp16_optimizer.py", line 393, in backward
loss.backward()
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/autograd/init.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fairscale/nn/checkpoint/checkpoint_activations.py", line 311, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/autograd/init.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/flash_attn/flash_attn_interface.py", line 236, in backward
_flash_attn_backward(
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/flash_attn/flash_attn_interface.py", line 66, in _flash_attn_backward
dq, dk, dv, softmax_d, = flash_attn_cuda.bwd(
RuntimeError: Expected dout_seq_stride == out_seq_stride to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)

Operating System

20.04.6 LTS (Focal Fossa)

CPU

AMD EPYC 7V12 64-Core Processor

GPU

AMD Instinct MI250X

ROCm Version

ROCm 6.0.0, ROCm 5.7.1

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

donglixp · 2024-02-04T02:53:07Z

                q = rearrange(q, '(b h) l d -> b l h d', b=bsz).contiguous()
                k = rearrange(k, '(b h) l d -> b l h d', b=bsz).contiguous()
                v = rearrange(v, '(b h) l d -> b l h d', b=bsz).contiguous()
                print(q.shape)
                print(k.shape)
                print(v.shape)
                attn = flash_attn_func(q, k, v, causal=is_causal)
                attn = rearrange(attn, 'b l h d -> (b h) l d')

The error message:

torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
torch.Size([2, 2048, 48, 64])
Traceback (most recent call last):
  File "/tmp/amlt-code-download/fairseq/train.py", line 14, in <module>
    cli_main()
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fairseq_cli/train.py", line 543, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fairseq/distributed/utils.py", line 365, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fairseq/distributed/utils.py", line 339, in distributed_main
    main(cfg, **kwargs)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fairseq_cli/train.py", line 191, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/opt/conda/envs/py_3.9/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fairseq_cli/train.py", line 307, in train
    log_output = trainer.train_step(samples)
  File "/opt/conda/envs/py_3.9/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fairseq/trainer.py", line 850, in train_step
    raise e
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fairseq/trainer.py", line 818, in train_step
    loss, sample_size_i, logging_output = self.task.train_step(
  File "/tmp/amlt-code-download/fairseq/tasks/gpt.py", line 253, in train_step
    optimizer.backward(loss)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fairseq/optim/fp16_optimizer.py", line 393, in backward
    loss.backward()
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fairscale/nn/checkpoint/checkpoint_activations.py", line 311, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/flash_attn/flash_attn_interface.py", line 236, in backward
    _flash_attn_backward(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/flash_attn/flash_attn_interface.py", line 66, in _flash_attn_backward
    dq, dk, dv, softmax_d, = flash_attn_cuda.bwd(
RuntimeError: Expected dout_seq_stride == out_seq_stride to be true, but got false.

donglixp · 2024-02-04T02:55:55Z

The log shows the program runs well for steps and triggers the bug, rather than encountering the error at the first calling.

donglixp · 2024-02-04T02:58:32Z

I tried both rocm-5.7/6.0 dockers.

donglixp · 2024-02-04T03:45:41Z

The bug is related to the qkv shape:

[1, 2048, 48, 64]: works well

[2, 2048, 48, 64]: triggers the bug

[4, 2048, 48, 64]: triggers the bug

[1, 2048, 24, 128]: triggers the bug

[2, 2048, 24, 128]: triggers the bug

[2, 2048, 25, 128]: triggers the bug

[2, 2048, 24, 124]: works well

[2, 2048, 48, 62]: works well

jayz0123 · 2024-02-04T21:25:19Z

@donglixp Can I have the script you are running?

donglixp · 2024-02-05T03:00:30Z

@howiejayz
VM ROCM 5.6.0
Docker: rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1

Forward and backward with the following shapes:

[2, 2048, 48, 64]: triggers the bug

[4, 2048, 48, 64]: triggers the bug

[1, 2048, 24, 128]: triggers the bug

[2, 2048, 24, 128]: triggers the bug

[2, 2048, 25, 128]: triggers the bug

donglixp · 2024-02-05T03:01:17Z

                q = rearrange(q, '(b h) l d -> b l h d', b=bsz).contiguous()
                k = rearrange(k, '(b h) l d -> b l h d', b=bsz).contiguous()
                v = rearrange(v, '(b h) l d -> b l h d', b=bsz).contiguous()
                attn = flash_attn_func(q, k, v, causal=is_causal)
                attn = rearrange(attn, 'b l h d -> (b h) l d')

donglixp · 2024-02-05T03:04:00Z

Although using [2, 2048, 48, 62] didn't trigger the above error. I found that the job encountered loss divergence issues, while a similar recipe ran successfully before (when the VM ROCM is 5.4 and docker is 5.7).

jayz0123 · 2024-02-05T05:35:01Z

The error seems to be triggered when dout is not contiguous. May I ask how do you generate the dout when passing to the backward?

donglixp · 2024-02-05T07:56:20Z

@howiejayz Yes, they were. The contiguous() was also handled at

flash-attention/flash_attn/flash_attn_interface.py

Line 65 in 68aac13

dout, q, k, v, out = [maybe_contiguous(x) for x in (dout, q, k, v, out)]

jayz0123 · 2024-02-06T08:58:28Z

@donglixp. Thanks. Can I also have the information of the repo and branch you are testing? So I can reproduce your result and see what step goes wrong.

rocking5566 · 2024-07-01T14:13:53Z

We change the backend of flash attention 2in the branch of ck_tile
I also submit an PR to support AMD / ROCm on FlashAttention 2
Dao-AILab#1010
This PR using composable_kernel as backend
I hope this may solve your issue

zixian-wang-amd · 2024-07-01T15:46:45Z

We change the backend of flash attention 2in the branch of ck_tile I also submit an PR to support AMD / ROCm on FlashAttention 2 Dao-AILab#1010 This PR using composable_kernel as backend I hope this may solve your issue

Will I be able to run on other models that used Flash-Attention-2 on Instinct GPUs if the PR is not merged yet? Btw, what is your working email? I can't find your name in Team.

rocking5566 · 2024-07-01T16:27:53Z

We change the backend of flash attention 2in the branch of ck_tile I also submit an PR to support AMD / ROCm on FlashAttention 2 Dao-AILab#1010 This PR using composable_kernel as backend I hope this may solve your issue

Will I be able to run on other models that used Flash-Attention-2 on Instinct GPUs if the PR is not merged yet? Btw, what is your working email? I can't find your name in Team.

[email protected]

carlushuang · 2024-07-17T13:50:28Z

cc @danyao12

donglixp mentioned this issue Feb 3, 2024

[Issue]: Expected dout_seq_stride == out_seq_stride to be true, but got false #40

Open

jayz0123 self-assigned this Feb 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Issue]: RuntimeError: Expected dout_seq_stride == out_seq_stride to be true, but got false. #41

[Issue]: RuntimeError: Expected dout_seq_stride == out_seq_stride to be true, but got false. #41

donglixp commented Feb 3, 2024

donglixp commented Feb 4, 2024

donglixp commented Feb 4, 2024

donglixp commented Feb 4, 2024

donglixp commented Feb 4, 2024 •

edited

Loading

jayz0123 commented Feb 4, 2024

donglixp commented Feb 5, 2024

donglixp commented Feb 5, 2024

donglixp commented Feb 5, 2024

jayz0123 commented Feb 5, 2024

donglixp commented Feb 5, 2024

jayz0123 commented Feb 6, 2024 •

edited

Loading

rocking5566 commented Jul 1, 2024

zixian-wang-amd commented Jul 1, 2024

rocking5566 commented Jul 1, 2024

carlushuang commented Jul 17, 2024

[Issue]: RuntimeError: Expected dout_seq_stride == out_seq_stride to be true, but got false. #41

[Issue]: RuntimeError: Expected dout_seq_stride == out_seq_stride to be true, but got false. #41

Comments

donglixp commented Feb 3, 2024

Problem Description

Operating System

CPU

GPU

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

Additional Information

donglixp commented Feb 4, 2024

donglixp commented Feb 4, 2024

donglixp commented Feb 4, 2024

donglixp commented Feb 4, 2024 • edited Loading

jayz0123 commented Feb 4, 2024

donglixp commented Feb 5, 2024

donglixp commented Feb 5, 2024

donglixp commented Feb 5, 2024

jayz0123 commented Feb 5, 2024

donglixp commented Feb 5, 2024

jayz0123 commented Feb 6, 2024 • edited Loading

rocking5566 commented Jul 1, 2024

zixian-wang-amd commented Jul 1, 2024

rocking5566 commented Jul 1, 2024

carlushuang commented Jul 17, 2024

donglixp commented Feb 4, 2024 •

edited

Loading

jayz0123 commented Feb 6, 2024 •

edited

Loading