multi-gpu: test_model_parallel_beam_search tests fail with "IndexError: list index out of range" #35824

Open
dvrogozh opened this issue Jan 21, 2025 · 2 comments

Comments

@dvrogozh
Contributor

With:

On:

  • 2 Intel(R) Data Center GPU Max 1550 cards (aka PVC). Note: each card has 2 tiles, so 4 torch devices are available in total

test_model_parallel_beam_search tests for the following models fail with "IndexError: list index out of range":

$ cat spec.py
import torch
DEVICE_NAME = 'xpu'
MANUAL_SEED_FN = torch.xpu.manual_seed
EMPTY_CACHE_FN = torch.xpu.empty_cache
DEVICE_COUNT_FN = torch.xpu.device_count

$ TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest -k test_model_parallel_beam_search \
  tests/models/data2vec \
  tests/models/roberta \
  tests/models/roberta_prelayernorm \
  tests/models/xlm_roberta_xl
...
FAILED tests/models/data2vec/test_modeling_data2vec_text.py::Data2VecTextModelTest::test_model_parallel_beam_search - IndexError: list index out of range
FAILED tests/models/roberta/test_modeling_roberta.py::RobertaModelTest::test_model_parallel_beam_search - IndexError: list index out of range
FAILED tests/models/roberta_prelayernorm/test_modeling_roberta_prelayernorm.py::RobertaPreLayerNormModelTest::test_model_parallel_beam_search - IndexError: list index out of range
FAILED tests/models/xlm_roberta_xl/test_modeling_xlm_roberta_xl.py::XLMRobertaXLModelTest::test_model_parallel_beam_search - IndexError: list index out of range

All of the failures look alike. Here is the full log for one of them:

$ TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest tests/models/data2vec/test_modeling_data2vec_text.py::Data2VecTextModelTest::test_model_parallel_beam_search
======================================= test session starts ========================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.5.0
rootdir: /home/dvrogozh/git/huggingface/transformers
configfile: pyproject.toml
plugins: anyio-4.8.0, rich-0.2.0, subtests-0.14.1, xdist-3.6.1, asyncio-0.23.8, timeout-2.3.1, hypothesis-6.122.3, reportlog-0.4.0, dash-2.18.2
asyncio: mode=strict
collected 1 item

tests/models/data2vec/test_modeling_data2vec_text.py F                                       [100%]

============================================= FAILURES =============================================
______________________ Data2VecTextModelTest.test_model_parallel_beam_search _______________________

self = <tests.models.data2vec.test_modeling_data2vec_text.Data2VecTextModelTest testMethod=test_model_parallel_beam_search>

    @require_accelerate
    @require_torch_multi_accelerator
    @pytest.mark.generate
    def test_model_parallel_beam_search(self):
        if "xpu" in torch_device:
            if not (is_ipex_available("2.5") or version.parse(torch.__version__) >= version.parse("2.6")):
                self.skipTest(reason="device_map='auto' does not work with XPU devices")

        for model_class in self.all_generative_model_classes:
            if model_class._no_split_modules is None:
                continue

            config, inputs_dict = self.prepare_config_and_inputs_for_generate()

            model = model_class(config).eval()
            with tempfile.TemporaryDirectory() as tmp_dir:
                model.cpu().save_pretrained(tmp_dir)
>               new_model = model_class.from_pretrained(tmp_dir, device_map="auto")

tests/generation/test_utils.py:693:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/transformers/modeling_utils.py:4195: in from_pretrained
    device_map = infer_auto_device_map(model, dtype=target_dtype, **device_map_kwargs)
../accelerate/src/accelerate/utils/modeling.py:1368: in infer_auto_device_map
    module_size_with_ties, tied_module_names, tied_modules = get_module_size_with_ties(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

tied_params = ['lm_head.decoder.bias'], module_size = 396
module_sizes = defaultdict(<class 'int'>, {'': 147892, 'data2vec_text': 143016, 'data2vec_text.embeddings': 88704, 'data2vec_text.emb...sition_ids': 4096, 'data2vec_text.embeddings.token_type_ids': 4096, 'lm_head.decoder': 0, 'lm_head.decoder.weight': 0})
modules_to_treat = [('lm_head.dense', Linear(in_features=32, out_features=32, bias=True)), ('lm_head.layer_norm', LayerNorm((32,), eps=1e-12, elementwise_affine=True))]

    def get_module_size_with_ties(
        tied_params,
        module_size,
        module_sizes,
        modules_to_treat,
    ) -> Tuple[int, List[str], List[nn.Module]]:
        """
        Calculate the total size of a module, including its tied parameters.

        Args:
            tied_params (`List[str]`): The list of tied parameters.
            module_size (`int`): The size of the module without tied parameters.
            module_sizes (`Dict[str, int]`): A dictionary mapping each layer name to its size.
            modules_to_treat (`List[Tuple[str, nn.Module]]`): The list of named modules to treat.

        Returns:
            `Tuple[int, List[str], List[nn.Module]]`: The total size of the module, the names of the tied modules, and the
            tied modules.
        """
        if len(tied_params) < 1:
            return module_size, [], []
        tied_module_names = []
        tied_modules = []

        for tied_param in tied_params:
>           tied_module_index = [i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")][0]
E           IndexError: list index out of range

../accelerate/src/accelerate/utils/modeling.py:1129: IndexError
--------------------------------------- Captured stderr call ---------------------------------------
If you want to use `Data2VecTextLMHeadModel` as a standalone, add `is_decoder=True.`
If you want to use `Data2VecTextLMHeadModel` as a standalone, add `is_decoder=True.`
===================================== short test summary info ======================================
FAILED tests/models/data2vec/test_modeling_data2vec_text.py::Data2VecTextModelTest::test_model_parallel_beam_search - IndexError: list index out of range
======================================== 1 failed in 2.65s =========================================

Observations:

  1. The failures depend on the number of GPUs across which device_map="auto" spreads the model. The issue happens with 4 XPU devices. It does not happen with 2 XPU devices (run with ZE_AFFINITY_MASK=0,1).
  2. This lookup in get_module_size_with_ties finds no match:
tied_param=lm_head.decoder.bias
modules_to_treat=[('lm_head.dense', Linear(in_features=32, out_features=32, bias=True)), ('lm_head.layer_norm', LayerNorm((32,), eps=1e-12, elementwise_affine=True))]
# which gives:
[i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")]=[]
# so taking index `[0]` of the empty list raises IndexError
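
The empty match can be reproduced without accelerate or any GPU. The sketch below copies the values from the traceback, with the modules replaced by placeholder strings since only the names take part in the prefix check. `find_tied_module_index` is a hypothetical helper written for illustration, not part of accelerate; it only shows one way to detect the missing parent module (`lm_head.decoder` is absent from `modules_to_treat`) rather than indexing an empty list. Whether skipping such a parameter is the right fix is left to the maintainers.

```python
from typing import List, Optional, Tuple

# Values copied from the failing call in the traceback. The nn.Module objects
# are replaced by placeholder strings because only the names are compared.
tied_param = "lm_head.decoder.bias"
modules_to_treat: List[Tuple[str, str]] = [
    ("lm_head.dense", "Linear(in_features=32, out_features=32, bias=True)"),
    ("lm_head.layer_norm", "LayerNorm((32,), eps=1e-12, elementwise_affine=True)"),
]

# Same comprehension as in get_module_size_with_ties, without the trailing [0].
matches = [i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")]
print(matches)  # [] because no entry is named "lm_head.decoder"


def find_tied_module_index(
    tied_param: str, modules_to_treat: List[Tuple[str, str]]
) -> Optional[int]:
    """Hypothetical helper: index of the first module whose name is a prefix
    of the tied parameter, or None when no such module is in the list."""
    return next(
        (i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")),
        None,
    )


missing = find_tied_module_index(tied_param, modules_to_treat)
present = find_tied_module_index("lm_head.dense.weight", modules_to_treat)
print(missing)  # None: the caller can decide how to handle the absent module
print(present)  # 0
```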

CC: @SunMarc @ydshieh @faaany

@Rocketknight1
Member

Seems like a generation thing, so cc @gante

@ydshieh
Collaborator

ydshieh commented Jan 23, 2025

The error occurs at

../accelerate/src/accelerate/utils/modeling.py:1368: in infer_auto_device_map
module_size_with_ties, tied_module_names, tied_modules = get_module_size_with_ties(

So probably not for @gante but @SunMarc ?
