multi-gpu: test_model_parallel_beam_search tests fail with "IndexError: list index out of range" #35824

dvrogozh · 2025-01-21T22:52:06Z

With:

Transformers: 7d4b3dd
huggingface/accelerate@78b8126
pytorch/pytorch@4e4b859 (torch 2.7 candidate)

On:

2 Intel(R) Data Center GPU Max 1550 cards (aka PVC). Note: each card has 2 tiles, so 4 torch devices are available in total.

test_model_parallel_beam_search tests for the following models fail with "IndexError: list index out of range":

$ cat spec.py
import torch
DEVICE_NAME = 'xpu'
MANUAL_SEED_FN = torch.xpu.manual_seed
EMPTY_CACHE_FN = torch.xpu.empty_cache
DEVICE_COUNT_FN = torch.xpu.device_count

$ TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest -k test_model_parallel_beam_search \
  tests/models/data2vec \
  tests/models/roberta \
  tests/models/roberta_prelayernorm \
  tests/models/xlm_roberta_xl
...
FAILED tests/models/data2vec/test_modeling_data2vec_text.py::Data2VecTextModelTest::test_model_parallel_beam_search - IndexError: list index out of range
FAILED tests/models/roberta/test_modeling_roberta.py::RobertaModelTest::test_model_parallel_beam_search - IndexError: list index out of range
FAILED tests/models/roberta_prelayernorm/test_modeling_roberta_prelayernorm.py::RobertaPreLayerNormModelTest::test_model_parallel_beam_search - IndexError: list index out of range
FAILED tests/models/xlm_roberta_xl/test_modeling_xlm_roberta_xl.py::XLMRobertaXLModelTest::test_model_parallel_beam_search - IndexError: list index out of range

The failures are similar in all cases. Here is the full log for one of them:

$ TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest tests/models/data2vec/test_modeling_data2vec_text.py::Data2VecTextModelTest::test_model_parallel_beam_search
======================================= test session starts ========================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.5.0
rootdir: /home/dvrogozh/git/huggingface/transformers
configfile: pyproject.toml
plugins: anyio-4.8.0, rich-0.2.0, subtests-0.14.1, xdist-3.6.1, asyncio-0.23.8, timeout-2.3.1, hypothesis-6.122.3, reportlog-0.4.0, dash-2.18.2
asyncio: mode=strict
collected 1 item

tests/models/data2vec/test_modeling_data2vec_text.py F                                       [100%]

============================================= FAILURES =============================================
______________________ Data2VecTextModelTest.test_model_parallel_beam_search _______________________

self = <tests.models.data2vec.test_modeling_data2vec_text.Data2VecTextModelTest testMethod=test_model_parallel_beam_search>

    @require_accelerate
    @require_torch_multi_accelerator
    @pytest.mark.generate
    def test_model_parallel_beam_search(self):
        if "xpu" in torch_device:
            if not (is_ipex_available("2.5") or version.parse(torch.__version__) >= version.parse("2.6")):
                self.skipTest(reason="device_map='auto' does not work with XPU devices")

        for model_class in self.all_generative_model_classes:
            if model_class._no_split_modules is None:
                continue

            config, inputs_dict = self.prepare_config_and_inputs_for_generate()

            model = model_class(config).eval()
            with tempfile.TemporaryDirectory() as tmp_dir:
                model.cpu().save_pretrained(tmp_dir)
>               new_model = model_class.from_pretrained(tmp_dir, device_map="auto")

tests/generation/test_utils.py:693:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/transformers/modeling_utils.py:4195: in from_pretrained
    device_map = infer_auto_device_map(model, dtype=target_dtype, **device_map_kwargs)
../accelerate/src/accelerate/utils/modeling.py:1368: in infer_auto_device_map
    module_size_with_ties, tied_module_names, tied_modules = get_module_size_with_ties(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

tied_params = ['lm_head.decoder.bias'], module_size = 396
module_sizes = defaultdict(<class 'int'>, {'': 147892, 'data2vec_text': 143016, 'data2vec_text.embeddings': 88704, 'data2vec_text.emb...sition_ids': 4096, 'data2vec_text.embeddings.token_type_ids': 4096, 'lm_head.decoder': 0, 'lm_head.decoder.weight': 0})
modules_to_treat = [('lm_head.dense', Linear(in_features=32, out_features=32, bias=True)), ('lm_head.layer_norm', LayerNorm((32,), eps=1e-12, elementwise_affine=True))]

    def get_module_size_with_ties(
        tied_params,
        module_size,
        module_sizes,
        modules_to_treat,
    ) -> Tuple[int, List[str], List[nn.Module]]:
        """
        Calculate the total size of a module, including its tied parameters.

        Args:
            tied_params (`List[str]`): The list of tied parameters.
            module_size (`int`): The size of the module without tied parameters.
            module_sizes (`Dict[str, int]`): A dictionary mapping each layer name to its size.
            modules_to_treat (`List[Tuple[str, nn.Module]]`): The list of named modules to treat.

        Returns:
            `Tuple[int, List[str], List[nn.Module]]`: The total size of the module, the names of the tied modules, and the
            tied modules.
        """
        if len(tied_params) < 1:
            return module_size, [], []
        tied_module_names = []
        tied_modules = []

        for tied_param in tied_params:
>           tied_module_index = [i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")][0]
E           IndexError: list index out of range

../accelerate/src/accelerate/utils/modeling.py:1129: IndexError
--------------------------------------- Captured stderr call ---------------------------------------
If you want to use `Data2VecTextLMHeadModel` as a standalone, add `is_decoder=True.`
If you want to use `Data2VecTextLMHeadModel` as a standalone, add `is_decoder=True.`
===================================== short test summary info ======================================
FAILED tests/models/data2vec/test_modeling_data2vec_text.py::Data2VecTextModelTest::test_model_parallel_beam_search - IndexError: list index out of range
======================================== 1 failed in 2.65s =========================================

Observations:

Failures are sensitive to the number of GPUs across which device_map=auto places the model. The issue happens with 4 XPU devices. It does not happen with 2 XPU devices (run with ZE_AFFINITY_MASK=0,1).
This lookup comes back empty:

tied_param=lm_head.decoder.bias
modules_to_treat=[('lm_head.dense', Linear(in_features=32, out_features=32, bias=True)), ('lm_head.layer_norm', LayerNorm((32,), eps=1e-12, elementwise_affine=True))]
# which gives:
[i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")]=[]
# so indexing the empty list with `[0]` raises IndexError
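
The empty lookup above can be reproduced outside the test with plain Python, using only the values from the log. This is a minimal sketch: the `nn.Module` objects are replaced by `None` placeholders, and `find_tied_module_index` is a hypothetical helper written for illustration. It is not part of accelerate and is not a proposed fix. It only shows how the missing match could be detected instead of raising.

```python
from typing import List, Optional, Tuple

# Values copied from the failing log; the modules themselves are replaced by
# None placeholders because only the names matter for the lookup.
tied_param = "lm_head.decoder.bias"
modules_to_treat: List[Tuple[str, object]] = [
    ("lm_head.dense", None),
    ("lm_head.layer_norm", None),
]

# Same comprehension as in get_module_size_with_ties, without the trailing [0].
matches = [i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")]
print(matches)  # prints []: no remaining module name is a prefix of 'lm_head.decoder.bias'


def find_tied_module_index(
    tied_param: str, modules_to_treat: List[Tuple[str, object]]
) -> Optional[int]:
    """Hypothetical helper: return the index of the first module whose name is a
    prefix of `tied_param`, or None when no such module is left in the list."""
    return next(
        (i for i, (name, _) in enumerate(modules_to_treat) if tied_param.startswith(name + ".")),
        None,
    )


print(find_tied_module_index(tied_param, modules_to_treat))  # prints None
```

With the list from the log, neither `lm_head.dense.` nor `lm_head.layer_norm.` is a prefix of `lm_head.decoder.bias`, so there is nothing for `[0]` to select.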

CC: @SunMarc @ydshieh @faaany

Rocketknight1 · 2025-01-22T15:24:52Z

Seems like a generation thing, so cc @gante

ydshieh · 2025-01-23T09:51:27Z

The error occurs at

../accelerate/src/accelerate/utils/modeling.py:1368: in infer_auto_device_map
module_size_with_ties, tied_module_names, tied_modules = get_module_size_with_ties(

So probably not for @gante but @SunMarc ?
