2 cards of Intel(R) Data Center GPU Max 1550 (aka PVC). Note: each card has 2 tiles, so 4 torch devices are available in total.
test_model_parallel_beam_search tests for the following models fail with "IndexError: list index out of range":
$ cat spec.py
import torch
DEVICE_NAME = 'xpu'
MANUAL_SEED_FN = torch.xpu.manual_seed
EMPTY_CACHE_FN = torch.xpu.empty_cache
DEVICE_COUNT_FN = torch.xpu.device_count
$ TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest -k test_model_parallel_beam_search \
tests/models/data2vec \
tests/models/roberta \
tests/models/roberta_prelayernorm \
tests/models/xlm_roberta_xl
...
FAILED tests/models/data2vec/test_modeling_data2vec_text.py::Data2VecTextModelTest::test_model_parallel_beam_search - IndexError: list index out of range
FAILED tests/models/roberta/test_modeling_roberta.py::RobertaModelTest::test_model_parallel_beam_search - IndexError: list index out of range
FAILED tests/models/roberta_prelayernorm/test_modeling_roberta_prelayernorm.py::RobertaPreLayerNormModelTest::test_model_parallel_beam_search - IndexError: list index out of range
FAILED tests/models/xlm_roberta_xl/test_modeling_xlm_roberta_xl.py::XLMRobertaXLModelTest::test_model_parallel_beam_search - IndexError: list index out of range
Failures in all failing cases are similar. Here is a full log for one of them:
$ TRANSFORMERS_TEST_DEVICE_SPEC=spec.py python3 -m pytest tests/models/data2vec/test_modeling_data2vec_text.py::Data2VecTextModelTest::test_model_parallel_beam_search
======================================= test session starts ========================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.5.0
rootdir: /home/dvrogozh/git/huggingface/transformers
configfile: pyproject.toml
plugins: anyio-4.8.0, rich-0.2.0, subtests-0.14.1, xdist-3.6.1, asyncio-0.23.8, timeout-2.3.1, hypothesis-6.122.3, reportlog-0.4.0, dash-2.18.2
asyncio: mode=strict
collected 1 item
tests/models/data2vec/test_modeling_data2vec_text.py F [100%]
============================================= FAILURES =============================================
______________________ Data2VecTextModelTest.test_model_parallel_beam_search _______________________
self = <tests.models.data2vec.test_modeling_data2vec_text.Data2VecTextModelTest testMethod=test_model_parallel_beam_search>
@require_accelerate
@require_torch_multi_accelerator
@pytest.mark.generate
def test_model_parallel_beam_search(self):
if "xpu" in torch_device:
if not (is_ipex_available("2.5") or version.parse(torch.__version__) >= version.parse("2.6")):
self.skipTest(reason="device_map='auto' does not work with XPU devices")
for model_class in self.all_generative_model_classes:
if model_class._no_split_modules is None:
continue
config, inputs_dict = self.prepare_config_and_inputs_for_generate()
model = model_class(config).eval()
with tempfile.TemporaryDirectory() as tmp_dir:
model.cpu().save_pretrained(tmp_dir)
> new_model = model_class.from_pretrained(tmp_dir, device_map="auto")
tests/generation/test_utils.py:693:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/transformers/modeling_utils.py:4195: in from_pretrained
device_map = infer_auto_device_map(model, dtype=target_dtype, **device_map_kwargs)
../accelerate/src/accelerate/utils/modeling.py:1368: in infer_auto_device_map
module_size_with_ties, tied_module_names, tied_modules = get_module_size_with_ties(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tied_params = ['lm_head.decoder.bias'], module_size = 396
module_sizes = defaultdict(<class 'int'>, {'': 147892, 'data2vec_text': 143016, 'data2vec_text.embeddings': 88704, 'data2vec_text.emb...sition_ids': 4096, 'data2vec_text.embeddings.token_type_ids': 4096, 'lm_head.decoder': 0, 'lm_head.decoder.weight': 0})
modules_to_treat = [('lm_head.dense', Linear(in_features=32, out_features=32, bias=True)), ('lm_head.layer_norm', LayerNorm((32,), eps=1e-12, elementwise_affine=True))]
def get_module_size_with_ties(
tied_params,
module_size,
module_sizes,
modules_to_treat,
) -> Tuple[int, List[str], List[nn.Module]]:
"""
Calculate the total size of a module, including its tied parameters.
Args:
tied_params (`List[str]`): The list of tied parameters.
module_size (`int`): The size of the module without tied parameters.
module_sizes (`Dict[str, int]`): A dictionary mapping each layer name to its size.
modules_to_treat (`List[Tuple[str, nn.Module]]`): The list of named modules to treat.
Returns:
`Tuple[int, List[str], List[nn.Module]]`: The total size of the module, the names of the tied modules, and the
tied modules.
"""
if len(tied_params) < 1:
return module_size, [], []
tied_module_names = []
tied_modules = []
for tied_param in tied_params:
> tied_module_index = [i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")][0]
E IndexError: list index out of range
../accelerate/src/accelerate/utils/modeling.py:1129: IndexError
--------------------------------------- Captured stderr call ---------------------------------------
If you want to use `Data2VecTextLMHeadModel` as a standalone, add `is_decoder=True.`
If you want to use `Data2VecTextLMHeadModel` as a standalone, add `is_decoder=True.`
===================================== short test summary info ======================================
FAILED tests/models/data2vec/test_modeling_data2vec_text.py::Data2VecTextModelTest::test_model_parallel_beam_search - IndexError: list index out of range
======================================== 1 failed in 2.65s =========================================
Observations:
Failures are sensitive to the number of GPUs across which device_map=auto operates. The issue happens with 4 XPU devices. It does not happen with 2 XPU devices (run with ZE_AFFINITY_MASK=0,1).
This calculation goes off:
tied_param=lm_head.decoder.bias
modules_to_treat=[('lm_head.dense', Linear(in_features=32, out_features=32, bias=True)), ('lm_head.layer_norm', LayerNorm((32,), eps=1e-12, elementwise_affine=True))]
# which gives:
[i for i, (n, _) in enumerate(modules_to_treat) if tied_param.startswith(n + ".")]=[]
# and indexing the empty list with `[0]` raises IndexError
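The lookup above can be reproduced without torch or accelerate. The sketch below uses the same names as the log (`tied_param`, `modules_to_treat`) but substitutes `None` for the real `nn.Module` objects, and `find_tied_module_index` is a hypothetical helper introduced only for illustration. It shows why the match list is empty and one possible way to guard the lookup; it is not the actual accelerate fix.

```python
from typing import Any, List, Tuple


def find_tied_module_index(
    tied_param: str, modules_to_treat: List[Tuple[str, Any]]
) -> int:
    """Hypothetical helper: index of the first module whose name is a
    dotted prefix of `tied_param`, or -1 when no module matches."""
    matches = [
        i
        for i, (n, _) in enumerate(modules_to_treat)
        if tied_param.startswith(n + ".")
    ]
    # Guard against the empty list instead of indexing [0] unconditionally.
    return matches[0] if matches else -1


# Values taken from the failing call; None stands in for the nn.Module objects.
modules_to_treat = [("lm_head.dense", None), ("lm_head.layer_norm", None)]
tied_param = "lm_head.decoder.bias"

# Neither "lm_head.dense." nor "lm_head.layer_norm." is a prefix of
# "lm_head.decoder.bias", so no module matches and the helper returns -1.
print(find_tied_module_index(tied_param, modules_to_treat))
```

How a missing match should be handled (skip the tied parameter, search elsewhere, or raise a clearer error) is a design decision for the accelerate maintainers; the `-1` sentinel here only makes the empty-match case explicit.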
CC: @SunMarc @ydshieh @faaany