Text attribution fails for XLM-Roberta models #123
Comments
I'm getting the same error on CamemBERT (which is based on RoBERTa) as well.
@nishantgurunath, have you managed to solve this issue? I am getting the same error on XLNet base. Here is how to reproduce it:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

model_name = "xlnet-base-cased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

cls_explainer = SequenceClassificationExplainer(model, tokenizer)
word_attributions = cls_explainer("I love you and I like you.")
```
Having the same issue with a fine-tuned xlm-roberta: `IndexError: index out of range in self`
I've managed to solve the problem, it seems. The culprit is `_make_input_reference_token_type_pair` in `explainer.py`:

```python
def _make_input_reference_token_type_pair(
    self, input_ids: torch.Tensor, sep_idx: int = 0
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Returns two tensors indicating the corresponding token types for the `input_ids`
    and a corresponding all zero reference token type tensor.
    Args:
        input_ids (torch.Tensor): Tensor of text converted to `input_ids`
        sep_idx (int, optional): Defaults to 0.
    Returns:
        Tuple[torch.Tensor, torch.Tensor]
    """
    seq_len = input_ids.size(1)
    token_type_ids = torch.tensor(
        [0 if i <= sep_idx else 1 for i in range(seq_len)], device=self.device
    ).expand_as(input_ids)
    ref_token_type_ids = torch.zeros_like(token_type_ids, device=self.device).expand_as(input_ids)
    return (token_type_ids, ref_token_type_ids)
```

The tensor is filled with 0 up to and including `sep_idx` and with 1 for every position after it (in my case the separator was the last token), but XLM-RoBERTa expects all-zero token type ids. So just change the function to something like this:

```python
def _make_input_reference_token_type_pair(
    self, input_ids: torch.Tensor, sep_idx: int = 0
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Returns two tensors indicating the corresponding token types for the `input_ids`
    and a corresponding all zero reference token type tensor.
    Args:
        input_ids (torch.Tensor): Tensor of text converted to `input_ids`
        sep_idx (int, optional): Defaults to 0.
    Returns:
        Tuple[torch.Tensor, torch.Tensor]
    """
    seq_len = input_ids.size(1)
    if self.model.config.model_type == 'xlm-roberta':
        token_type_ids = torch.zeros(seq_len, dtype=torch.int, device=self.device).expand_as(input_ids)
    else:
        token_type_ids = torch.tensor(
            [0 if i <= sep_idx else 1 for i in range(seq_len)], device=self.device
        ).expand_as(input_ids)
    ref_token_type_ids = torch.zeros_like(token_type_ids, device=self.device).expand_as(input_ids)
    return (token_type_ids, ref_token_type_ids)
```
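For context on why the all-zero tensor is the right fix (a standalone sketch, not from the thread): RoBERTa-family models, including XLM-RoBERTa, are configured with `type_vocab_size = 1`, so their token-type embedding table has a single row and any token type id of 1 is out of range. The plain-PyTorch snippet below reproduces the failure mode without downloading a model:

```python
import torch
import torch.nn as nn

# RoBERTa-family models use type_vocab_size = 1: the token-type
# embedding table has exactly one row, so only id 0 is valid.
token_type_embeddings = nn.Embedding(num_embeddings=1, embedding_dim=8)

seq_len, sep_idx = 6, 0
# The original construction: 0 up to sep_idx, 1 for every later position.
mixed_ids = torch.tensor([0 if i <= sep_idx else 1 for i in range(seq_len)])

try:
    token_type_embeddings(mixed_ids)
except IndexError as exc:
    print(exc)  # the same "index out of range in self" seen in this issue

# The patched construction: all zeros, which looks up row 0 everywhere.
zero_ids = torch.zeros(seq_len, dtype=torch.long)
print(token_type_embeddings(zero_ids).shape)  # a valid (6, 8) embedding
```

This is why the error only appears for RoBERTa-style checkpoints: BERT-style models ship a two-row token-type table, so ids of 0 and 1 are both valid there.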
Can this small change be promoted to the main branch, please? I have now spent half a day looking for the problem until I found this still-open issue, which solves otherwise mysterious CUDA errors.
Issue:
I was testing the package on the Hugging Face `xlm-roberta-base` model and it failed with the following error:

```
IndexError: index out of range in self
```

Here's how to reproduce the error: