[Bug]: TypeError: 'Token' object is not subscriptable #3473

Open
kdk2612 opened this issue Jun 21, 2024 · 7 comments
Labels
bug Something isn't working

Comments

kdk2612 commented Jun 21, 2024

Describe the bug

Getting an error when trying to train an NER model on a custom dataset. This was working back in Dec 2023: I trained a model on the same data with Flair version 0.13.1, but I am not sure what has changed since then.
The data contains padded sentences, e.g. "[PAD] X" entries appear in the dataset when a sentence is longer than 512 tokens.

I printed every span the model reads, and I see that for some spans I am getting back a Token object instead. I am not sure what is going on.

To Reproduce

```python
import flair
print(flair.__version__)

import torch
print(torch.__version__)
print(torch.cuda.is_available())

from typing import List

from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, TokenEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# define columns
columns = {0: 'text', 1: 'ner'}

data_folder = "Data/new_sample"

# init a corpus using column format, data folder and the names of the train, dev and test files
corpus: Corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file='train/flair_train_re_iobes.txt',
    test_file='flair_test_re_iobes.txt',
    dev_file='flair_val_re_iobes.txt',
    in_memory=False,
)


# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_label_dictionary(label_type=tag_type)
print(tag_dictionary.idx2item)
print(tag_dictionary.get_items())

# 4. initialize embeddings
# !unset LD_LIBRARY_PATH
embedding_fldr = "EmbeddingData/FreeTextLarge"
embedding_types: List[TokenEmbeddings] = [
    FlairEmbeddings(f'{embedding_fldr}/Fullset_Forward/best-lm.pt'),
    FlairEmbeddings(f'{embedding_fldr}/Fullset_Backward/best-lm.pt'),
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

model_fldr = "NER/FreeText"

# 5. initialize sequence tagger

tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_crf=True,
    use_rnn=True,
    rnn_layers=1,
    word_dropout=0.05,
    locked_dropout=0.5,
    train_initial_hidden_state=False,
    rnn_type='LSTM',
)

# 6. initialize trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train(
    base_path=f'{model_fldr}/NER_EW_all_2',
    train_with_dev=True,
    max_epochs=2,
    learning_rate=0.1,
    mini_batch_size=64,
    monitor_test=True,
    embeddings_storage_mode="cpu",
)
```


Expected behavior

The model would train properly using the CoNLL-format data.

Logs and Stack traces

```stacktrace
This is the SPAN "Span[35:38]: "U. S" - geography (1.0)"
This is the SPAN "Span[287:290]: "U. S" - geography (1.0)"
This is the SPAN "Span[302:305]: "U. S" - geography (1.0)"
This is the SPAN "Span[508:512]: "[PAD] [PAD] [PAD] [PAD]" (1.0)"
This is the SPAN "Token[186]: "D" - B-geography (1.0)" Length is 1: {span}
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/flairtrainnerFreeTextERFE.py", line 160, in <module>
    trainer.train(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/flair/trainers/trainer.py", line 200, in train
    return self.train_custom(**local_variables)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/flair/trainers/trainer.py", line 601, in train_custom
    loss, datapoint_count = self.model.forward_loss(batch)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/flair/models/sequence_tagger_model.py", line 276, in forward_loss
    gold_labels = self._prepare_label_tensor(sentences)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/flair/models/sequence_tagger_model.py", line 427, in _prepare_label_tensor
    gold_labels = self._get_gold_labels(sentences)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/flair/models/sequence_tagger_model.py", line 406, in _get_gold_labels
    sentence_labels[span[0].idx - 1] = "S-" +
TypeError: 'Token' object is not subscriptable
```
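For context, the failing line indexes `span[0]`, which works when the annotated data point is a Span but raises exactly this TypeError when it is a Token. A minimal sketch of the mechanism (an illustration assuming Flair 0.13.x, not code from the issue):

```python
# Minimal sketch of the failure mechanism, assuming Flair 0.13.x.
# Indexing a Sentence with an int yields a Token; Token does not
# implement __getitem__, so subscripting it raises the TypeError.
from flair.data import Sentence

sentence = Sentence("hello world")
token = sentence[0]   # a Token, not a Span
print(token.idx)      # fine: Token exposes an .idx attribute

token[0]              # TypeError: 'Token' object is not subscriptable
```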

Screenshots

No response

Additional Context

No response

Environment

python 3.10
flair 0.13.1

@kdk2612 kdk2612 added the bug Something isn't working label Jun 21, 2024
kdk2612 (Author) commented Jun 21, 2024

@alanakbik Please provide guidance; this is urgent.

alanakbik (Collaborator) commented

Hello @kdk2612, the training script looks good, but I don't see where the printouts in the stack trace are coming from ("This is the SPAN ..."). Did you modify other parts of the code?

Are you calling get_labels() somewhere near this printout? If you want only the span annotations for NER, you should call get_labels('ner') instead, as otherwise it will also iterate over the token-level annotations.

For us to be able to help, we'd need a runnable script (including dataset and embeddings) that throws the error. But from the printouts, I assume the error is thrown in custom code outside the library.
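A minimal sketch of the distinction (illustrative only; it assumes Flair 0.13.x, where slicing a Sentence yields a Span, and the sentence and labels are made up):

```python
# get_labels() vs. get_labels('ner') on a flair Sentence.
from flair.data import Sentence

sentence = Sentence("George Washington went to Washington .")

# span-level NER annotation over the first two tokens
sentence[0:2].add_label("ner", "PER")

# token-level annotation of a different type on the first token
sentence[0].add_label("pos", "NNP")

print(sentence.get_labels("ner"))  # only the span-level PER annotation
print(sentence.get_labels())       # every label, including the token-level one
```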

kdk2612 (Author) commented Jun 21, 2024

Yes, I added the print statements in _get_gold_labels(); other than that I am not making any changes.
This happens during training and evaluation of the model. I am not able to finish training because of the above error.

The error is happening because I have some tokens in the format "[PAD] X", i.e. token + label. My assumption is that the error occurs because the X label is expected to have a prefix such as "S-" or "B-".
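If that assumption holds, one way to fix the data up front is to rewrite the bare labels so every non-O tag carries a BIOES prefix. A hypothetical preprocessing sketch (the helper and file names are placeholders, not from the issue):

```python
# Rewrite bare labels (e.g. "X") in a two-column CoNLL-style file to a
# BIOES-prefixed form (e.g. "S-X"). Hypothetical helper; paths are examples.
def add_prefix_to_bare_labels(in_path: str, out_path: str, prefix: str = "S-") -> None:
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            parts = line.split()
            if len(parts) == 2:
                token, label = parts
                # leave "O" and already-prefixed labels (B-/I-/E-/S-) untouched
                if label != "O" and "-" not in label:
                    label = prefix + label
                dst.write(f"{token} {label}\n")
            else:
                dst.write(line)  # blank sentence separators pass through

add_prefix_to_bare_labels("flair_train_re_iobes.txt", "flair_train_re_iobes_fixed.txt")
```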

alanakbik (Collaborator) commented

So the label is only "X"? Have you tried replacing the label with "B-X"?

Could you paste a part of the column corpus in plain text?

kdk2612 (Author) commented Jun 21, 2024

Unfortunately I can't share the data, but here is what it looks like:

Token O
Token O
Token O
Token O
Token O
Token O
, O
Token O
[PAD] X

Token O
Token O
Token O
Token O
Token O
Token O
' O
Token O
Token O
Token O
Token O
Token O
Token O
Token O
Token O
. O
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X
[PAD] X

alanakbik (Collaborator) commented

Ok, can you try replacing "X" with "B-X"? It should work then.

kdk2612 (Author) commented Jun 21, 2024

Yes, I used "S-" instead of "B-".
Will let you know how this goes. Thanks for taking the time.
