def tokenize(self, text: str):
    """ tokenize input"""
    words = word_tokenize(text)
    tokens = []
    valid_positions = []
    for i,word in enumerate(words):
        token = self.tokenizer.tokenize(word)
        tokens.extend(token)
        for i in range(len(token)):
            if i == 0:
                valid_positions.append(1)
            else:
                valid_positions.append(0)
    return tokens, valid_positions
What does the third `i` refer to in the `i == 0` check?
Maybe the second for-loop should use a different iteration variable.
The data is in the following format: TOKEN NNP B-NP O
So inside the for-loop, after `self.tokenizer.tokenize(word)` splits a word into sub-tokens, only the first sub-token of each word is marked with a 1; every following sub-token gets a 0. This masks out everything except the first sub-token of each word (see 'attention_mask' on https://huggingface.co/transformers/model_doc/bert.html#bertmodel).
The first `i` (from `enumerate`) is unused, so it is optional. The second and third `i` are the same variable: the inner loop's counter, which shadows the outer one. It was probably written that way just to reuse a name, so renaming the inner loop variable would indeed be clearer.
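A minimal sketch of the masking logic described above, with the inner loop variable renamed as suggested. Note the `stub_subword_tokenize` helper is hypothetical: it stands in for `self.tokenizer.tokenize` (the WordPiece tokenizer in the original code) so the example is self-contained:

```python
def stub_subword_tokenize(word):
    # Hypothetical stand-in for BertTokenizer.tokenize: split words longer
    # than 4 characters into WordPiece-style "##" pieces.
    if len(word) <= 4:
        return [word]
    return [word[:4]] + ["##" + word[i:i + 4] for i in range(4, len(word), 4)]

def tokenize(text):
    """Tokenize input; mark only the first sub-token of each word with 1."""
    words = text.split()  # word_tokenize(text) in the original
    tokens = []
    valid_positions = []
    for word in words:  # the unused `i` from enumerate is dropped
        pieces = stub_subword_tokenize(word)
        tokens.extend(pieces)
        for j in range(len(pieces)):  # renamed inner loop variable
            valid_positions.append(1 if j == 0 else 0)
    return tokens, valid_positions

tokens, valid = tokenize("Washington is nice")
print(tokens)  # ['Wash', '##ingt', '##on', 'is', 'nice']
print(valid)   # [1, 0, 0, 1, 1]
```

Each word contributes exactly one 1 to `valid_positions`, so the labels (one per original word) can later be aligned with the sub-token sequence.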
bert.py#L49