Paht's research indicates there may be a 25,000-token limit in SciBERT/Huggingface. Is this limitation present in MATCH's tokenizer? Is 25,000 tokens enough to cover the vocabulary we are dealing with? Would increasing this number improve precision/recall?
See if using stemming/lemmatization to reduce the number of distinct tokens improves precision/recall (see the vocabulary-size sketch after this list).
Input sequences are truncated to 500 tokens in MATCH. This may result in truncated abstracts (not an issue in our current dataset; see Eric's comment below).
See if using the full abstract improves precision/recall, either by chunking or by using stemming/lemmatization to reduce the number of tokens. (Not an issue at this time.)
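As a rough way to probe the 25,000-token question and the stemming/lemmatization idea above, a sketch like the one below could count the corpus vocabulary raw vs. stemmed vs. lemmatized and compare it against the cap. None of these names come from the MATCH code; it assumes a plain list of metadata + title + abstract strings and requires nltk with its `wordnet` data.

```python
# Rough sketch, not part of MATCH: count distinct tokens in our corpus
# raw vs. stemmed vs. lemmatized, and compare against a hypothetical
# 25,000-token vocabulary cap. Requires nltk and nltk.download('wordnet').
from collections import Counter

from nltk.stem import PorterStemmer, WordNetLemmatizer

VOCAB_CAP = 25_000  # the limit Paht found for SciBERT/Huggingface

def vocab_sizes(corpus):
    """corpus: iterable of strings (metadata + title + abstract per paper)."""
    stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
    raw, stemmed, lemmatized = Counter(), Counter(), Counter()
    for doc in corpus:
        for tok in doc.lower().split():
            raw[tok] += 1
            stemmed[stemmer.stem(tok)] += 1
            lemmatized[lemmatizer.lemmatize(tok)] += 1
    return len(raw), len(stemmed), len(lemmatized)

n_raw, n_stem, n_lemma = vocab_sizes(["example title and abstract text ..."])
print(f"raw={n_raw} stemmed={n_stem} lemmatized={n_lemma} (cap={VOCAB_CAP})")
```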
Eric: MATCH pads/truncates its input token sequences to a default of 500 tokens (see truncate_text in auto-labeler/MATCH/src/MATCH/deepxml/data_utils.py); my scripts don't mess with that. My intuition is that we don't need the full abstract and that the first 500 words (minus the number of metadata tokens) should be enough.
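For reference, the behaviour described above amounts to something like the following minimal pad-or-truncate sketch (this is not the actual truncate_text code, just an illustration of the effect):

```python
import numpy as np

# Minimal sketch of pad-or-truncate behaviour; not the actual truncate_text
# from MATCH's deepxml/data_utils.py. Every token-id sequence comes out
# exactly max_len long: long inputs are cut off, short ones right-padded.
def pad_or_truncate(token_ids, max_len=500, pad_id=0):
    ids = list(token_ids)[:max_len]            # drop anything past max_len
    ids += [pad_id] * (max_len - len(ids))     # pad short sequences to max_len
    return np.asarray(ids, dtype=np.int64)
```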
Eric: Yes, the full abstract would be strictly more useful than only part of it. However, I think the difference is negligible in our dataset; only 2 of 1149 papers in my training and test sets have a token sequence (metadata + title + abstract) longer than 500 tokens.
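To reproduce that count on a new dataset, something like the sketch below would work. It assumes the whitespace-tokenized, one-document-per-line text files we feed to MATCH; the file name is illustrative.

```python
# Sketch: count how many documents exceed MATCH's 500-token input length,
# assuming one whitespace-tokenized document (metadata + title + abstract)
# per line. The file name below is illustrative.
MAX_LEN = 500

def count_over_length(path, max_len=MAX_LEN):
    over = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += 1
            if len(line.split()) > max_len:
                over += 1
    return over, total

over, total = count_over_length("train_texts.txt")
print(f"{over} of {total} documents exceed {MAX_LEN} tokens")
```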
Paht: there are several ways of handling this.
We break the abstract down into 500-word chunks and give all chunks the same labels.
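A minimal sketch of that chunking idea (illustrative names only; this is not part of the MATCH codebase) could look like:

```python
# Sketch of the chunking idea: split a long token sequence into 500-token
# chunks and pair every chunk with the paper's full label set.
MAX_LEN = 500

def chunk_document(tokens, labels, max_len=MAX_LEN):
    """Yield (chunk, labels) pairs covering the whole token sequence."""
    for start in range(0, len(tokens), max_len):
        yield tokens[start:start + max_len], labels

tokens = ["metadata", "title"] + ["word"] * 1200   # 1202 tokens total
labels = ["label_a", "label_b"]
chunks = list(chunk_document(tokens, labels))
print(len(chunks))  # -> 3 chunks, each paired with the same label set
```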