Explore tokenizer optimization strategies to improve precision/recall. #84

Open · 1 task done
bruffridge opened this issue Sep 2, 2021 · 0 comments

bruffridge (Member) commented Sep 2, 2021

Optimization Strategies:

  1. Paht's research indicates there may be a 25,000-token vocabulary limit in SciBERT/Huggingface. Is this limitation present in MATCH's tokenizer? Is 25,000 tokens enough to cover the vocabulary we are dealing with? Would increasing this number improve precision/recall? (A coverage-check sketch follows this list.)

  2. See whether using stemming/lemmatization to reduce the number of tokens improves precision/recall. (A stemming/lemmatization sketch follows this list.)

  3. Input sequences are truncated to 500 tokens in MATCH. This may result in truncated abstracts (not an issue in our current dataset; see Eric's comments below).

  • See if using the full abstract improves precision/recall, either by chunking or by stemming/lemmatization to reduce the number of tokens. (Not an issue at this time.)
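
For item 1, one quick way to test whether 25,000 tokens covers our corpus is to count token frequencies directly. A minimal sketch, assuming a hypothetical `train_texts.txt` with one preprocessed document per line (the file name and format are assumptions, not part of MATCH):

```python
# Rough check of how well a 25,000-token vocabulary covers the corpus.
# train_texts.txt is a hypothetical file: one preprocessed document per line.
from collections import Counter

counts = Counter()
with open("train_texts.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

total = sum(counts.values())
covered = sum(c for _, c in counts.most_common(25000))
print(f"unique tokens: {len(counts)}")
print(f"coverage of the 25,000 most frequent tokens: {covered / total:.1%}")
```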
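
For item 2, a minimal sketch of what the stemming vs. lemmatization comparison could look like, using NLTK (one option among several; whether either variant actually helps precision/recall is exactly what the experiment would measure):

```python
# Compare stemming vs. lemmatization as ways to shrink the token vocabulary.
# Uses NLTK; any equivalent library (e.g. spaCy) would work just as well.
import nltk
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

tokens = ["tokenizers", "running", "studies", "materials"]
print([stemmer.stem(t) for t in tokens])          # aggressive: may produce non-words
print([lemmatizer.lemmatize(t) for t in tokens])  # conservative: dictionary forms
```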

Eric: MATCH pads/truncates its input token sequences to a default of 500 tokens (see truncate_text in auto-labeler/MATCH/src/MATCH/deepxml/data_utils.py); my scripts don't change that. My intuition is that we don't need the full abstract and that the first 500 words (minus the number of metadata tokens) should be enough.

Eric: Yes, the full abstract would be strictly more useful than only part of it. However, I think the difference is negligible in our dataset; only 2 of the 1,149 papers in my training and test sets have a token sequence (metadata + title + abstract) longer than 500 tokens.
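
For reference, the pad-or-truncate behavior Eric describes looks roughly like this (a simplified sketch, not the actual truncate_text implementation):

```python
# Simplified sketch of MATCH-style pad/truncate to a fixed length; the real
# logic is truncate_text in auto-labeler/MATCH/src/MATCH/deepxml/data_utils.py.
def pad_or_truncate(tokens, max_len=500, pad="<PAD>"):
    tokens = tokens[:max_len]                        # anything past max_len is lost
    return tokens + [pad] * (max_len - len(tokens))  # right-pad short sequences

sequence = ["venue_tok", "author_tok", "title_tok"] + ["abstract_tok"] * 600
print(len(pad_or_truncate(sequence)))  # 500 -- the abstract's tail is dropped
```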

Paht: There are several ways of handling this.

  • We break the abstract down into 500-word chunks and give all chunks the same labels (sketched below).
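
A minimal sketch of that chunking idea (a hypothetical helper, not existing MATCH code):

```python
# Split a long token sequence into fixed-size chunks, each inheriting
# the paper's full label set. Hypothetical helper, not part of MATCH.
def chunk_with_labels(tokens, labels, chunk_size=500):
    return [(tokens[i:i + chunk_size], labels)
            for i in range(0, len(tokens), chunk_size)]

tokens = ["tok"] * 1200
labels = ["label_a", "label_b"]
for chunk, chunk_labels in chunk_with_labels(tokens, labels):
    print(len(chunk), chunk_labels)  # 500, 500, 200 -- same labels each time
```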