Explore tokenizer optimization strategies to improve precision/recall. #84

Open · 1 task done
bruffridge opened this issue Sep 2, 2021 · 0 comments

bruffridge (Member) commented Sep 2, 2021

Optimization Strategies:

  1. Paht's research indicates there may be a 25,000-token vocabulary limit in SciBERT/Huggingface. Is this limitation present in MATCH's tokenizer? Is 25,000 tokens enough to cover the vocabulary we are dealing with? Would increasing this number improve precision/recall? (A coverage-check sketch follows this list.)

  2. See whether using stemming/lemmatization to reduce the number of tokens improves precision/recall. (A stemming/lemmatization sketch follows this list.)

  3. Input sequences are truncated to 500 tokens in MATCH. This may result in truncated abstracts (not an issue in our current dataset; see Eric's comments below).

  • See if using the full abstract improves precision/recall, either by chunking or by stemming/lemmatization to reduce the number of tokens. (Not an issue at this time.)
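
For item 1, one quick way to test whether 25,000 tokens covers our corpus is to count token frequencies directly. A minimal sketch, assuming a hypothetical `train_texts.txt` with one preprocessed document per line (the file name and format are assumptions, not part of MATCH):

```python
# Rough check of how well a 25,000-token vocabulary covers the corpus.
# train_texts.txt is a hypothetical file: one preprocessed document per line.
from collections import Counter

counts = Counter()
with open("train_texts.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

total = sum(counts.values())
covered = sum(c for _, c in counts.most_common(25000))
print(f"unique tokens: {len(counts)}")
print(f"coverage of the 25,000 most frequent tokens: {covered / total:.1%}")
```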
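
For item 2, a minimal sketch of what the stemming vs. lemmatization comparison could look like, using NLTK (one option among several; whether either variant actually helps precision/recall is exactly what the experiment would measure):

```python
# Compare stemming vs. lemmatization as ways to shrink the token vocabulary.
# Uses NLTK; any equivalent library (e.g. spaCy) would work just as well.
import nltk
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

tokens = ["tokenizers", "running", "studies", "materials"]
print([stemmer.stem(t) for t in tokens])          # aggressive: may produce non-words
print([lemmatizer.lemmatize(t) for t in tokens])  # conservative: dictionary forms
```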

Eric: MATCH pads/truncates its input token sequences to a default of 500 tokens (see truncate_text in auto-labeler/MATCH/src/MATCH/deepxml/data_utils.py); my scripts don't change that. My intuition is that we don't need the full abstract and that the first 500 words (minus the number of metadata tokens) should be enough.

Eric: Yes, the full abstract would be strictly more useful than only part of it. However, I think the difference is negligible in our dataset; only 2 of the 1,149 papers in my training and test sets have a token sequence (metadata + title + abstract) longer than 500 tokens.
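
For reference, the pad-or-truncate behavior Eric describes looks roughly like this (a simplified sketch, not the actual truncate_text implementation):

```python
# Simplified sketch of MATCH-style pad/truncate to a fixed length; the real
# logic is truncate_text in auto-labeler/MATCH/src/MATCH/deepxml/data_utils.py.
def pad_or_truncate(tokens, max_len=500, pad="<PAD>"):
    tokens = tokens[:max_len]                        # anything past max_len is lost
    return tokens + [pad] * (max_len - len(tokens))  # right-pad short sequences

sequence = ["venue_tok", "author_tok", "title_tok"] + ["abstract_tok"] * 600
print(len(pad_or_truncate(sequence)))  # 500 -- the abstract's tail is dropped
```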

Paht: There are several ways of handling this.

  • We break the abstract down into 500-word chunks and give all chunks the same labels (sketched below).
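
A minimal sketch of that chunking idea (a hypothetical helper, not existing MATCH code):

```python
# Split a long token sequence into fixed-size chunks, each inheriting
# the paper's full label set. Hypothetical helper, not part of MATCH.
def chunk_with_labels(tokens, labels, chunk_size=500):
    return [(tokens[i:i + chunk_size], labels)
            for i in range(0, len(tokens), chunk_size)]

tokens = ["tok"] * 1200
labels = ["label_a", "label_b"]
for chunk, chunk_labels in chunk_with_labels(tokens, labels):
    print(len(chunk), chunk_labels)  # 500, 500, 200 -- same labels each time
```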