This repository contains supplementary data and links to the model and corpora used in the paper "Transfer learning for biomedical named entity recognition with neural networks".
Corpora pre-processing steps were collected in a single script, with an accompanying Jupyter notebook for ease of use. Both the script and the notebook can be found in `code`.
The model used in this study is NeuroNER [1], a domain-independent named entity recognizer (NER) based on a bidirectional long short-term memory network combined with a conditional random field (LSTM-CRF). A repository for the model can be found here.
NeuroNER uses standard Python config files to specify hyperparameters. We provide three of these config files for reproducibility (see `code/configs`):
- `baseline.ini`: config used while training on the target data sets (i.e., the baseline).
- `source.ini`: config used while training on the source data sets.
- `transfer.ini`: config used while transferring a model trained on the source data set for training on a target data set.
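To see which hyperparameters change between two of these config files, something like the following may help. It is a minimal sketch that uses only Python's standard `configparser` module; the helper names `load_config` and `diff_configs` are hypothetical and are not part of NeuroNER, and no particular section or option names are assumed.

```python
import configparser


def load_config(path):
    """Read an INI-style config file into a {section: {option: value}} dict."""
    # interpolation=None keeps literal '%' characters in values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    with open(path, encoding="utf-8") as f:
        parser.read_file(f)
    return {section: dict(parser[section]) for section in parser.sections()}


def diff_configs(a, b):
    """Return {(section, option): (value_in_a, value_in_b)} for settings that differ.

    A value of None means the option is absent from that config.
    """
    keys = {(s, k) for cfg in (a, b) for s, opts in cfg.items() for k in opts}
    diffs = {}
    for section, key in sorted(keys):
        value_a = a.get(section, {}).get(key)
        value_b = b.get(section, {}).get(key)
        if value_a != value_b:
            diffs[(section, key)] = (value_a, value_b)
    return diffs


# Example usage (paths assume the layout described above):
# baseline = load_config("code/configs/baseline.ini")
# transfer = load_config("code/configs/transfer.ini")
# for (section, key), (old, new) in diff_configs(baseline, transfer).items():
#     print(f"[{section}] {key}: {old} -> {new}")
```

Note that `configparser` lowercases option names by default, so the comparison is case-insensitive with respect to option names.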
The word embeddings used in this study were obtained from here [2]. Code for converting the word vectors to the `.txt` format necessary for use with NeuroNER can be found in the Jupyter notebook in `code`, under data cleaning.
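For readers who want a rough idea of what such a conversion involves, here is a minimal sketch. It is not the notebook's code. It assumes the downloaded vectors are stored in the word2vec binary format (a text header `vocab_size dim`, then each word followed by a space and `dim` little-endian 32-bit floats) and that the target `.txt` format is one word per line followed by its space-separated vector values, with no header line. Both assumptions should be checked against the actual files; the function names are hypothetical.

```python
import struct


def read_word2vec_binary(path):
    """Yield (word, vector) pairs from a word2vec-style binary file."""
    with open(path, "rb") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            chars = bytearray()
            while True:
                ch = f.read(1)
                if ch in (b" ", b""):  # a space ends the word; b"" is end of file
                    break
                if ch != b"\n":  # skip the newline that trails the previous vector
                    chars.extend(ch)
            vector = struct.unpack("<%df" % dim, f.read(4 * dim))
            yield chars.decode("utf-8", errors="replace"), vector


def convert_to_txt(bin_path, txt_path):
    """Write one 'word v1 v2 ... vN' line per word; return the number of words."""
    count = 0
    with open(txt_path, "w", encoding="utf-8") as out:
        for word, vector in read_word2vec_binary(bin_path):
            out.write(word + " " + " ".join("%.6f" % v for v in vector) + "\n")
            count += 1
    return count


# Example usage (file names are placeholders):
# convert_to_txt("vectors.bin", "vectors.txt")
```

The reader streams one word at a time, so memory use stays small even for large vocabularies.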
All corpora used in this study that can be redistributed are in the `corpora` folder, in Brat standoff format.
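As a quick way to inspect the annotations, the entity lines of a Brat standoff `.ann` file can be read with a few lines of Python. This is a minimal sketch, not code from this repository. It handles only text-bound annotations (IDs starting with `T`, in the form `ID<TAB>Type start end<TAB>text`, with discontinuous spans separated by `;`) and skips relations, events, and notes. The names `Entity` and `parse_brat_entities` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    id: str
    label: str
    spans: List[Tuple[int, ...]]  # (start, end) character offsets
    text: str


def parse_brat_entities(ann_text):
    """Extract text-bound annotations (T lines) from the contents of a .ann file."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # relations (R), events (E), notes (#), etc. are ignored
        parts = line.split("\t", 2)
        if len(parts) < 3:
            continue  # malformed line
        ann_id, type_and_offsets, text = parts
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous entities list several "start end" fragments joined by ';'.
        spans = [tuple(int(x) for x in frag.split()) for frag in offsets.split(";")]
        entities.append(Entity(ann_id, label, spans, text))
    return entities


# Example usage (the path is a placeholder):
# with open("corpora/some_corpus/doc.ann", encoding="utf-8") as f:
#     for entity in parse_brat_entities(f.read()):
#         print(entity.label, entity.spans, entity.text)
```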
Data can be uncompressed with the following command: `tar -zxvf <name_of_corpora>`.
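To unpack every archive in one go, a short Python script built on the standard `tarfile` module is one option. This is a minimal sketch that assumes the archives are gzip-compressed tarballs whose names end in `.tar.gz` (the `-z` flag above suggests gzip, but the file extension is an assumption); `extract_corpora` is a hypothetical helper name.

```python
import tarfile
from pathlib import Path


def extract_corpora(corpora_dir, pattern="*.tar.gz"):
    """Extract every matching archive into corpora_dir; return the archive names."""
    corpora_dir = Path(corpora_dir)
    extracted = []
    for archive in sorted(corpora_dir.glob(pattern)):
        # Only extract archives you trust: members may write outside the target folder.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=corpora_dir)
        extracted.append(archive.name)
    return extracted


# Example usage:
# print(extract_corpora("corpora"))
```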
Alternatively, the corpora can be publicly accessed at the following links:
Corpus | Text Genre | Standard | Entities | Publication |
---|---|---|---|---|
AZDC | Scientific Article | Gold | diseases | link |
BioCreative II GM | Scientific Article | Gold | genes/proteins | link |
BioInfer | Scientific Article | Gold | genes/proteins | link |
BioSemantics | Patent | Gold | chemicals, diseases | link |
CALBC-III-Small | Scientific Article | Silver | chemicals, diseases, species, genes/proteins | link |
CDR | Scientific Article | Gold | chemicals, diseases | link |
CellFinder | Scientific Article | Gold | species, genes/proteins, cells, anatomy | link |
CHEMDNER Patent | Patent | Gold | chemicals | link |
DECA | Scientific Article | Gold | genes/proteins | link |
FSU-PRGE | Scientific Article | Gold | genes/proteins | link |
Linnaeus | Scientific Article | Gold | species | link |
LocText | Scientific Article | Gold | species, genes/proteins | link |
IEPA | Scientific Article | Gold | genes/proteins | link |
miRNA | Scientific Article | Gold | diseases, species, genes/proteins | link |
NCBI disease | Scientific Article | Gold | diseases | link |
S800 | Scientific Article | Gold | species | link |
Variome | Scientific Article | Gold | diseases, species, genes/proteins | link |
Many of these corpora can also be accessed and visualized in the browser here [3].
The supplementary data can be found in the file `supplementary/additional_file_1.pdf`. Additionally, blacklists used for the silver-standard corpora (SSCs) can be found in `supplementary/blacklists`.
- Dernoncourt, F., Lee, J. Y., & Szolovits, P. (2017). NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. arXiv preprint arXiv:1705.05487.
- Pyysalo, S., Ginter, F., Moen, H., Salakoski, T., & Ananiadou, S. (2013). Distributional semantics resources for biomedical text processing. In Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan (pp. 39-43).
- Stenetorp, P., Topić, G., Pyysalo, S., Ohta, T., Kim, J. D., & Tsujii, J. I. (2011, June). BioNLP shared task 2011: Supporting resources. In Proceedings of the BioNLP Shared Task 2011 Workshop (pp. 112-120). Association for Computational Linguistics.