Project: Neural Named Entity Recognition for Scientific and Vernacular Plant Names
Author: Isabel Meraner
Institute of Computational Linguistics, University of Zurich (Switzerland), 2019
The resources and scripts in this repository were created for a master's thesis project on "Neural Named Entity Recognition for Scientific and Vernacular Plant Names" at the University of Zurich.
https://www.cl.uzh.ch/en/studies/theses/lic-master-theses.html
The main focus of the project was to identify and disambiguate scientific and vernacular plant names across multiple German and English text genres, and to provide a tool for extracting and preserving (ethno-)botanical knowledge.
If you have any questions or suggestions concerning this project, please don't hesitate to contact me!
This repository contains two subfolders “SCRIPTS” and “RESOURCES”.
In the RESOURCES folder, you can find sample output material and data resources.
Please note that the bi-LSTM-CRF architecture used for training was developed by Lample et al. (2016):
Lample et al. (2016). Neural Architectures for Named Entity Recognition. URL: https://arxiv.org/abs/1603.01360
The adapted files from the bi-LSTM-CRF tagger by Lample et al. (2016) can be found under 'scripts/web_interface/tagger-master/'.
• plantblog_corpus_{de|en}.tok.pos.iob.txt
• wiki_abstractcorpus_{de|en}.tok.pos.iob.txt
• TextBerg_subcorpus_{de|en}.tok.pos.iob.txt
• botlit_corpus_{de|en}.tok.pos.iob.txt
• combined.test.fold1GOLD_{de|en}.txt
• test_fungi_{de|en}.tok.pos.iobGOLD.txt
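The file suffix .tok.pos.iob suggests tokenized, POS-tagged and IOB-annotated text. The following is a minimal sketch for loading such a file, assuming a CoNLL-style layout (one token per line, whitespace-separated columns with the token first, the POS tag second and the IOB label last, and a blank line between sentences). This layout, the function name read_iob_sentences and the label names in the sample are assumptions for illustration; check them against the actual corpus files.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str, str]  # (token, POS tag, IOB label)


def read_iob_sentences(lines: Iterable[str]) -> List[List[Token]]:
    """Group column-formatted lines into sentences; a blank line ends a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        if len(columns) < 3:
            raise ValueError(f"expected at least 3 columns, got: {line!r}")
        # Assumed layout: token first, POS tag second, IOB label last.
        current.append((columns[0], columns[1], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample; the label names are placeholders, not the repository's tag set.
sample = """Die ART O
Rotbuche NN B-de_species
wächst VVFIN O

Fagus NE B-lat_species
sylvatica NE I-lat_species
"""
sentences = read_iob_sentences(sample.splitlines())
print(len(sentences), "sentences read")
```

To read a real file, pass an open file handle instead of sample.splitlines(), for example open(path, encoding="utf-8").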
Due to copyright restrictions, these gazetteers comprise only a subset of our original gazetteers, based on plant names retrieved from Wikipedia.
• de_fam.txt
• de_species.txt
• en_fam.txt
• en_species.txt
• lat_fam.txt
• lat_species.txt
• lat_genus.txt
• lat_subfam.txt
• lat_class.txt
• lat_order.txt
• lat_phylum.txt
• {de|en}_lat_referencedatabase.tsv
• model_combined_chardim29_de
• model_wiki_dropout0.3_de
• model_tb_dropout0.7_de
• model_plantblog_capdim1_de
• model_botlit_dropout0.3_de
• model_combined_dropout0.7_en
• model_wiki_chardim29_en
• model_tb_capdim1_en
• model_plantblog_chardim50_en
• model_s800_dropout0.7_en
• model_wiki_crosscorpus_de_dropout0.3 (cross-corpus setting)
• model_wiki_crosscorpus_de_capdim1 (fungi test set)
• model_wiki_crosscorpus_en_preemb_dropout0.5 (cross-corpus setting)
• model_wiki_crosscorpus_en_capdim1 (fungi test set)
• predictions_wiki_{de|en}.output
• predictions_textberg_{de|en}.output
• predictions_blogs_{de|en}.output
• predictions_botlit_{de|en}.output
• predictions_model_wiki_test_textberg_{de|en}.output
• predictions_model_wiki_test_blogs_{de|en}.output
• predictions_model_wiki_test_botlit_{de|en}.output
• {de|en}_lat_referencedatabase.tsv
• json_data_wiki_{de|en}.json
• json_data_textberg_{de|en}.json
• json_data_blogs_{de|en}.json
• json_data_botlit_{de|en}.json
In the SCRIPTS folder, you can find all Python and bash scripts that have been used during training:
$ python3 get_subset_textberg.py -i ./../TextBerg/SAC/ -o ./subset_textberg_de.txt -g ./../resources/gazetteers/ -l de
$ python3 add_latin_abbreviations.py -i ./../resources/gazetteers/lat/lat_species.txt -o ./outfile.txt
$ python3 add_german_variants.py -i ./../resources/gazetteers/de/de_fam.txt -o ./outfile.txt
$ python3 add_compound_variants.py -i ./../resources/gazetteers/de/de_species.txt -o ./outfileGAZ.txt
$ python3 create_gazetteers.py -i ./../resources/gazetteers/de/de_species.txt -o outfile.txt
$ python3 add_variants_database.py -i ./../resources/gazetteers/lookup_table/de_lat_referencedatabase.tsv -o ./outfile
$ python3 get_wiki_fungi_testset.py -o ./outfile.txt -c Pilze -l de
$ python3 retrieve_wiki_sections.py -i ./../resources/gazetteers/lat/lat_species.txt -t ./outfile_trivialsections.txt -a outfile_wikiabstracts.txt -l de
$ python3 extracttaxa_cat_of_life -t ./colarchive/taxa/ -v ./colarchive/vernacular/ -l ./latin.out -d ./german.out -e ./english.out -r rest_vernacular.out
$ python3 tokenize_corpus.py -d ./raw_data/ -l de
$ python3 ./treetagger-python_miotto/pos_tag_corpus.py -d ./../resources/corpora/
$ python3 iobannotate_corpus_de.py -d ./../resources/corpora/training_corpora/de/ -v ./../resources/gazetteers/de/ -s ./../resources/gazetteers/lat/ -l de
$ python3 iobannotate_corpus_en.py -d ./../resources/corpora/training_corpora/en/ -v ./../resources/gazetteers/en/ -s ./../resources/gazetteers/lat/ -l en
$ python3 kfold_crossvalidation.py -d ./../resources/corpora/training_corpora/de/
$ bash bashscript_5foldtraining_preemb_en.sh
$ bash bashscript_5foldtraining_preemb_de.sh
$ python train_no_dev.py
$ python utils.py
$ python final_eval_kfold.py -d ./../../evaluation/baseline/model_baseline/ -o ./evaluation_files/
$ python evaluate_gold_silver.py -s ./../resources/corpora/gold_standard/de/alldata.test.fold1SILVER_de.txt -g ./../resources/corpora/gold_standard/de/combined.test.fold1GOLD_de.txt
$ python3 cross_dataset_evaluation.py -s ./silver_standard/plantblog_corpus.test.fold1.txt -t ./tagged_data/model_wiki_test_blog_f1_dropout5.tsv
$ python3 file_statistics.py -i ./../resources/corpora/training_corpora/de/
$ python3 transform_iob_to_sentences.py -i ./../resources/corpora/training_corpora/de/botlit_corpus_de.tok.pos.iob.txt -o botlit_sentences.txt
$ python3 entity_linker.py -i ./../resources/corpora/training_corpora/de/botlit_corpus_de.tok.pos.iob.txt -o ./json_file.json -f IOB -r ./../resources/gazetteers/lookup_table/de_lat_referencedatabase.tsv -l True
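To illustrate the idea behind the entity linking step, here is a minimal sketch that collects entity spans from IOB labels, looks each span up in a reference table and serializes the result as JSON. It is not the code of entity_linker.py. The two-column TSV layout (vernacular name, then scientific name), the label names, the JSON keys and all function names are assumptions made for this example.

```python
import csv
import io
import json
from typing import Dict, List, Optional, Tuple


def load_reference(tsv_text: str) -> Dict[str, str]:
    """Map lower-cased vernacular name to scientific name (assumed 2-column TSV)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {row[0].lower(): row[1] for row in reader if len(row) >= 2}


def extract_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Collect (surface form, label) spans from (token, IOB label) pairs."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = ""
    for token, tag in tagged + [("", "O")]:  # the sentinel flushes the last span
        if tag.startswith("I-") and tokens and tag[2:] == label:
            tokens.append(token)  # continue the open span
            continue
        if tokens:
            entities.append((" ".join(tokens), label))
            tokens, label = [], ""
        if tag.startswith(("B-", "I-")):
            tokens, label = [token], tag[2:]  # open a new span
    return entities


def link_entities(
    entities: List[Tuple[str, str]], reference: Dict[str, str]
) -> List[Dict[str, Optional[str]]]:
    """Attach the scientific name from the reference table, or None if unknown."""
    return [
        {"entity": surface, "label": label,
         "scientific_name": reference.get(surface.lower())}
        for surface, label in entities
    ]


# Hypothetical data; real input would come from the corpus and lookup-table files.
reference = load_reference("Rotbuche\tFagus sylvatica\n")
tagged = [("Die", "O"), ("Rotbuche", "B-de_species"), ("(", "O"),
          ("Fagus", "B-lat_species"), ("sylvatica", "I-lat_species"), (")", "O")]
entities = extract_entities(tagged)
linked = link_entities(entities, reference)
print(json.dumps(linked, ensure_ascii=False, indent=2))
```

In this sketch, spans without a match in the reference table keep a null scientific_name rather than being dropped, so unlinked mentions remain visible in the JSON output.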