Commit
* Implement CoNECo dataset
* fix style
* fix issues raised in comments
* fix licence
* fix citetion in README
* fix flake8
* filter out of scope entities

---------

Co-authored-by: Oğuz Şerbetçi <[email protected]>
1 parent 097f93d, commit b15412c
Showing 3 changed files with 857 additions and 0 deletions.
@@ -0,0 +1,57 @@
---
language:
- en
bigbio_language:
- English
license: cc-by-4.0
bigbio_license_shortname: CC_BY_4p0
multilinguality: monolingual
pretty_name: CoNECo
homepage: https://zenodo.org/records/11263147
bigbio_pubmed: false
bigbio_public: true
bigbio_tasks:
- NAMED_ENTITY_RECOGNITION
- NAMED_ENTITY_DISAMBIGUATION
paperswithcode_id: coneco
---


# Dataset Card for CoNECo

## Dataset Description

- **Homepage:** https://zenodo.org/records/11263147
- **Pubmed:** False
- **Public:** True
- **Tasks:** NER, NEN

Complex Named Entity Corpus (CoNECo) is an annotated corpus for NER and NEN of protein-containing complexes. CoNECo comprises 1,621 documents with 2,052 entities, 1,976 of which are normalized to Gene Ontology. We divided the corpus into training, development, and test sets.

## Citation Information

```
@article{10.1093/bioadv/vbae116,
author = {Nastou, Katerina and Koutrouli, Mikaela and Pyysalo, Sampo and Jensen, Lars Juhl},
title = "{CoNECo: A Corpus for Named Entity Recognition and Normalization of Protein Complexes}",
journal = {Bioinformatics Advances},
pages = {vbae116},
year = {2024},
month = {08},
abstract = "{Despite significant progress in biomedical information extraction, there is a lack of resources \
for Named Entity Recognition (NER) and Normalization (NEN) of protein-containing complexes. Current resources \
inadequately address the recognition of protein-containing complex names across different organisms, underscoring \
the crucial need for a dedicated corpus. We introduce the Complex Named Entity Corpus (CoNECo), an annotated \
corpus for NER and NEN of complexes. CoNECo comprises 1,621 documents with 2,052 entities, 1,976 of which are \
normalized to Gene Ontology. We divided the corpus into training, development, and test sets and trained both a \
transformer-based and dictionary-based tagger on them. Evaluation on the test set demonstrated robust performance, \
with F-scores of 73.7\\% and 61.2\\%, respectively. Subsequently, we applied the best taggers for comprehensive \
tagging of the entire openly accessible biomedical literature. All resources, including the annotated corpus, \
training data, and code, are available to the community through Zenodo https://zenodo.org/records/11263147 and \
GitHub https://zenodo.org/records/10693653.}",
issn = {2635-0041},
doi = {10.1093/bioadv/vbae116},
url = {https://doi.org/10.1093/bioadv/vbae116},
eprint = {https://academic.oup.com/bioinformaticsadvances/advance-article-pdf/doi/10.1093/bioadv/vbae116/58869902/vbae116.pdf},
}
```
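
The dataset card separates recognized entities (NER) from the subset that is also normalized to Gene Ontology (NEN). The sketch below is not part of this commit. It illustrates that distinction on a hypothetical toy document laid out in the style of a BigBio knowledge-base record. The sentence, the IDs, the `Complex` type label, and the `GO` database name are assumptions made for illustration, and the helpers `count_entities` and `offsets_match` are hypothetical names, not functions from the loader added here.

```python
from typing import Dict, List, Tuple

# Hypothetical toy document in a BigBio-kb-style layout. The text, IDs, type
# label, and "GO" database name are illustrative assumptions, not CoNECo records.
TEXT = "The ribosome binds the exosome."
toy_docs: List[Dict] = [
    {
        "document_id": "toy-1",
        "passages": [
            {
                "id": "toy-1-p0",
                "type": "abstract",
                "text": [TEXT],
                "offsets": [[0, len(TEXT)]],
            }
        ],
        "entities": [
            {
                "id": "toy-1-e0",
                "type": "Complex",
                "text": ["ribosome"],
                "offsets": [[4, 12]],
                # NEN: the mention is linked to a Gene Ontology term.
                "normalized": [{"db_name": "GO", "db_id": "GO:0005840"}],
            },
            {
                "id": "toy-1-e1",
                "type": "Complex",
                "text": ["exosome"],
                "offsets": [[23, 30]],
                # NER only: recognized, but deliberately left unnormalized here.
                "normalized": [],
            },
        ],
    }
]


def count_entities(docs: List[Dict]) -> Tuple[int, int]:
    """Return (total entity mentions, mentions with at least one normalization)."""
    total = 0
    normalized = 0
    for doc in docs:
        for entity in doc["entities"]:
            total += 1
            if entity["normalized"]:
                normalized += 1
    return total, normalized


def offsets_match(doc: Dict) -> bool:
    """Check that each entity's offsets slice out its stated surface text.

    Assumes a single passage whose text starts at document offset 0.
    """
    full_text = doc["passages"][0]["text"][0]
    return all(
        full_text[start:end] == surface
        for entity in doc["entities"]
        for (start, end), surface in zip(entity["offsets"], entity["text"])
    )


total, normalized = count_entities(toy_docs)
print(f"{normalized} of {total} toy entities are normalized")
```

Run over the real corpus, a count like this would be expected to agree with the figures quoted in the card (2,052 entities, 1,976 of them normalized), provided the loaded records follow the layout assumed above.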