Medicinal plants have demonstrated therapeutic potential for a wide range of observable characteristics of the human body, called ``phenotypes'', in clinical treatment over thousands of years. As interest in medicinal plants has grown, many researchers have tried to extract meaningful information from the accumulated literature by identifying relationships between plants and phenotypes. Although natural language processing (NLP) techniques aim to extract useful information from unstructured text, there has been no appropriate corpus for training and evaluating NLP models for plants and phenotypes. We therefore present the Plant-Phenotype Relationship (PPR) corpus, a high-quality resource to support the development of various NLP applications. It consists of 600 PubMed abstracts annotated with 5,668 plant entities, 11,282 phenotype entities, and a total of 9,709 relationships. We also report benchmark results from named entity recognition and relation extraction systems to verify the quality of our data and to show that these tasks achieve significant performance on the PPR test set.
Corresponding author: Hyunju Lee ([email protected])