Add BioEPIC demo notebook (#463)
caufieldjh authored Oct 9, 2024
2 parents 88ab6d5 + 308354f commit ae1fb64
Showing 1 changed file with 339 additions and 0 deletions.
339 changes: 339 additions & 0 deletions notebooks/BioEPIC_demo.ipynb
@@ -0,0 +1,339 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using OntoGPT for Environmental and Earth Science Data Extraction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Last updated Oct 9, 2024"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following examples demonstrate basic functionality of OntoGPT and the SPIRES method for extracting and integrating data (i.e., concepts and relationships) from texts in the environmental and earth science domains.\n",
"These examples assume use of the LBNL CBORG computing resource."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install ontogpt\n",
"import yaml\n",
"import pprint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a template for extracting EcoSIM terms"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"EcoSIM is on BioPortal here: https://bioportal.bioontology.org/ontologies/ECOSIM\n",
"\n",
"Let's build an extraction template to find *any* EcoSIM term, then refine."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"text = \\\n",
"\"\"\"id: http://w3id.org/ontogpt/ecosim_simple\n",
"name: ecosim_simple\n",
"title: Simple EcoSIM Extraction Template\n",
"description: >-\n",
" Simple EcoSIM Extraction Template\n",
"license: https://creativecommons.org/publicdomain/zero/1.0/\n",
"prefixes:\n",
" rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#\n",
" linkml: https://w3id.org/linkml/\n",
" ecosim_simple: http://w3id.org/ontogpt/ecosim_simple\n",
" ecosim: http://purl.obolibrary.org/obo/ecosim\n",
"\n",
"default_prefix: ecosim_simple\n",
"default_range: string\n",
"\n",
"imports:\n",
" - linkml:types\n",
" - core\n",
"\n",
"classes:\n",
" TermSet:\n",
" tree_root: true\n",
" is_a: NamedEntity\n",
" attributes:\n",
" terms:\n",
" range: Term\n",
" multivalued: true\n",
" description: >-\n",
" A semicolon-separated list of variables\n",
" for earth system simulation. Do not include\n",
" abbreviations in parentheses, e.g., \"Carbon (C)\"\n",
" should be represented as \"carbon\". Examples include:\\\n",
" carboxylation, sodium, underground irrigation.\n",
"\n",
" Term:\n",
" is_a: NamedEntity\n",
" annotations:\n",
" annotators: bioportal:ECOSIM\n",
" prompt: >-\n",
" The name of a variable for earth system simulation.\n",
"\"\"\"\n",
"with open('ecosim_simple.yaml', 'w') as outfile:\n",
" outfile.write(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's retrieve a particular set of methods descriptions from an ESS-DIVE entry."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget -O essdive_test.csv https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-212bbaf7e0d1597-20240919T163005378"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now a quick extraction (should take about a minute)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Replace the API key with your own\n",
"!export OPENAI_API_KEY=\"\" && ontogpt -vvv extract -t ecosim_simple.yaml -i essdive_test.csv -m lbl/cborg-chat:latest -o output.yaml --model-provider openai --api-base \"https://api.cborg.lbl.gov\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Read the output file\n",
"with open('output.yaml', 'r') as infile:\n",
"    output1 = yaml.safe_load(infile)\n",
"pprint.pprint(output1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Entities whose ID has the AUTO prefix were not matched to an ontology term\n",
"for entity in output1[\"named_entities\"]:\n",
"    if entity[\"id\"].split(\":\")[0] == \"AUTO\":\n",
"        print(\"NOT GROUNDED -> \" + entity[\"label\"])\n",
"    else:\n",
"        print(entity[\"id\"], entity[\"label\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's attempt to extract some more specific concepts and relationships.\n",
"We'll make a new template first."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"text = \\\n",
"\"\"\"id: http://w3id.org/ontogpt/ecosim_methods\n",
"name: ecosim_methods\n",
"title: EcoSIM Methods Extraction Template\n",
"description: >-\n",
"  EcoSIM Methods Extraction Template\n",
"license: https://creativecommons.org/publicdomain/zero/1.0/\n",
"prefixes:\n",
"  rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#\n",
"  linkml: https://w3id.org/linkml/\n",
"  ecosim_methods: http://w3id.org/ontogpt/ecosim_methods\n",
"  ecosim: http://purl.obolibrary.org/obo/ecosim\n",
"\n",
"default_prefix: ecosim_methods\n",
"default_range: string\n",
"\n",
"imports:\n",
"  - linkml:types\n",
"  - core\n",
"\n",
"classes:\n",
"  TermSet:\n",
"    tree_root: true\n",
"    is_a: NamedEntity\n",
"    attributes:\n",
"      locations:\n",
"        range: Location\n",
"        multivalued: true\n",
"        description: >-\n",
"          A semicolon-separated list of research locations.\n",
"          Examples include: Vermont, New York City,\n",
"          Ethiopia\n",
"      methods:\n",
"        range: Method\n",
"        multivalued: true\n",
"        description: >-\n",
"          A semicolon-separated list of methods used in\n",
"          environmental and earth science research. Examples\n",
"          include: sampling, spectroscopy\n",
"      variables:\n",
"        range: Variable\n",
"        multivalued: true\n",
"        description: >-\n",
"          A semicolon-separated list of variables measured in\n",
"          environmental and earth science research. Examples\n",
"          include: root shape, biomass, water turbidity\n",
"      equipments:\n",
"        range: Equipment\n",
"        multivalued: true\n",
"        description: >-\n",
"          A semicolon-separated list of equipment used in\n",
"          environmental and earth science research.\n",
"      equipment_to_variable_relationships:\n",
"        range: EquipmentMeasuresVariable\n",
"        description: >-\n",
"          A semicolon-separated list of relationships\n",
"          between specific equipment and the variables\n",
"          it is used to measure, as described in the input.\n",
"          Example: NMR spectrometer was used to measure\n",
"          chemical content\n",
"        multivalued: true\n",
"        inlined: true\n",
"\n",
"  Location:\n",
"    is_a: NamedEntity\n",
"    annotations:\n",
"      prompt: >-\n",
"        The name of a location used in research.\n",
"\n",
"  Method:\n",
"    is_a: NamedEntity\n",
"    annotations:\n",
"      annotators: bioportal:ECOSIM\n",
"      prompt: >-\n",
"        The name of a method used in environmental and\n",
"        earth science research.\n",
"\n",
"  Variable:\n",
"    is_a: NamedEntity\n",
"    annotations:\n",
"      annotators: bioportal:ECOSIM\n",
"      prompt: >-\n",
"        The name of a variable measured in environmental and\n",
"        earth science research.\n",
"\n",
"  Equipment:\n",
"    is_a: NamedEntity\n",
"    annotations:\n",
"      prompt: >-\n",
"        The name of a piece of equipment used in\n",
"        environmental and earth science research.\n",
"\n",
"  EquipmentMeasuresVariable:\n",
"    is_a: CompoundExpression\n",
"    attributes:\n",
"      equipment:\n",
"        range: Equipment\n",
"        description: Name of the equipment used to measure a variable.\n",
"      variable:\n",
"        range: Variable\n",
"        description: Name of the variable being measured.\n",
"\n",
"\"\"\"\n",
"with open('ecosim_methods.yaml', 'w') as outfile:\n",
"    outfile.write(text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Replace the API key with your own\n",
"!export OPENAI_API_KEY=\"\" && ontogpt -vvv extract -t ecosim_methods.yaml -i essdive_test.csv -m lbl/cborg-chat:latest -o output.yaml --model-provider openai --api-base \"https://api.cborg.lbl.gov\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Read the output file\n",
"with open('output.yaml', 'r') as infile:\n",
"    output1 = yaml.safe_load(infile)\n",
"pprint.pprint(output1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
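
The notebook's grounding check prints one line per entity. As a possible follow-up, the sketch below partitions a `named_entities` list into grounded and ungrounded entries and counts entities per ID prefix. It assumes, as the notebook's loop does, that ungrounded entities carry the `AUTO` prefix. The helper names `split_by_grounding` and `prefix_counts` and the sample records are hypothetical and are not part of OntoGPT.

```python
from collections import Counter


def split_by_grounding(named_entities, ungrounded_prefix="AUTO"):
    """Partition entities into (grounded, ungrounded) by the prefix of their ID."""
    grounded, ungrounded = [], []
    for entity in named_entities:
        prefix = str(entity.get("id", "")).split(":", 1)[0]
        if prefix == ungrounded_prefix:
            ungrounded.append(entity)
        else:
            grounded.append(entity)
    return grounded, ungrounded


def prefix_counts(named_entities):
    """Count entities per ID prefix (the part before the first colon)."""
    return Counter(str(e.get("id", "")).split(":", 1)[0] for e in named_entities)


# Hypothetical records shaped like the "named_entities" list in the notebook;
# these identifiers and labels are invented for illustration only.
sample = [
    {"id": "ECOSIM:0000001", "label": "carboxylation"},
    {"id": "AUTO:soil%20core", "label": "soil core"},
    {"id": "ECOSIM:0000002", "label": "sodium"},
]

grounded, ungrounded = split_by_grounding(sample)
print(f"{len(grounded)} grounded, {len(ungrounded)} not grounded")
print(dict(prefix_counts(sample)))
```

With real output, `sample` would be replaced by `output1["named_entities"]`. A large share of ungrounded entities may suggest that the template's prompts or annotators need refining.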
