-
Notifications
You must be signed in to change notification settings - Fork 86
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
339 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,339 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Using OntoGPT for Evironmental and Earth Science Data Extraction" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Last updated Oct 9, 2024" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The following examples demonstrate basic functionality of OntoGPT and the SPIRES method for extracting and integrating data (i.e., concepts and relationships) from texts in the envrionmental and earth science domains.\n", | ||
"These examples assume use of the LBNL CBORG computing resource." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Setup" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%pip install ontogpt\n", | ||
"import yaml\n", | ||
"import pprint" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Creating a template for extracting ECOSIM terms" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"EcoSIM is on BioPortal here: https://bioportal.bioontology.org/ontologies/ECOSIM\n", | ||
"\n", | ||
"Let's build an extraction template to find *any* EcoSIM term, then refine." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 28, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"text = \\\n", | ||
"\"\"\"id: http://w3id.org/ontogpt/ecosim_simple\n", | ||
"name: ecosim_simple\n", | ||
"title: Simple EcoSIM Extraction Template\n", | ||
"description: >-\n", | ||
" Simple EcoSIM Extraction Template\n", | ||
"license: https://creativecommons.org/publicdomain/zero/1.0/\n", | ||
"prefixes:\n", | ||
" rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#\n", | ||
" linkml: https://w3id.org/linkml/\n", | ||
" ecosim_simple: http://w3id.org/ontogpt/ecosim_simple\n", | ||
" ecosim: http://purl.obolibrary.org/obo/ecosim\n", | ||
"\n", | ||
"default_prefix: ecosim_simple\n", | ||
"default_range: string\n", | ||
"\n", | ||
"imports:\n", | ||
" - linkml:types\n", | ||
" - core\n", | ||
"\n", | ||
"classes:\n", | ||
" TermSet:\n", | ||
" tree_root: true\n", | ||
" is_a: NamedEntity\n", | ||
" attributes:\n", | ||
" terms:\n", | ||
" range: Term\n", | ||
" multivalued: true\n", | ||
" description: >-\n", | ||
" A semicolon-separated list of variables\n", | ||
" for earth system simulation. Do not include\n", | ||
" abbreviations in parentheses, e.g., \"Carbon (C)\"\n", | ||
" should be represented as \"carbon\". Examples include:\\\n", | ||
" carboxylation, sodium, underground irrigation.\n", | ||
"\n", | ||
" Term:\n", | ||
" is_a: NamedEntity\n", | ||
" annotations:\n", | ||
" annotators: bioportal:ECOSIM\n", | ||
" prompt: >-\n", | ||
" The name of a variable for earth system simulation.\n", | ||
"\"\"\"\n", | ||
"with open('ecosim_simple.yaml', 'w') as outfile:\n", | ||
" outfile.write(text)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Let's retrieve a particular set of methods descriptions from an ESS-DIVE entry." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!wget -O essdive_test.csv https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-212bbaf7e0d1597-20240919T163005378" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now a quick extraction (should take about a minute)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Replace the API key with your own\n", | ||
"!export OPENAI_API_KEY=\"\" && ontogpt -vvv extract -t ecosim_simple.yaml -i essdive_test.csv -m lbl/cborg-chat:latest -o output.yaml --model-provider openai --api-base \"https://api.cborg.lbl.gov\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Read the output file\n", | ||
"with open('output.yaml', 'r') as infile:\n", | ||
" output1 = yaml.safe_load(infile)\n", | ||
"pprint.pprint(output1)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"for entity in output1[\"named_entities\"]:\n", | ||
" if ((entity[\"id\"]).split(\":\"))[0] in [\"AUTO\"]:\n", | ||
" print(\"NOT GROUNDED -> \" + entity[\"label\"])\n", | ||
" else:\n", | ||
" print(entity[\"id\"], entity[\"label\"])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now let's attempt to extract some more specific concepts and relationships.\n", | ||
"We'll make a new template first." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 36, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"text = \\\n", | ||
"\"\"\"id: http://w3id.org/ontogpt/ecosim_methods\n", | ||
"name: ecosim_methods\n", | ||
"title: EcoSIM Methods Extraction Template\n", | ||
"description: >-\n", | ||
" EcoSIM Methods Extraction Template\n", | ||
"license: https://creativecommons.org/publicdomain/zero/1.0/\n", | ||
"prefixes:\n", | ||
" rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#\n", | ||
" linkml: https://w3id.org/linkml/\n", | ||
" ecosim_simple: http://w3id.org/ontogpt/ecosim_simple\n", | ||
" ecosim: http://purl.obolibrary.org/obo/ecosim\n", | ||
"\n", | ||
"default_prefix: ecosim_methods\n", | ||
"default_range: string\n", | ||
"\n", | ||
"imports:\n", | ||
" - linkml:types\n", | ||
" - core\n", | ||
"\n", | ||
"classes:\n", | ||
" TermSet:\n", | ||
" tree_root: true\n", | ||
" is_a: NamedEntity\n", | ||
" attributes:\n", | ||
" locations:\n", | ||
" range: Location\n", | ||
" multivalued: true\n", | ||
" description: >-\n", | ||
" A semicolon-separated list of research locations.\n", | ||
" Examples include: Vermont, New York City,\n", | ||
" Ethiopia\n", | ||
" methods:\n", | ||
" range: Method\n", | ||
" multivalued: true\n", | ||
" description: >-\n", | ||
" A semicolon-separated list of methods used in\n", | ||
" environmental and earth science research. Examples\n", | ||
" include: sampling, spectroscopy\n", | ||
" variables:\n", | ||
" range: Variable\n", | ||
" description: >-\n", | ||
" A semicolon-separated list of variables measured in\n", | ||
" environmental and earth science research. Examples\n", | ||
" include: root shape, biomass, water turbidity\n", | ||
" equipments:\n", | ||
" range: Equipment\n", | ||
" description: >-\n", | ||
" A semicolon-separated list of equipment used in\n", | ||
" environmental and earth science research.\n", | ||
" equipment_to_variable_relationships:\n", | ||
" range: EquipmentMeasuresVariable\n", | ||
" description: >-\n", | ||
" A semicolon separated list of relationships\n", | ||
" between specific equipment and variables\n", | ||
" they are used to measure as described in the input.\n", | ||
" Example: NMR spectrometer was used to measure\n", | ||
" chemical content\n", | ||
" multivalued: true\n", | ||
" inlined: true\n", | ||
"\n", | ||
" Location:\n", | ||
" is_a: NamedEntity\n", | ||
" annotations:\n", | ||
" prompt: >-\n", | ||
" The name of a location used in research.\n", | ||
"\n", | ||
" Method:\n", | ||
" is_a: NamedEntity\n", | ||
" annotations:\n", | ||
" annotators: bioportal:ECOSIM\n", | ||
" prompt: >-\n", | ||
" The name of a method used in environment and\n", | ||
" earth science research.\n", | ||
"\n", | ||
" Variable:\n", | ||
" is_a: NamedEntity\n", | ||
" annotations:\n", | ||
" annotators: bioportal:ECOSIM\n", | ||
" prompt: >-\n", | ||
" The name of a variable measured in environment and\n", | ||
" earth science research.\n", | ||
"\n", | ||
" Equipment:\n", | ||
" is_a: NamedEntity\n", | ||
" annotations:\n", | ||
" prompt: >-\n", | ||
" The name of a piece of equipment used in\n", | ||
" environment and earth science research.\n", | ||
"\n", | ||
" EquipmentMeasuresVariable:\n", | ||
" is_a: CompoundExpression\n", | ||
" attributes:\n", | ||
" equipment:\n", | ||
" range: Equipment\n", | ||
" description: Name of the equipment used to measure a variable.\n", | ||
" variable:\n", | ||
" range: Variable\n", | ||
" description: Name of the variable being measured.\n", | ||
"\n", | ||
"\"\"\"\n", | ||
"with open('ecosim_methods.yaml', 'w') as outfile:\n", | ||
" outfile.write(text)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Replace the API key with your own\n", | ||
"!export OPENAI_API_KEY=\"\" && ontogpt -vvv extract -t ecosim_methods.yaml -i essdive_test.csv -m lbl/cborg-chat:latest -o output.yaml --model-provider openai --api-base \"https://api.cborg.lbl.gov\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Read the output file\n", | ||
"with open('output.yaml', 'r') as infile:\n", | ||
" output1 = yaml.safe_load(infile)\n", | ||
"pprint.pprint(output1)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": ".venv", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.13" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |