
Identify and summarize biological strategies contained in research papers. #1

Open
bruffridge opened this issue May 31, 2022 · 34 comments
Comments

@bruffridge (Member) commented May 31, 2022

The goal is to identify and summarize biological strategies found in research papers or other data sources, then abstract them into design strategies.

A biological strategy is a characteristic, mechanism, or process that an organism or ecosystem exhibits to accomplish a particular purpose or function within a particular context or conditions.

The main elements of a biological strategy are:

  • The organism or ecosystem
  • The part of the organism
  • Function (what it does or accomplishes)
  • Mechanism (how it does it)
  • Context (environment, conditions, constraints, stressors)

An example biological strategy:

Strategy: The harbor seal’s whiskers possess a specialized undulated surface structure that reduces vortex-induced vibrations as the whiskers move through water.
Organism: harbor seal
Part of: Whiskers
Function: reduce vibrations
Mechanism: a specialized undulated surface structure
Context: Moving through water
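The five elements above map naturally onto a simple record type. A minimal sketch in Python (the class and field names are my own, not an established schema):

```python
from dataclasses import dataclass

@dataclass
class BiologicalStrategy:
    """One extracted strategy: who does what, how, and under which conditions."""
    organism: str   # the organism or ecosystem
    part: str       # the part of the organism
    function: str   # what it does or accomplishes
    mechanism: str  # how it does it
    context: str    # environment, conditions, constraints, stressors

# The harbor seal example encoded as a record:
seal_whiskers = BiologicalStrategy(
    organism="harbor seal",
    part="whiskers",
    function="reduce vibrations",
    mechanism="a specialized undulated surface structure",
    context="moving through water",
)
```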

A bio-inspired design strategy is a statement that articulates the function, mechanism, and context without using biological terms. Instead, biological terms are replaced with discipline-neutral synonyms (e.g. replace “fur” with “fibers,” or “skin” with “membrane”).

An example design strategy:

Strategy: While moving through a liquid, a small diameter fiber with an undulated surface structure reduces vortex-induced vibrations.

Inputs: raw text (such as the title and abstract of a biology journal article) #13
Outputs: Biological strategy, design strategy, Organism, Part of, Function, Mechanism, Context.

Early evaluations of large language models such as GPT-3 Davinci have shown promise for generating these outputs given a proper prompt and a few (1–3) training examples.
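A few-shot prompt of the kind described could be assembled as follows (a sketch only: the instruction wording and field layout are illustrative, and the resulting string would be sent to a model such as GPT-3 through the provider's completion API):

```python
def build_extraction_prompt(examples, new_text):
    """Assemble a few-shot prompt: labeled examples followed by the new passage."""
    parts = ["Extract the biological strategy from each passage.\n"]
    for text, fields in examples:
        parts.append(f"Passage: {text}")
        for name, value in fields.items():
            parts.append(f"{name}: {value}")
        parts.append("")  # blank line between examples
    parts.append(f"Passage: {new_text}")
    parts.append("Organism:")  # cue the model to continue the labeled fields
    return "\n".join(parts)

examples = [(
    "The harbor seal's whiskers possess a specialized undulated surface "
    "structure that reduces vortex-induced vibrations as the whiskers move "
    "through water.",
    {"Organism": "harbor seal", "Part of": "whiskers",
     "Function": "reduce vibrations",
     "Mechanism": "a specialized undulated surface structure",
     "Context": "moving through water"},
)]

prompt = build_extraction_prompt(examples, "Gecko feet adhere to smooth walls.")
```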

AskNature has manually curated a list of biological strategies on its website, based on research papers. These may be helpful in training a machine learning model. #12

Open source large language models

  • OPT
  • GPT-J, GPT-NeoX-20B
  • Bloom

Commercial large language models

  • OpenAI GPT-3

Alternatives to large language models for text summarization (may not work as well)
https://aws.amazon.com/blogs/machine-learning/part-1-set-up-a-text-summarization-project-with-hugging-face-transformers/
https://www.projectpro.io/article/transformers-bart-model-explained/553#mcetoc_1fq07mh0qa
https://paperswithcode.com/sota/text-summarization-on-gigaword

bruffridge transferred this issue from nasa-petal/PeTaL-labeller on Jun 13, 2022
@bruffridge (Member Author)

Perhaps another way to think about the problem is named entity recognition; part-of-speech tagging may also be helpful. "a specialized undulated surface structure" - form, "reduce vibrations" - function, "moving through water" - context.

@bruffridge (Member Author) commented Jun 14, 2022

@bruffridge (Member Author)

Since multiple people will be working on this issue, it may be helpful to create different branches in the bio-strategy-extractor repository to track code and results for different evaluated methods. Also, please coordinate efforts so different approaches to solving the problem can be explored and results compared.

@bruffridge (Member Author)

A colleague just informed me of a paper entitled "Categorizing biological information based on function–morphology for bioinspired conceptual design". I uploaded it to the Literature folder in Box. Please review it for potential application to this problem.

@bruffridge (Member Author)

A Question Answering model may be another approach worth looking at.
https://huggingface.co/models?pipeline_tag=question-answering&sort=downloads
https://paperswithcode.com/task/question-answering
https://github.com/sebastianruder/NLP-progress/blob/master/english/question_answering.md

For example, check out the results from this QA model when asked "What is the primary function?" and "What reduces vibrations?"

@rishub-tamirisa (Member) commented Jun 15, 2022

This paper, "Rhetorical Sentence Categorization for Scientific Paper using Word2Vec Semantic Representation", seems interesting. To enable a model to actually locate which parts of a paper or abstract describe biomimetic function, we could hand-annotate a few known sentences and measure Word2Vec cosine similarity between the labeled and unlabeled sentences (comparing the averages of the word vectors in each sentence).

I'm not sure if this would work well, but the reason something like this might be useful is that summarizing an abstract may not always automatically identify the specific biomimetic function(s) described.

> Perhaps another way to think about the problem is named entity recognition; part-of-speech tagging may also be helpful. "a specialized undulated surface structure" - form, "reduce vibrations" - function, "moving through water" - context.

I agree that this could work well, probably by fine-tuning an existing NER model, but the challenge would be to create the training data.
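The cosine-similarity comparison described above can be sketched with plain NumPy. The word vectors here are random stand-ins for trained Word2Vec embeddings, so the score itself is meaningless; the point is the averaging-and-comparing mechanics:

```python
import numpy as np

def sentence_vector(sentence, word_vectors):
    """Average the vectors of the words we have embeddings for."""
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in embeddings; a real system would load trained Word2Vec vectors.
rng = np.random.default_rng(0)
vocab = "whiskers reduce vibrations water seals swim fast".split()
word_vectors = {w: rng.normal(size=50) for w in vocab}

labeled = "whiskers reduce vibrations"          # hand-annotated function sentence
candidate = "whiskers reduce vibrations water"  # unlabeled sentence to score
score = cosine_similarity(
    sentence_vector(labeled, word_vectors),
    sentence_vector(candidate, word_vectors),
)
```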

@hschilling
Rishub, for getting the training data, we do have access to Amazon Ground Truth, a labeling service, if that helps

@bruffridge (Member Author)

Semantic Role Labelling might be useful for extracting Who (form), What (function), Where (context):
https://paperswithcode.com/task/semantic-role-labeling
https://nlpprogress.com/english/semantic_role_labeling.html
https://web.stanford.edu/~jurafsky/slp3/19.pdf

@abalai-ash (Contributor) commented Jun 16, 2022

This survey of summarization techniques may be of use:

A couple of papers worth looking into:

More papers on fine-tuning:

This might help with what I need to do. Not sure if this will be useful to anybody else:

@rishub-tamirisa (Member)

https://arxiv.org/pdf/2106.01592v1.pdf : Biomimicry AI overview

@rishub-tamirisa (Member) commented Jun 16, 2022

@abalai-ash (Contributor)

Steps for an NLP pipeline that we could implement after further literature research:

  1. Sentence segmentation: break the given paragraph into separate sentences.
  2. Word tokenization: extract the words from each sentence.
  3. Part-of-speech tagging: identify each word's part of speech.
  4. Lemmatization: reduce each word to its base form, so that e.g. "germs" and "germ" are treated as the same term.
  5. Stop-word identification: English has many filler words that appear very frequently and introduce noise.
  6. Dependency parsing: use grammatical rules to work out how the words relate to one another.
  7. Entity analysis: identify all of the important words or “entities” in the text.
  8. Pronoun resolution: track what each pronoun refers to in the context of the sentence.
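The first few steps can be illustrated without any NLP library (a toy sketch; a real pipeline would use spaCy or NLTK, and the stop-word list here is deliberately tiny):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "as", "that", "through"}

def segment_sentences(text):
    """Step 1: naive sentence segmentation on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Step 2: word tokenization (lowercased, punctuation stripped)."""
    return re.findall(r"[a-z]+", sentence.lower())

def remove_stop_words(tokens):
    """Step 5: drop high-frequency filler words."""
    return [t for t in tokens if t not in STOP_WORDS]

text = "The seal's whiskers reduce vibrations. They move through water."
sentences = segment_sentences(text)
tokens = [remove_stop_words(tokenize(s)) for s in sentences]
```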

@bruffridge (Member Author) commented Jun 17, 2022

@rishub-tamirisa Good find on the biomimicry function identification paper. Here's a list of subsequent papers that cited this one, many of which appear to be relevant. https://www.lens.org/lens/scholar/article/088-258-820-290-519/citations/citing

Here's one in particular that looks interesting: http://ceur-ws.org/Vol-2831/paper4.pdf

@rishub-tamirisa (Member)

Thanks. That paper you linked does look interesting. "The preliminary results indicate that the ability to add ontologies to IBID allows it to extract meaning from new documents." I'm definitely going to take a look at the rest of it.

@bruffridge (Member Author)

[Screenshot of a figure from Nagel's paper]
From Nagel: https://www.mdpi.com/2411-9660/2/4/47/htm

@rishub-tamirisa (Member)

https://arxiv.org/pdf/1909.07755.pdf : SpERT Paper

@bruffridge (Member Author)

It may be easier to focus initially on one function, then expand the pipeline to include other functions. For example, take the function "modify/convert thermal energy". The problem then becomes identifying sections of text that describe managing thermal energy. Next comes identifying the "how": what the text describes as the mechanism responsible for managing thermal energy. Eventually we may want to classify these various "hows", or strategies, into different categories (form, material, structure, process, or system).
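A crude first pass at the single-function idea might be a keyword filter over sentences (the keyword list is my own guess, and a real version would use a trained classifier):

```python
THERMAL_KEYWORDS = {"thermal", "heat", "temperature", "cooling", "warming", "insulation"}

def mentions_thermal_energy(sentence):
    """Flag sentences that plausibly describe managing thermal energy."""
    words = set(sentence.lower().replace(",", " ").replace(".", " ").split())
    return bool(words & THERMAL_KEYWORDS)

sentences = [
    "Reindeer fur provides insulation of prime quality.",
    "The whiskers reduce vortex-induced vibrations.",
]
hits = [s for s in sentences if mentions_thermal_energy(s)]
```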

@abalai-ash (Contributor)

I will look at more methods for unsupervised entity recognition. I will also look for DANE tutorials, although searching mostly turns up DaNLP or DaNE results, which isn't what the paper was discussing.

@rishub-tamirisa (Member) commented Jul 1, 2022

Just uploaded a notebook that shows some results from SciBERT-FOBIE; see #3.

@rblumin24 commented Jul 1, 2022

https://academic.oup.com/nar/article/43/W1/W535/2467892?login=true
https://academic.oup.com/nar/article/36/suppl_2/W399/2506595?login=false
https://academic.oup.com/bioinformatics/article/27/19/2721/231031

PolySearch is a classification method that BioNER used as a reference, so I looked into it and it seems pretty helpful. I'm still looking for code, though.

OrganismTagger is another classifier used by the creators of BioNER, and it categorizes biomedical words, which I thought could be helpful. Again, I have only been able to find articles, not code at the moment.

@bruffridge (Member Author)

Georgia Institute of Technology has been researching the use of NLP to build Structure–Behavior–Function models from text.

IBID: https://dilab.gatech.edu/ibid/ (ongoing)
DANE: http://dilab.cc.gatech.edu/dane/ (past)

@rishub-tamirisa (Member)

I just re-trained SciBERT-FOBIE on a cleaned version of the dataset. ( #6 )

However, token imbalance is still a big issue.

@rblumin24
Here's some more information on OrganismTagger: https://www.semanticsoftware.info/system/files/orgtagger-1.3a.pdf
And here's the BioNER repo: https://github.com/phil1995/BioNER

@rishub-tamirisa (Member)

https://staff.science.uva.nl/c.monz/ltl/publications/mtsummit2017.pdf (Fine-tuning for translation models)

@rishub-tamirisa (Member)

@bruffridge I originally thought that HuggingFace allowed you to train a summarization model on top of any existing language model, but I realized that since models like SciBERT and BERT are encoder-only, you still need to train the summarization decoder from scratch. There are some proposed methods for using pretrained encoders for summarization, but none are generally implemented in existing APIs like HuggingFace, so I would need to fork from one of these papers or implement it on my own. I'm still reading through "COVIDSum: A linguistically enriched SciBERT-based summarization model for COVID-19 scientific papers" and "Text Summarization with Pretrained Encoders" to get a better idea of how exactly it works. People may also have uploaded existing academic-paper summarizers to HuggingFace, so I will check that as well.

In the meantime, since using SciBERT directly for summarization is non-trivial, I could still test the AskNature data on a more general English-language summarization model, like BertSum or T5. The results likely won't be as good because of the number of low-frequency/rare terms present, but we can still see how well it works.

@rishub-tamirisa (Member)

@bruffridge Google Colab works much better, thanks for the suggestion.

Model is currently training: Colab link

@hschilling I might not need to use AWS for this task after all! Although I'm still interested in getting it set up to learn more about how it works, if that's possible. After this model trains, I'd like to either try and implement the paper in the above comment for using SciBERT for text summarization, or fine-tune SciBERT as a language model directly on our existing corpora of biomimicry papers in Box. These might be more compute-intensive, in which case AWS resources might work better than Colab, which has 16 GB memory in the free version.

@hschilling
@rishub-tamirisa ok, I did already request access to AWS for you

@bruffridge (Member Author) commented Aug 3, 2022

@bruffridge (Member Author) commented Aug 3, 2022

A few potentially useful datasets for Scientific NER:

@bruffridge (Member Author) commented Aug 3, 2022

This abstract contains two separate actor (species) / function (what) / mechanism (how) / context sets, which I've labelled below. We may need to build an annotated dataset to train and evaluate models on this task. With coreference resolution it may be possible to link the actor of the first set, "birds", back to "Poorwills".

Compared to mammals, there are relatively few studies examining heterothermy in birds. In 13 bird families known to contain heterothermic species, the common poorwill (Phalaenoptilus nuttallii) is the only species that ostensibly hibernates. We used temperature-sensitive radio-transmitters to collect roost and skin temperature (Tskin) data, and winter roost preferences for free-ranging poorwills in southern Arizona. Further, to determine the effect of passive rewarming on torpor bout duration and {active rewarming}[what-1] (i.e., {the use of metabolic heat to increase Tskin}[how-1]), we experimentally shaded seven {birds}[actor-1] {during winter}[context-1] to prevent them from passively rewarming via solar radiation. {Poorwills}[actor-2] {selected winter roosts that were open to the south or southwest}[how-2], facilitating {passive solar warming}[what-2] {in the late afternoon}[context-2]. Shaded birds actively rewarmed following at least 3 days of continuous torpor. Average torpor bout duration by shaded birds was 122 h and ranged from 91 to 164 h. Active rewarming by shaded birds occurred on significantly warmer days than those when poorwills remained torpid. One shaded bird remained inactive for 45 days, during which it spontaneously rewarmed actively on eight occasions. Our findings show that during winter poorwills exhibit physiological patterns and active rewarming similar to hibernating mammals.
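The inline {span}[label] annotations used above can be pulled out with a small regular expression (a sketch that handles flat spans only; nested sub-spans would need a proper parser):

```python
import re

# Matches {span text}[label] where label is e.g. actor-1, how-2, what, context-2.
SPAN_RE = re.compile(r"\{([^{}]+)\}\[([a-z]+-?\d*)\]")

def extract_spans(annotated_text):
    """Return (label, span_text) pairs from {span}[label] markup."""
    return [(label, span) for span, label in SPAN_RE.findall(annotated_text)]

sample = ("{Poorwills}[actor-2] {selected winter roosts that were open to the "
          "south or southwest}[how-2], facilitating {passive solar warming}[what-2] "
          "{in the late afternoon}[context-2].")
spans = extract_spans(sample)
```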

@bruffridge (Member Author) commented Aug 3, 2022

Another example. This one includes a sub-span for context within 'what'. Eventually we may want to break these spans down further into sub-spans. For example, one 'what' into two: 'regulate body temperature' and 'regulate brain temperature'; one 'how' into three: 'panting through the nose', 'panting through the mouth', and 'selective brain cooling'.

Reindeer (Rangifer tarandus) are protected against the Arctic winter cold by thick fur of prime insulating capacity and hence have few avenues of heat loss during work. We have investigated how these animals regulate brain temperature under heavy heat loads. Animals were instrumented for measurements of blood flow, tissue temperatures and respiratory frequency (f) under full anaesthesia, whereas measurements were also made in fully conscious animals while in a climatic chamber or running on a treadmill. At rest, brain temperature (Tbrain) rose from 38.5±0.1°C at 10°C to 39.5±0.2°C at 50°C, while f increased from ×7 to ×250 breaths min–1, with a change to open-mouth panting (OMP) at Tbrain 39.0±0.1°C, and carotid and sublingual arterial flows increased by 160% and 500%, respectively. OMP caused jugular venous and carotid arterial temperatures to drop, presumably owing to a much increased respiratory evaporative heat loss. Angular oculi vein (AOV) flow was negligible until Tbrain reached 38.9±0.1°C, but it increased to 0.81 ml min–1 kg–1 at Tbrain 39.2±0.2°C. Bilateral occlusion of both AOVs induced OMP and a rise in Tbrain and f at Tbrain >38.8°C. We propose that {reindeer}[actor] {regulate body and, particularly, brain temperature {under heavy heat loads}[context]}[what] by {a combination of panting, at first through the nose, but later, when the heat load and the minute volume requirements increase due to exercise, primarily through the mouth and that they eventually resort to selective brain cooling}[how].
