
Identify and summarize biological strategies contained in research papers. #1

Open
bruffridge opened this issue May 31, 2022 · 34 comments
Comments

@bruffridge (Member) commented May 31, 2022

The goal is to identify and summarize biological strategies found in research papers or other data sources, then abstract them into design strategies.

A biological strategy is a characteristic, mechanism, or process that an organism or ecosystem exhibits to accomplish a particular purpose or function within a particular context or conditions.

The main elements of a biological strategy are:

  • The organism or ecosystem
  • The part of the organism
  • Function (what it does or accomplishes)
  • Mechanism (how it does it)
  • Context (environment, conditions, constraints, stressors)

An example biological strategy:

Strategy: The harbor seal’s whiskers possess a specialized undulated surface structure that reduces vortex-induced vibrations as the whiskers move through water.
Organism: harbor seal
Part of: Whiskers
Function: reduce vibrations
Mechanism: a specialized undulated surface structure
Context: Moving through water
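The five elements above map naturally onto a simple record type. A minimal sketch in Python (the class and field names are my own, not an established schema):

```python
from dataclasses import dataclass

@dataclass
class BiologicalStrategy:
    """One extracted strategy: who does what, how, and under which conditions."""
    organism: str   # the organism or ecosystem
    part: str       # the part of the organism
    function: str   # what it does or accomplishes
    mechanism: str  # how it does it
    context: str    # environment, conditions, constraints, stressors

# The harbor seal example encoded as a record:
seal_whiskers = BiologicalStrategy(
    organism="harbor seal",
    part="whiskers",
    function="reduce vibrations",
    mechanism="a specialized undulated surface structure",
    context="moving through water",
)
```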

A bio-inspired design strategy is a statement that articulates the function, mechanism, and context without using biological terms. Instead, biological terms are replaced with discipline-neutral synonyms (e.g. replace “fur” with “fibers,” or “skin” with “membrane”).

An example design strategy:

Strategy: While moving through a liquid, a small diameter fiber with an undulated surface structure reduces vortex-induced vibrations.

Inputs: raw text (such as the title and abstract of a biology journal article) #13
Outputs: Biological strategy, design strategy, Organism, Part of, Function, Mechanism, Context.

Early evaluations of large language models such as GPT-3 Davinci have shown promise for generating these outputs given a proper prompt and a few (1–3) training examples.
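A few-shot prompt of the kind described could be assembled as follows (a sketch only: the instruction wording and field layout are illustrative, and the resulting string would be sent to a model such as GPT-3 through the provider's completion API):

```python
def build_extraction_prompt(examples, new_text):
    """Assemble a few-shot prompt: labeled examples followed by the new passage."""
    parts = ["Extract the biological strategy from each passage.\n"]
    for text, fields in examples:
        parts.append(f"Passage: {text}")
        for name, value in fields.items():
            parts.append(f"{name}: {value}")
        parts.append("")  # blank line between examples
    parts.append(f"Passage: {new_text}")
    parts.append("Organism:")  # cue the model to continue the labeled fields
    return "\n".join(parts)

examples = [(
    "The harbor seal's whiskers possess a specialized undulated surface "
    "structure that reduces vortex-induced vibrations as the whiskers move "
    "through water.",
    {"Organism": "harbor seal", "Part of": "whiskers",
     "Function": "reduce vibrations",
     "Mechanism": "a specialized undulated surface structure",
     "Context": "moving through water"},
)]

prompt = build_extraction_prompt(examples, "Gecko feet adhere to smooth walls.")
```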

AskNature has manually curated a list of biological strategies on its website, based on research papers. These may be helpful in training a machine learning model. #12

Open source large language models

  • OPT
  • GPT-J, GPT-NeoX-20B
  • Bloom

Commercial large language models

  • OpenAI GPT-3

Alternatives to large language models for text summarization (may not work as well)
https://aws.amazon.com/blogs/machine-learning/part-1-set-up-a-text-summarization-project-with-hugging-face-transformers/
https://www.projectpro.io/article/transformers-bart-model-explained/553#mcetoc_1fq07mh0qa
https://paperswithcode.com/sota/text-summarization-on-gigaword

bruffridge transferred this issue from nasa-petal/PeTaL-labeller on Jun 13, 2022
@bruffridge (Member Author)

Perhaps another way to think about the problem is named entity recognition; part-of-speech tagging may also be helpful. "a specialized undulated surface structure" - form, "reduce vibrations" - function, "moving through water" - context.

@bruffridge (Member Author) commented Jun 14, 2022

@bruffridge (Member Author)

Since multiple people will be working on this issue, it may be helpful to create different branches in the bio-strategy-extractor repository to track code and results for different evaluated methods. Also, please coordinate efforts so different approaches to solving the problem can be explored and results compared.

@bruffridge (Member Author)

A colleague just informed me of a paper entitled "Categorizing biological information based on function–morphology for bioinspired conceptual design". I uploaded it to the Literature folder in Box. Please review it for potential application to this problem.

@bruffridge (Member Author)

A Question Answering model may be another approach worth looking at.
https://huggingface.co/models?pipeline_tag=question-answering&sort=downloads
https://paperswithcode.com/task/question-answering
https://github.com/sebastianruder/NLP-progress/blob/master/english/question_answering.md

For example, check out the results from this QA model when asked "What is the primary function?" and "What reduces vibrations?"

@rishub-tamirisa (Member) commented Jun 15, 2022

This paper, "Rhetorical Sentence Categorization for Scientific Paper using Word2Vec Semantic Representation", seems interesting. To enable a model to actually locate which parts of a paper or abstract describe biomimetic function, we could hand-annotate a few known sentences and measure Word2Vec cosine similarity between the labeled and unlabeled sentences (comparing the averages of the word vectors in each sentence).

I'm not sure if this would work well, but the reason something like this might be useful is that summarizing an abstract may not always automatically identify the specific biomimetic function(s) described.

> Perhaps another way to think about the problem is named entity recognition; part-of-speech tagging may also be helpful. "a specialized undulated surface structure" - form, "reduce vibrations" - function, "moving through water" - context.

I agree that this could work well, probably by fine-tuning an existing NER model, but the challenge would be to create the training data.
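The cosine-similarity comparison described above can be sketched with plain NumPy. The word vectors here are random stand-ins for trained Word2Vec embeddings, so the score itself is meaningless; the point is the averaging-and-comparing mechanics:

```python
import numpy as np

def sentence_vector(sentence, word_vectors):
    """Average the vectors of the words we have embeddings for."""
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in embeddings; a real system would load trained Word2Vec vectors.
rng = np.random.default_rng(0)
vocab = "whiskers reduce vibrations water seals swim fast".split()
word_vectors = {w: rng.normal(size=50) for w in vocab}

labeled = "whiskers reduce vibrations"          # hand-annotated function sentence
candidate = "whiskers reduce vibrations water"  # unlabeled sentence to score
score = cosine_similarity(
    sentence_vector(labeled, word_vectors),
    sentence_vector(candidate, word_vectors),
)
```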

@hschilling
Rishub, for getting the training data, we do have access to Amazon Ground Truth, a labeling service, if that helps

@bruffridge (Member Author)

Semantic Role Labelling might be useful for extracting Who (form), What (function), Where (context):
https://paperswithcode.com/task/semantic-role-labeling
https://nlpprogress.com/english/semantic_role_labeling.html
https://web.stanford.edu/~jurafsky/slp3/19.pdf

@abalai-ash (Contributor) commented Jun 16, 2022

This survey of summarization techniques may be of use:

A couple of papers worth looking into:

More papers on fine-tuning:

This might help with what I need to do. Not sure if this will be useful to anybody else:

@rishub-tamirisa (Member)

https://arxiv.org/pdf/2106.01592v1.pdf : Biomimicry AI overview

@rishub-tamirisa (Member) commented Jun 16, 2022

@abalai-ash (Contributor)

Steps for an NLP pipeline that we could implement after further literature research:

  1. Sentence segmentation: break the given paragraph into separate sentences.
  2. Word tokenization: extract the words from each sentence.
  3. Part-of-speech tagging: identify each word's part of speech.
  4. Lemmatization: reduce each word to its base form, so that e.g. "germs" and "germ" are treated as the same term.
  5. Stop-word identification: English has many filler words that appear very frequently and introduce noise.
  6. Dependency parsing: use grammatical rules to work out how the words relate to one another.
  7. Entity analysis: identify all of the important words or “entities” in the text.
  8. Pronoun resolution: track what each pronoun refers to in the context of the sentence.
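The first few steps can be illustrated without any NLP library (a toy sketch; a real pipeline would use spaCy or NLTK, and the stop-word list here is deliberately tiny):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "as", "that", "through"}

def segment_sentences(text):
    """Step 1: naive sentence segmentation on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Step 2: word tokenization (lowercased, punctuation stripped)."""
    return re.findall(r"[a-z]+", sentence.lower())

def remove_stop_words(tokens):
    """Step 5: drop high-frequency filler words."""
    return [t for t in tokens if t not in STOP_WORDS]

text = "The seal's whiskers reduce vibrations. They move through water."
sentences = segment_sentences(text)
tokens = [remove_stop_words(tokenize(s)) for s in sentences]
```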

@bruffridge (Member Author) commented Jun 17, 2022

@rishub-tamirisa Good find on the biomimicry function identification paper. Here's a list of subsequent papers that cited this one, many of which appear to be relevant. https://www.lens.org/lens/scholar/article/088-258-820-290-519/citations/citing

Here's one in particular that looks interesting: http://ceur-ws.org/Vol-2831/paper4.pdf

@rishub-tamirisa (Member)

Thanks. That paper you linked does look interesting. "The preliminary results indicate that the ability to add ontologies to IBID allows it to extract meaning from new documents." I'm definitely going to take a look at the rest of it.

@bruffridge (Member Author)

[Screenshot of a figure from Nagel's paper]
From Nagel: https://www.mdpi.com/2411-9660/2/4/47/htm

@rishub-tamirisa (Member)

https://arxiv.org/pdf/1909.07755.pdf : SpERT Paper

@bruffridge (Member Author)

It may be easier to focus initially on one function, then expand the pipeline to include other functions. For example, take the function "modify/convert thermal energy". The problem then becomes identifying sections of text that describe managing thermal energy. Next comes identifying the "how": what the text describes as the mechanism responsible for managing thermal energy. Eventually we may want to classify these various "hows", or strategies, into different categories (form, material, structure, process, or system).
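A crude first pass at the single-function idea might be a keyword filter over sentences (the keyword list is my own guess, and a real version would use a trained classifier):

```python
THERMAL_KEYWORDS = {"thermal", "heat", "temperature", "cooling", "warming", "insulation"}

def mentions_thermal_energy(sentence):
    """Flag sentences that plausibly describe managing thermal energy."""
    words = set(sentence.lower().replace(",", " ").replace(".", " ").split())
    return bool(words & THERMAL_KEYWORDS)

sentences = [
    "Reindeer fur provides insulation of prime quality.",
    "The whiskers reduce vortex-induced vibrations.",
]
hits = [s for s in sentences if mentions_thermal_energy(s)]
```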

@abalai-ash (Contributor)

I will look at more methods for unsupervised entity recognition. I will also look for DANE tutorials, although searching mostly turns up DaNLP or DaNE results, which isn't what the paper was discussing.

@rishub-tamirisa (Member) commented Jul 1, 2022

Just uploaded a notebook that shows some results from SciBERT-FOBIE; see #3.

@rblumin24 commented Jul 1, 2022

https://academic.oup.com/nar/article/43/W1/W535/2467892?login=true
https://academic.oup.com/nar/article/36/suppl_2/W399/2506595?login=false
https://academic.oup.com/bioinformatics/article/27/19/2721/231031

PolySearch is a classification method that BioNER used as a reference, so I looked into it and it seems pretty helpful. I'm still looking for code, though.

OrganismTagger is another classifier used by the creators of BioNER, and it categorizes biomedical words, which I thought could be helpful. Again, I have only been able to find articles, not code at the moment.

@bruffridge (Member Author)

Georgia Institute of Technology has been researching the use of NLP to build Structure–Behavior–Function models from text.

IBID: https://dilab.gatech.edu/ibid/ (ongoing)
DANE: http://dilab.cc.gatech.edu/dane/ (past)

@rishub-tamirisa (Member)

I just re-trained SciBERT-FOBIE on a cleaned version of the dataset. ( #6 )

However, token imbalance is still a big issue.

@rblumin24
Here's some more information on OrganismTagger: https://www.semanticsoftware.info/system/files/orgtagger-1.3a.pdf
And here's the BioNER repo: https://github.com/phil1995/BioNER

@rishub-tamirisa (Member)

https://staff.science.uva.nl/c.monz/ltl/publications/mtsummit2017.pdf (Fine-tuning for translation models)

@rishub-tamirisa (Member)

@bruffridge I originally thought that HuggingFace allowed you to train a summarization model on top of any existing language model, but I realized that since models like SciBERT and BERT are encoder-only, you still need to train the summarization decoder from scratch. There are some proposed methods for using pretrained encoders for summarization, but none are generally implemented in existing APIs like HuggingFace, so I would need to fork from one of these papers or implement it on my own. I'm still reading through "COVIDSum: A linguistically enriched SciBERT-based summarization model for COVID-19 scientific papers" and "Text Summarization with Pretrained Encoders" to get a better idea of how exactly it works. People may also have uploaded existing academic-paper summarizers to HuggingFace, so I will check that as well.

In the meantime, since using SciBERT directly for summarization is non-trivial, I could still test the AskNature data on a more general English-language summarization model, like BertSum or T5. The results likely won't be as good because of the number of low-frequency/rare terms present, but we can still see how well it works.

@rishub-tamirisa (Member)

@bruffridge Google Colab works much better, thanks for the suggestion.

Model is currently training: Colab link

@hschilling I might not need to use AWS for this task after all! Although I'm still interested in getting it set up to learn more about how it works, if that's possible. After this model trains, I'd like to either try and implement the paper in the above comment for using SciBERT for text summarization, or fine-tune SciBERT as a language model directly on our existing corpora of biomimicry papers in Box. These might be more compute-intensive, in which case AWS resources might work better than Colab, which has 16 GB memory in the free version.

@hschilling
@rishub-tamirisa ok, I did already request access to AWS for you

@bruffridge (Member Author) commented Aug 3, 2022

@bruffridge (Member Author) commented Aug 3, 2022

A few potentially useful datasets for Scientific NER:

@bruffridge (Member Author) commented Aug 3, 2022

This abstract contains two separate actor (species) / function (what) / mechanism (how) / context sets, which I've labelled below. We may need to build an annotated dataset to train and evaluate models on this task. With coreference resolution it may be possible to link the actor of the first set, "birds", back to "Poorwills".

Compared to mammals, there are relatively few studies examining heterothermy in birds. In 13 bird families known to contain heterothermic species, the common poorwill (Phalaenoptilus nuttallii) is the only species that ostensibly hibernates. We used temperature-sensitive radio-transmitters to collect roost and skin temperature (Tskin) data, and winter roost preferences for free-ranging poorwills in southern Arizona. Further, to determine the effect of passive rewarming on torpor bout duration and {active rewarming}[what-1] (i.e., {the use of metabolic heat to increase Tskin}[how-1]), we experimentally shaded seven {birds}[actor-1] {during winter}[context-1] to prevent them from passively rewarming via solar radiation. {Poorwills}[actor-2] {selected winter roosts that were open to the south or southwest}[how-2], facilitating {passive solar warming}[what-2] {in the late afternoon}[context-2]. Shaded birds actively rewarmed following at least 3 days of continuous torpor. Average torpor bout duration by shaded birds was 122 h and ranged from 91 to 164 h. Active rewarming by shaded birds occurred on significantly warmer days than those when poorwills remained torpid. One shaded bird remained inactive for 45 days, during which it spontaneously rewarmed actively on eight occasions. Our findings show that during winter poorwills exhibit physiological patterns and active rewarming similar to hibernating mammals.
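The inline {span}[label] annotations used above can be pulled out with a small regular expression (a sketch that handles flat spans only; nested sub-spans would need a proper parser):

```python
import re

# Matches {span text}[label] where label is e.g. actor-1, how-2, what, context-2.
SPAN_RE = re.compile(r"\{([^{}]+)\}\[([a-z]+-?\d*)\]")

def extract_spans(annotated_text):
    """Return (label, span_text) pairs from {span}[label] markup."""
    return [(label, span) for span, label in SPAN_RE.findall(annotated_text)]

sample = ("{Poorwills}[actor-2] {selected winter roosts that were open to the "
          "south or southwest}[how-2], facilitating {passive solar warming}[what-2] "
          "{in the late afternoon}[context-2].")
spans = extract_spans(sample)
```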

@bruffridge (Member Author) commented Aug 3, 2022

Another example. This one includes a sub-span for context within 'what'. Eventually we may want to break these spans down further into sub-spans. For example, one 'what' into two: 'regulate body temperature' and 'regulate brain temperature'; one 'how' into three: 'panting through the nose', 'panting through the mouth', and 'selective brain cooling'.

Reindeer (Rangifer tarandus) are protected against the Arctic winter cold by thick fur of prime insulating capacity and hence have few avenues of heat loss during work. We have investigated how these animals regulate brain temperature under heavy heat loads. Animals were instrumented for measurements of blood flow, tissue temperatures and respiratory frequency (f) under full anaesthesia, whereas measurements were also made in fully conscious animals while in a climatic chamber or running on a treadmill. At rest, brain temperature (Tbrain) rose from 38.5±0.1°C at 10°C to 39.5±0.2°C at 50°C, while f increased from ×7 to ×250 breaths min–1, with a change to open-mouth panting (OMP) at Tbrain 39.0±0.1°C, and carotid and sublingual arterial flows increased by 160% and 500%, respectively. OMP caused jugular venous and carotid arterial temperatures to drop, presumably owing to a much increased respiratory evaporative heat loss. Angular oculi vein (AOV) flow was negligible until Tbrain reached 38.9±0.1°C, but it increased to 0.81 ml min–1 kg–1 at Tbrain 39.2±0.2°C. Bilateral occlusion of both AOVs induced OMP and a rise in Tbrain and f at Tbrain >38.8°C. We propose that {reindeer}[actor] {regulate body and, particularly, brain temperature {under heavy heat loads}[context]}[what] by {a combination of panting, at first through the nose, but later, when the heat load and the minute volume requirements increase due to exercise, primarily through the mouth and that they eventually resort to selective brain cooling}[how].
