Learning-to-understand-scientific-experiments

Overview

The goal of the project is to parse scientific experiment texts (recipes) into a machine-readable format. The project proposes attention-based copy-pointer and generate methods for relation extraction using k-nearest neighbors, and incorporates Sci-BERT and multi-layer perceptron models to achieve benchmark results on the wet labs corpus.

Poster: https://drive.google.com/open?id=14SFFdDdXgjB6iRXraWw5IvQQWz09oAUp

Crawling

python crawl.py

Nearest neighbors in tf-idf space

python build_index.py
python test_nn.py replaced_tfidf.pkl replaced.annoy replaced_sentences.txt original_sentences.txt

Test upper bound performance

python test_upperbound.py
python uppperbound_recall.py
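
The scripts above are not described further here. One plausible reading of an "upper bound" for a retrieval-based model is that a gold relation can only be copied if it appears among the relations of the retrieved neighbors. The sketch below computes that recall under this assumption; the function name, the data layout (sets of relation labels per sentence) and the labels are hypothetical and are not taken from test_upperbound.py.

```python
def copy_upper_bound_recall(gold_relations, neighbor_relations):
    """Fraction of gold relations found in at least one retrieved neighbor.

    gold_relations: list of sets, one per query sentence.
    neighbor_relations: list of lists of sets, one inner list per query,
        holding the relation labels of each retrieved neighbor.
    """
    total = 0
    recoverable = 0
    for gold, neighbors in zip(gold_relations, neighbor_relations):
        # Everything a pure copy model could possibly output for this query.
        available = set().union(*neighbors)
        total += len(gold)
        recoverable += len(gold & available)
    return recoverable / total if total else 0.0


# Toy data with illustrative labels (assumption, not real corpus output).
gold = [{"Acts-on", "Site"}, {"Using"}]
neighbors = [[{"Acts-on"}, {"Acts-on", "Creates"}], [{"Using"}, set()]]
recall = copy_upper_bound_recall(gold, neighbors)
print(f"upper-bound recall: {recall:.3f}")
```

A low value under this measure would suggest that retrieval, rather than the edit model, limits performance.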

Baselines

Relation Frequency

python baselines/relation_counts.py wetlabs_train.json wetlabs_val.json wetlabs_test.json

Majority Class prediction

python baselines/majority_class.py wetlabs_train.json wetlabs_val.json
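
As a rough illustration of what the two baselines measure, the sketch below counts relation labels and then predicts the most frequent training label for every validation example. The record layout (a "relation" key per example) and the labels are assumptions made for the example; the actual schema of the wetlabs_*.json files is not documented here.

```python
from collections import Counter


def relation_counts(examples):
    """Count how often each relation label occurs."""
    return Counter(example["relation"] for example in examples)


def majority_class_accuracy(train_examples, eval_examples):
    """Predict the most frequent training label for every evaluation example."""
    majority_label, _ = relation_counts(train_examples).most_common(1)[0]
    correct = sum(1 for ex in eval_examples if ex["relation"] == majority_label)
    return majority_label, correct / len(eval_examples)


# Toy records; the "relation" key and the labels are hypothetical.
train_examples = [
    {"relation": "Acts-on"},
    {"relation": "Acts-on"},
    {"relation": "Site"},
]
val_examples = [{"relation": "Acts-on"}, {"relation": "Site"}]

label, accuracy = majority_class_accuracy(train_examples, val_examples)
print(label, accuracy)
```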

Retrieve and Edit Model

Preprocess the protocols data (requires WLP-Dataset)

python preprocess_wetlabs.py

This creates 3 files: wetlabs_train.json, wetlabs_val.json, wetlabs_test.json

Use wetlabs_train.json to build the Annoy and vectorizer files for KNN retrieval

python build_index_from_json.py <tfidf/scibert>

This creates original.annoy, replaced.annoy, original_tfidf.pkl and replaced_tfidf.pkl. The Annoy indices allow fast nearest neighbor search; the tfidf.pkl files store the idf components and the vocabulary, which are required to vectorize any test sentence.
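
To make the role of the stored vocabulary and idf values concrete, here is a minimal, dependency-free sketch. It fits a tf-idf model on training sentences, vectorizes an unseen sentence with the stored vocabulary and idf, and ranks training sentences by cosine similarity. Exhaustive search stands in for Annoy's approximate search, and the tokenization, the smoothing formula and all function names are assumptions rather than what build_index_from_json.py does.

```python
import math
from collections import Counter


def fit_tfidf(sentences):
    """Learn (vocabulary, idf) from whitespace-tokenized sentences."""
    doc_freq = Counter()
    for sentence in sentences:
        doc_freq.update(set(sentence.lower().split()))
    vocabulary = {token: i for i, token in enumerate(sorted(doc_freq))}
    n_docs = len(sentences)
    # Smoothed idf, so every known token gets a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return vocabulary, idf


def vectorize(sentence, vocabulary, idf):
    """Sparse, L2-normalized tf-idf vector (dict: index -> weight).

    Tokens outside the stored vocabulary are dropped, which is why the
    vocabulary and idf must be saved alongside the index.
    """
    counts = Counter(t for t in sentence.lower().split() if t in vocabulary)
    vec = {vocabulary[t]: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {i: w / norm for i, w in vec.items()} if norm else {}


def nearest(query_vec, index_vecs, k):
    """Exhaustive cosine search; an Annoy index would approximate this step."""
    scores = [
        (sum(w * other.get(i, 0.0) for i, w in query_vec.items()), j)
        for j, other in enumerate(index_vecs)
    ]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


train = [
    "add 5 ml of buffer to the tube",
    "centrifuge the tube for 5 minutes",
    "incubate the plate overnight",
]
vocabulary, idf = fit_tfidf(train)
index_vecs = [vectorize(s, vocabulary, idf) for s in train]
query = vectorize("centrifuge the sample for 10 minutes", vocabulary, idf)
top = nearest(query, index_vecs, k=2)
print([train[j] for j in top])
```

Exhaustive search is linear in the size of the training set per query, which is the cost an approximate index is meant to avoid.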

Generate sentence embeddings using BERT (or bioBERT)

python copy_model/prepare_data.py wetlabs_train.json <scibert/biobert> train_embeddings.pkl
python copy_model/prepare_data.py wetlabs_val.json <scibert/biobert> val_embeddings.pkl
python copy_model/prepare_data.py wetlabs_test.json <scibert/biobert> test_embeddings.pkl

This produces train_embeddings.pkl, val_embeddings.pkl and test_embeddings.pkl. Pass biobert to generate embeddings from bioBERT, or bert to use base BERT.

Use the pre-generated Annoy index, vectorizer and BERT embeddings to pre-compute nearest neighbors (for the edit model)

python copy_model/prepare_context.py replaced.annoy replaced_tfidf.pkl train_embeddings.pkl train_embeddings.pkl train.pkl tfidf 4 5
python copy_model/prepare_context.py replaced.annoy replaced_tfidf.pkl train_embeddings.pkl test_embeddings.pkl test.pkl tfidf 4 5
python copy_model/prepare_context.py replaced.annoy replaced_tfidf.pkl train_embeddings.pkl val_embeddings.pkl val.pkl tfidf 4 5

OR, to retrieve with SciBERT instead of tf-idf:

python copy_model/prepare_context.py original.annoy original_bert.pkl train_embeddings.pkl train_embeddings.pkl train.pkl scibert 4 5
python copy_model/prepare_context.py original.annoy original_bert.pkl train_embeddings.pkl test_embeddings.pkl test.pkl scibert 4 5
python copy_model/prepare_context.py original.annoy original_bert.pkl train_embeddings.pkl val_embeddings.pkl val.pkl scibert 4 5

This computes the nearest neighbors in the training set for each sentence in the train, val and test sets, and writes them to train.pkl, val.pkl and test.pkl.
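
The pre-computation step can be pictured as follows. This is a sketch over dense vectors using plain Python, not the logic of prepare_context.py. In particular, skipping a training sentence's own entry when the queries are the training set is an assumption about a sensible design, and the function names and the `queries_are_train` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def precompute_neighbors(query_vecs, train_vecs, k, queries_are_train=False):
    """For each query, return the indices of its k most similar training vectors."""
    neighbors = []
    for qi, query in enumerate(query_vecs):
        # When querying the training set against itself, drop the trivial self-match.
        candidates = (
            j for j in range(len(train_vecs)) if not (queries_are_train and j == qi)
        )
        ranked = sorted(candidates, key=lambda j: (-cosine(query, train_vecs[j]), j))
        neighbors.append(ranked[:k])
    return neighbors


train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
val_vecs = [[0.1, 1.0]]

train_ctx = precompute_neighbors(train_vecs, train_vecs, k=1, queries_are_train=True)
val_ctx = precompute_neighbors(val_vecs, train_vecs, k=1)
print(train_ctx, val_ctx)
```

Without the self-match filter, every training sentence would retrieve itself first, and a copy-based model could learn to copy its own label instead of learning from neighbors.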

Training using copy and generate mode

python copy_model/train.py --copy --generate --traindata PATH/TO/train.pkl --valdata PATH/TO/val.pkl --model_path OUTPUTDIR/model.pt

Training using copy mode

python copy_model/train.py --copy --no-generate --traindata PATH/TO/train.pkl --valdata PATH/TO/val.pkl --model_path OUTPUTDIR/model.pt

Training using generate mode

python copy_model/train.py --generate --no-copy --traindata PATH/TO/train.pkl --valdata PATH/TO/val.pkl --model_path OUTPUTDIR/model.pt
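
The three training modes can be read as three ways of producing a distribution over relation labels. The sketch below shows a generic pointer-style mixture: a copy distribution built from attention over the retrieved neighbors' labels, a generate distribution from a classifier, and a gate that blends them. It is offered as an illustration of the general idea only. The gate, the function names and the labels are assumptions, and the architecture in copy_model/train.py may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def copy_distribution(attention_scores, neighbor_labels, labels):
    """Spread attention mass over the labels carried by the retrieved neighbors."""
    attention = softmax(attention_scores)
    dist = {label: 0.0 for label in labels}
    for weight, label in zip(attention, neighbor_labels):
        dist[label] += weight
    return dist


def mix(copy_dist, generate_logits, labels, gate, use_copy=True, use_generate=True):
    """Combine the two modes; gate is the probability of generating."""
    if not (use_copy or use_generate):
        raise ValueError("enable at least one of copy / generate")
    generate_dist = dict(zip(labels, softmax(generate_logits)))
    if not use_generate:
        return copy_dist
    if not use_copy:
        return generate_dist
    return {l: gate * generate_dist[l] + (1.0 - gate) * copy_dist[l] for l in labels}


labels = ["Acts-on", "Site", "Using"]  # illustrative label set
copy_dist = copy_distribution([2.0, 0.5, 0.5], ["Acts-on", "Site", "Acts-on"], labels)
mixed = mix(copy_dist, [0.1, 0.2, 1.5], labels, gate=0.3)
print(mixed)
```

In copy-only mode a label that no neighbor carries gets zero probability, which is the case the generate component covers.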

Test model

Set generate and copy arguments according to how training was done. For example:

python test.py --generate --no-copy --valdata val.pkl --model_path models/generate.pt --test_output_path generate
python test.py --no-generate --copy --valdata val.pkl --model_path models/copy.pt --test_output_path copy
python test.py --generate --copy --valdata val.pkl --model_path models/copy_generate.pt --test_output_path copy_generate
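
For readers wiring up a similar command line, paired flags such as --copy / --no-copy can be declared with argparse as sketched below. The flag names are taken from the commands above. The defaults, the validation step and the function names are assumptions, and this is not a copy of how test.py parses its arguments.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical evaluation CLI")
    # BooleanOptionalAction (Python 3.9+) registers both --copy and --no-copy.
    parser.add_argument("--copy", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--generate", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--valdata", required=True)
    parser.add_argument("--model_path", required=True)
    parser.add_argument("--test_output_path", required=True)
    return parser


def check_modes(args):
    """Reject the one combination that leaves the model with no output mode."""
    if not (args.copy or args.generate):
        raise SystemExit("enable at least one of --copy / --generate")
    return args


# Parse an explicit argument list so the example runs without a shell.
args = check_modes(build_parser().parse_args([
    "--generate", "--no-copy",
    "--valdata", "val.pkl",
    "--model_path", "models/generate.pt",
    "--test_output_path", "generate",
]))
print(args.copy, args.generate, args.model_path)
```

Passing the same mode flags at test time as at training time matters because the saved model is evaluated with whichever components the flags enable.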

