E-commerce data assistant: Injecting knowledge in seq2seq SQL generation

This code is based on:

@inproceedings{Scholak2021:PICARD,
  author = {Torsten Scholak and Nathan Schucher and Dzmitry Bahdanau},
  title = {PICARD - Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models},
  booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  year = {2021},
  publisher = {Association for Computational Linguistics},
}

Prerequisites

This repository uses git submodules. Clone it like this:

$ git clone git@github.com:elena-soare/picard.git
$ cd picard
$ git submodule update --init --recursive

Then build the Docker image:

$ make build-eval-image

Warning: building the image takes around 2 hours the first time.

Training

The training script is located in seq2seq/run_seq2seq.py.

Before running it, build the Docker image using the Prerequisites command above, then point the 'docker run' invocation in the Makefile at your image. After that, you can simply run:

$ make train

The model will be trained on the Spider dataset by default.

To train on Datasaur instead, set the 'model_name_or_path' field in configs/train.json to 'datasaur'.

To enable Foreign Keys Serialization, set 'schema_serialization_with_foreign_keys' to true in the same config file.

To enable Schema Augmentation, set 'schema_serialization_with_db_description' to true in the same config file.
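
For orientation, a minimal configs/train.json excerpt combining these options might look as follows. This is a sketch, not a complete config: the real file contains many more settings, and the exact 'model_name_or_path' key spelling follows the usual Hugging Face convention rather than being confirmed here.

{
  "model_name_or_path": "datasaur",
  "schema_serialization_with_foreign_keys": true,
  "schema_serialization_with_db_description": true
}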

Evaluation

The evaluation script is located in seq2seq/run_seq2seq.py. You can run it with:

$ make eval

By default, the evaluation will be run on the Spider evaluation set.

The default configuration is stored in configs/eval.json.

To enable Foreign Keys Serialization, set "schema_serialization_with_foreign_keys" to true and set the model name or path to the Hugging Face fine-tuned model 'elena-soare/bat-fk-base'.

To run the Pre-trained E-commerce model, keep the baseline configuration and set the model name or path to the Hugging Face checkpoint (the default one).

To enable Schema Augmentation, set "schema_serialization_with_db_description" to true and set the model name or path to the corresponding Hugging Face fine-tuned model.
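
For example, the foreign-key variant of configs/eval.json would pair the flag with the fine-tuned checkpoint named above. A sketch only, with other fields omitted and the 'model_name_or_path' key name assumed:

{
  "model_name_or_path": "elena-soare/bat-fk-base",
  "schema_serialization_with_foreign_keys": true
}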

Serving

A trained model can be served using the seq2seq/serve_seq2seq.py script.

The configuration file can be found in configs/serve.json.

You can start serving with:

$ make serve
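
Once the server is up, you can send it a question over HTTP. The endpoint shape below is carried over from the upstream PICARD project and is an assumption; 'world_1' is one of the Spider databases:

$ # endpoint shape http://localhost:8000/ask/{db_id}/{question} assumed from upstream PICARD
$ curl -s http://localhost:8000/ask/world_1/how%20many%20countries%20are%20there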

Pre-training Script

The pre-training script is located in /picard/pre-training/script.ipynb. To run it, place the crawled-data file (https://huggingface.co/datasets/elena-soare/crawled-ecommerce/blob/main/train.json) in the same directory.
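
For example, you can fetch the file directly, assuming the standard Hugging Face direct-download URL scheme (blob/ replaced by resolve/):

$ wget https://huggingface.co/datasets/elena-soare/crawled-ecommerce/resolve/main/train.json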

The files used to crawl e-commerce Common Crawl data are in /picard/pre-training/crawling_data/. To run them, you need an AWS account, an AWS Athena columnar index over the Common Crawl data (see the tutorial at https://skeptric.com/common-crawl-index-athena/), and a file with your authentication credentials.

Build a virtualenv and install the requirements:

$ pip3 install -r requirements
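
If you still need to create and activate the virtualenv itself, a minimal sketch using Python's built-in venv module (the environment name '.venv' is illustrative):

$ python3 -m venv .venv   # any environment name works
$ source .venv/bin/activate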

Paste credentials.csv with your AWS credentials into the parent directory of complex_task_search (the region should be us-east-1 to be co-located with the Common Crawl data). Format:

Access key ID,Secret access key,staging path,region
XXX,YYY,ZZZ,us-east-1

The function run_build_corpus() in common_crawl.py connects to AWS Athena and crawls the data. The code for converting HTML pages to text and sanitizing the crawled data is in html_to_text.py.
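
As a minimal driver sketch, assuming you run it from /picard/pre-training/crawling_data/ and that run_build_corpus() takes no required arguments (check common_crawl.py for the actual signature):

# Hypothetical driver for the corpus build. Requires credentials.csv
# in place as described above. run_build_corpus() is assumed to take
# no arguments; see common_crawl.py for its real signature.
from common_crawl import run_build_corpus

if __name__ == "__main__":
    run_build_corpus()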
