This code is based on:
@inproceedings{Scholak2021:PICARD,
author = {Torsten Scholak and Nathan Schucher and Dzmitry Bahdanau},
title = {PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models},
booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
year = {2021},
publisher = {Association for Computational Linguistics},
}
This repository uses git submodules. Clone it like this:
$ git clone git@github.com:elena-soare/picard.git
$ cd picard
$ git submodule update --init --recursive
The following command builds the Docker image used for evaluation:
$ make build-eval-image
Warning: building the image takes around 2 hours the first time.
The training script is located in seq2seq/run_seq2seq.py.
Before running it, you need to build the Docker image using the prerequisite command above, then change the image in the 'docker run' command in the Makefile to your own image. Then you can simply run:
$ make train
The model will be trained on the Spider dataset by default.
To run it on Datasaur, change the 'model-name-or-path' field in configs/train.json to 'datasaur'.
To enable Foreign Keys Serialization, set 'schema_serialization_with_foreign_keys' to true in the same config file.
To enable Schema Augmentation, set 'schema_serialization_with_db_description' to true in the same config file.
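For illustration, the sketch below flips both options in configs/train.json before running make train; only the two field names quoted above are taken from this README, everything else in the config is left untouched:

import json

# Toggle the serialization options described above in configs/train.json.
# Only the two field names quoted in this README are assumed; the rest of the
# config is preserved as-is.
path = "configs/train.json"
with open(path) as f:
    cfg = json.load(f)

cfg["schema_serialization_with_foreign_keys"] = True    # Foreign Keys Serialization
cfg["schema_serialization_with_db_description"] = True  # Schema Augmentation

with open(path, "w") as f:
    json.dump(cfg, f, indent=4)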
The evaluation script is located in seq2seq/run_seq2seq.py.
You can run it with:
$ make eval
By default, the evaluation will be run on the Spider evaluation set.
The default configuration is stored in configs/eval.json.
To enable Foreign Keys Serialization, set "schema_serialization_with_foreign_keys" to true and set the model name or path to the Hugging Face fine-tuned checkpoint 'elena-soare/bat-fk-base'.
To run the pre-trained e-commerce model, keep the baseline configuration and set the model name or path to the Hugging Face checkpoint (the default one).
To enable Schema Augmentation, set "schema_serialization_with_db_description" to true and set the model name or path to the corresponding Hugging Face fine-tuned model.
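For example, to evaluate the foreign-keys model, configs/eval.json could be updated as in the sketch below; the 'model_name_or_path' key is an assumption (standard Hugging Face naming), so use whichever key the config file actually stores the model in:

import json

# Point the evaluation config at the fine-tuned foreign-keys checkpoint and
# enable the matching serialization flag. "model_name_or_path" is an assumed
# key name; adjust it to the key configs/eval.json actually uses.
path = "configs/eval.json"
with open(path) as f:
    cfg = json.load(f)

cfg["schema_serialization_with_foreign_keys"] = True
cfg["model_name_or_path"] = "elena-soare/bat-fk-base"

with open(path, "w") as f:
    json.dump(cfg, f, indent=4)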
A trained model can be served using the seq2seq/serve_seq2seq.py script.
The configuration file can be found in configs/serve.json.
You can start serving with:
$ make serve
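Once the server is up, you can query it over HTTP. The host, port, and /ask/{db_id}/{question} route in the sketch below are assumptions, not verified against this repo; check configs/serve.json and seq2seq/serve_seq2seq.py for the actual values:

from urllib.parse import quote

import requests

# Hypothetical client call: host, port, route, database id, and question are
# all placeholders.
db_id = "concert_singer"
question = "How many singers do we have?"
url = f"http://localhost:8000/ask/{db_id}/{quote(question)}"
print(requests.get(url).json())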
The pre-training script is located in /picard/pre-training/script.ipynb. To run it, you need to place the file with the crawled data (https://huggingface.co/datasets/elena-soare/crawled-ecommerce/blob/main/train.json) in the same directory.
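A quick way to fetch that file is via the huggingface_hub client; the target directory below is an assumption, the point is simply to end up with train.json next to script.ipynb:

from huggingface_hub import hf_hub_download

# Download the crawled e-commerce corpus from the Hugging Face Hub so that the
# pre-training notebook can find it. The local_dir is an assumption; put the
# file wherever script.ipynb lives.
hf_hub_download(
    repo_id="elena-soare/crawled-ecommerce",
    repo_type="dataset",
    filename="train.json",
    local_dir="pre-training",
)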
The files used to crawl e-commerce data from Common Crawl are in /picard/pre-training/crawling_data/. To run them, you need an AWS account, an AWS Athena columnar index set up for Common Crawl (see the tutorial at https://skeptric.com/common-crawl-index-athena/), and a file with the authentication credentials.
Create a virtualenv and install the requirements:
pip3 install -r requirements
Paste credentials.csv into the complex_task_search parent directory for the AWS credentials (the region should be us-east-1 to be co-located with the Common Crawl data). Format:
Access key ID,Secret access key,staging path,region
XXX,YYY,ZZZ,us-east-1
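As a sanity check that the credentials and staging path work, here is a minimal sketch, not the repo's run_build_corpus(), of issuing a query against the Common Crawl columnar index with boto3; the query, table name, and crawl label are placeholders based on the linked tutorial:

import csv

import boto3

# Read the credentials.csv described above and start an Athena query against
# the Common Crawl index. The table name and crawl label follow the linked
# tutorial and are placeholders, not this repo's actual query.
with open("credentials.csv") as f:
    creds = next(csv.DictReader(f))

athena = boto3.client(
    "athena",
    aws_access_key_id=creds["Access key ID"],
    aws_secret_access_key=creds["Secret access key"],
    region_name=creds["region"],
)

query = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2021-31' AND subset = 'warc'
LIMIT 10
"""

athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": creds["staging path"]},
)
# Athena writes the query results as CSV to the staging path from credentials.csv.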
The function run_build_corpus() in common_crawl.py connects to AWS Athena and crawls the data. The code to convert HTML pages to text and sanitize the crawled data is in html_to_text.py.
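The conversion step works roughly along these lines; this is an illustrative sketch only, and html_to_text.py in the repo is the authoritative implementation:

from bs4 import BeautifulSoup

# Illustrative only: strip markup and non-visible content from a crawled page
# and collapse whitespace, roughly what an HTML-to-text step needs to do.
def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    return " ".join(text.split())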