E-commerce data assistant: Injecting knowledge in seq2seq SQL generation

This code is based on:

@inproceedings{Scholak2021:PICARD,
  author = {Torsten Scholak and Nathan Schucher and Dzmitry Bahdanau},
  title = {PICARD - Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models},
  booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  year = {2021},
  publisher = {Association for Computational Linguistics},
}

Prerequisites

This repository uses git submodules. Clone it like this:

$ git clone git@github.com:elena-soare/picard.git
$ cd picard
$ git submodule update --init --recursive

Then build the Docker image:

$ make build-eval-image

Warning: building the image takes around 2 hours the first time.

Training

The training script is located in seq2seq/run_seq2seq.py.

Before running it, build the Docker image using the Prerequisites command above, then point the 'docker run' invocation in the Makefile at your image. After that, you can simply run:

$ make train

The model will be trained on the Spider dataset by default.

To train on Datasaur instead, set the 'model_name_or_path' field in configs/train.json to 'datasaur'.

To enable Foreign Keys Serialization, set 'schema_serialization_with_foreign_keys' to true in the same config file.

To enable Schema Augmentation, set 'schema_serialization_with_db_description' to true in the same config file.
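
For orientation, a minimal configs/train.json excerpt combining these options might look as follows. This is a sketch, not a complete config: the real file contains many more settings, and the exact 'model_name_or_path' key spelling follows the usual Hugging Face convention rather than being confirmed here.

{
  "model_name_or_path": "datasaur",
  "schema_serialization_with_foreign_keys": true,
  "schema_serialization_with_db_description": true
}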

Evaluation

The evaluation script is located in seq2seq/run_seq2seq.py. You can run it with:

$ make eval

By default, the evaluation will be run on the Spider evaluation set.

The default configuration is stored in configs/eval.json.

To enable Foreign Keys Serialization, set "schema_serialization_with_foreign_keys" to true and set the model name or path to the Hugging Face fine-tuned model 'elena-soare/bat-fk-base'.

To run the Pre-trained E-commerce model, keep the baseline configuration and set the model name or path to the Hugging Face checkpoint (the default one).

To enable Schema Augmentation, set "schema_serialization_with_db_description" to true and set the model name or path to the corresponding Hugging Face fine-tuned model.
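
For example, the foreign-key variant of configs/eval.json would pair the flag with the fine-tuned checkpoint named above. A sketch only, with other fields omitted and the 'model_name_or_path' key name assumed:

{
  "model_name_or_path": "elena-soare/bat-fk-base",
  "schema_serialization_with_foreign_keys": true
}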

Serving

A trained model can be served using the seq2seq/serve_seq2seq.py script.

The configuration file can be found in configs/serve.json.

You can start serving with:

$ make serve
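
Once the server is up, you can send it a question over HTTP. The endpoint shape below is carried over from the upstream PICARD project and is an assumption; 'world_1' is one of the Spider databases:

$ # endpoint shape http://localhost:8000/ask/{db_id}/{question} assumed from upstream PICARD
$ curl -s http://localhost:8000/ask/world_1/how%20many%20countries%20are%20there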

Pre-training Script

The pre-training script is located in /picard/pre-training/script.ipynb. To run it, place the crawled-data file (https://huggingface.co/datasets/elena-soare/crawled-ecommerce/blob/main/train.json) in the same directory.
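
For example, you can fetch the file directly, assuming the standard Hugging Face direct-download URL scheme (blob/ replaced by resolve/):

$ wget https://huggingface.co/datasets/elena-soare/crawled-ecommerce/resolve/main/train.json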

The files used to crawl e-commerce Common Crawl data are in /picard/pre-training/crawling_data/. To run them, you need an AWS account, an AWS Athena columnar index over the Common Crawl data (see the tutorial at https://skeptric.com/common-crawl-index-athena/), and a file with your authentication credentials.

Build a virtualenv and install the requirements:

$ pip3 install -r requirements
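
If you still need to create and activate the virtualenv itself, a minimal sketch using Python's built-in venv module (the environment name '.venv' is illustrative):

$ python3 -m venv .venv   # any environment name works
$ source .venv/bin/activate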

Paste credentials.csv with your AWS credentials into the parent directory of complex_task_search (the region should be us-east-1 to be co-located with the Common Crawl data). Format:

Access key ID,Secret access key,staging path,region
XXX,YYY,ZZZ,us-east-1

The function run_build_corpus() in common_crawl.py connects to AWS Athena and crawls the data. The code for converting HTML pages to text and sanitizing the crawled data is in html_to_text.py.
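
As a minimal driver sketch, assuming you run it from /picard/pre-training/crawling_data/ and that run_build_corpus() takes no required arguments (check common_crawl.py for the actual signature):

# Hypothetical driver for the corpus build. Requires credentials.csv
# in place as described above. run_build_corpus() is assumed to take
# no arguments; see common_crawl.py for its real signature.
from common_crawl import run_build_corpus

if __name__ == "__main__":
    run_build_corpus()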
