The INV-CDIP Dataset

Introduction

This is the INV-CDIP dataset used in our paper "Field Extraction from Forms with Unlabeled Data". The dataset contains an unlabeled set of around 200k invoices and a labeled set of 350 invoices. The labeled invoices are annotated with 7 field types: invoice number, purchase order, invoice date, due date, amount due, total amount, and total tax.

Download Dataset

Document ids are stored in train_set.txt and test_set.txt.
You may browse a document using https://www.industrydocuments.ucsf.edu/docs/$document_id.
Use the following commands to install the required packages and download the data automatically.

#install packages
bash install.sh
#download labeled data
python download_data.py --download --split labeled
#download unlabeled data
python download_data.py --download --split unlabeled
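
To browse individual documents instead of downloading everything, the ids can be turned into URLs using the pattern given above. The following is a minimal sketch, not part of the repository: it assumes that train_set.txt and test_set.txt hold one document id per line, and the helper names read_document_ids and browse_url are hypothetical.

```python
from pathlib import Path

# URL pattern taken from this README: .../docs/$document_id
BASE_URL = "https://www.industrydocuments.ucsf.edu/docs/"


def read_document_ids(path):
    """Return the ids listed in an id file.

    Assumption: the file holds one document id per line.
    Blank lines and surrounding whitespace are ignored.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def browse_url(document_id):
    """Build the browse URL for a single document id."""
    return BASE_URL + document_id


if __name__ == "__main__":
    # Print the first few browse URLs of each split, if the file is present.
    for split_file in ("train_set.txt", "test_set.txt"):
        if Path(split_file).exists():
            for doc_id in read_document_ids(split_file)[:5]:
                print(split_file, browse_url(doc_id))
```

If the id files turn out to use a different layout, only read_document_ids needs to change.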

Annotation Description

Annotations are in the ./annotation folder.
- In each json file, field label is annotated in ['Fields']['value']['label'].
- Field value is annotated in ['Fields']['value']['tag'].
- Field value location is annotated in ['Fields']['value']['bbox'].
- The key of a field value is annotated in ['Fields']['key']['tag'].
- Key location is annotated in ['Fields']['key']['bbox'].
Use the following script to visualize the annotations.

#visualize annotations
python download_data.py --vis
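
The annotation layout described above can also be read directly in Python. The following is a minimal sketch under stated assumptions, not the repository's own loader: it assumes that 'Fields' holds either a single entry or a list of entries, each with a 'value' and a 'key' sub-dictionary, and it passes 'bbox' through unchanged because its coordinate format is not specified here. The function names iter_fields and load_fields are hypothetical.

```python
import json


def iter_fields(annotation):
    """Yield one flat record per annotated field.

    Assumption: annotation['Fields'] is a list of entries (or a single
    entry), each with 'value' and 'key' sub-dictionaries as described
    in this README. Missing parts are returned as None.
    """
    fields = annotation.get("Fields", [])
    if isinstance(fields, dict):
        fields = [fields]  # tolerate a single entry stored as a dict
    for entry in fields:
        value = entry.get("value") or {}
        key = entry.get("key") or {}
        yield {
            "label": value.get("label"),      # field label
            "value_text": value.get("tag"),   # field value
            "value_bbox": value.get("bbox"),  # field value location
            "key_text": key.get("tag"),       # key of the field value
            "key_bbox": key.get("bbox"),      # key location
        }


def load_fields(json_path):
    """Read one annotation json file and return its field records."""
    with open(json_path, encoding="utf-8") as f:
        return list(iter_fields(json.load(f)))
```

For example, load_fields called on a json file from the ./annotation folder would return a list of dictionaries that can be filtered by "label".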

Citation

Please cite our paper if you use this dataset.

@article{gao2021field,
  title={Field Extraction from Forms with Unlabeled Data},
  author={Gao, Mingfei and Chen, Zeyuan and Naik, Nikhil and Hashimoto, Kazuma and Xiong, Caiming and Xu, Ran},
  journal={ACL Spa-NLP Workshop},
  year={2022}
}

License

The INV-CDIP dataset is released under CC BY-NC 4.0. The underlying documents to which the dataset refers are from the Tobacco Collections of the Industry Documents Library. Please see Copyright and Fair Use for more information.

Contact

Please contact [email protected] if you have questions.
