USPTO patents dataset generator.
Installation:
sudo yum install python-devel libxslt-devel libxml2-devel
pip install patent-parsing-tools
Downloading the dataset:
python -m patent_parsing_tools.downloader \
--directory dataset \
--year-from 2010 \
--year-to 2010
Collecting and serializing data:
python -m patent_parsing_tools.supervisor \
--working-directory patents/working_directory \
--train-destination patents/train_destination \
--test-destination patents/test_destination \
--year-from 2014 \
--year-to 2015
Generating the dictionary from the training set:
python -m patent_parsing_tools.bow.dictionary_maker \
--train-directory patents/train_destination \
--max-patents 1000000000 \
--dictionary dictionary.txt \
--dict-max-size 4096
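
As an illustration of what a size-capped dictionary means, the sketch below keeps only the most frequent tokens from a small corpus. It is a minimal example under stated assumptions, not the implementation used by `patent_parsing_tools.bow.dictionary_maker`: the tokenizer, the tie-breaking rule, and the names `tokenize` and `build_dictionary` are hypothetical, and the real format of `dictionary.txt` is not described here.

```python
import re
from collections import Counter

# Assumption: tokens are runs of lowercase ASCII letters. The real tool may
# tokenize differently (stemming, stop words, and so on).
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def build_dictionary(documents, max_size):
    """Return up to max_size tokens, most frequent first (hypothetical helper)."""
    counts = Counter()
    for text in documents:
        counts.update(tokenize(text))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_size]]


docs = [
    "A method for parsing patent documents",
    "Patent documents describe a method and an apparatus",
]
dictionary = build_dictionary(docs, max_size=4)
print(dictionary)
```

The `max_size` argument plays the role that `--dict-max-size` appears to play on the command line: it bounds the vocabulary, and therefore the length of every bag-of-words vector built from it.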
Generating bag-of-words representations for the training set and the test set:
python -m patent_parsing_tools.bow.bag_of_words \
--serialized-patents patents/train_destination \
--destination-directory patents/final_dataset_train \
--dictionary dictionary.txt \
--batch-size 1048576
python -m patent_parsing_tools.bow.bag_of_words \
--serialized-patents patents/test_destination \
--destination-directory patents/final_dataset_test \
--dictionary dictionary.txt \
--batch-size 1048576
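
The bag-of-words step can be pictured as mapping each document to a vector of word counts over the dictionary and grouping those vectors into batches. The sketch below is only an illustration of that idea, assuming a simple lowercase tokenizer; the helper names `to_bag_of_words` and `batches` are hypothetical, and the output format actually written by `patent_parsing_tools.bow.bag_of_words` is not covered here.

```python
import re

# Assumption: same simple tokenizer idea as a plain lowercase word split.
TOKEN_RE = re.compile(r"[a-z]+")


def to_bag_of_words(text, index):
    """Count how often each dictionary word occurs in the text."""
    vector = [0] * len(index)
    for token in TOKEN_RE.findall(text.lower()):
        position = index.get(token)
        if position is not None:  # words outside the dictionary are ignored
            vector[position] += 1
    return vector


def batches(documents, dictionary, batch_size):
    """Yield lists of at most batch_size bag-of-words vectors."""
    index = {word: i for i, word in enumerate(dictionary)}
    batch = []
    for text in documents:
        batch.append(to_bag_of_words(text, index))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


dictionary = ["patent", "method", "apparatus"]
docs = ["A patent for a method", "Method and apparatus", "Unrelated text"]
result = list(batches(docs, dictionary, batch_size=2))
print(result)
```

Using the same dictionary for both the training and the test set, as the two commands above do, keeps the vector positions aligned between the two outputs.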
Running the tests:
pytest
Setting up a development environment:
$ mkvirtualenv ppt
$ workon ppt
(ppt) $ pip install -r requirements.txt
Tagging a release:
$ git tag v1.0
$ git push origin v1.0
Building the documentation:
(ppt) $ sphinx-build -M html docs docs_build
Usage in publications:
- Elton, Using natural language processing techniques to extract information on the properties and functionalities of energetic materials from large text corpora, 2019, online: https://arxiv.org/abs/1903.00415.
- Lee, Natural Language Processing Techniques for Advancing Materials Discovery: A Short Review, 2023, online: https://doi.org/10.1007/s40684-023-00523-6.
The MIT License (MIT). Copyright (c) 2014 Michał Dul, Piotr Przetacznik, Krzysztof Strojny. Check LICENSE files for more information.