This suite of software will prepare a data set called "CORD-19" for processing with the Distant Reader.
CORD-19 is a set of more than 50,000 full-text scholarly journal articles on the topic of COVID-19. Each "article" is really a JSON file containing (very) rudimentary bibliographic information, a set of paragraphs, and bibliographic citations. As a pre-processing step for the Distant Reader, the suite processes the CORD-19 metadata and its associated JSON files.
To get this software to work for you, run pip install -r requirements.txt, configure ./bin/cache.sh, and then run ./bin/build.sh. The system will then:
- download a zip file and its associated metadata file
- uncompress the zip file
- move all the JSON files to a single directory
- initialize a database
- pour the metadata into the database
- output a simple narrative report summarizing the content of the metadata file
Depending on the network connection, the build process takes less than 7 minutes.
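For readers who want a feel for what the build does, the unzip, move, and database steps can be sketched in Python roughly as follows. This is not the code in ./bin/build.sh; it is a minimal illustration that assumes the metadata file is a CSV with a header row and that the database is SQLite. The function names and the table name "documents" are hypothetical.

```python
import csv
import shutil
import sqlite3
import tempfile
import zipfile
from pathlib import Path


def unzip_and_flatten(zip_path, json_dir):
    """Extract a zip file and move every JSON file into one flat directory."""
    json_dir = Path(json_dir)
    json_dir.mkdir(parents=True, exist_ok=True)
    moved = 0
    # Extract into a throwaway directory so only the JSON files are kept.
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(scratch)
        # Collect the paths first, then move; files sharing a name collide.
        for source in sorted(Path(scratch).rglob("*.json")):
            shutil.move(str(source), str(json_dir / source.name))
            moved += 1
    return moved


def load_metadata(csv_path, db_path):
    """Create a table from the CSV header and pour every row into it."""
    connection = sqlite3.connect(str(db_path))
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # Assumption: every column is stored as text, one column per field.
        columns = ", ".join(f'"{name}" TEXT' for name in header)
        marks = ", ".join("?" for _ in header)
        with connection:  # commits on success, rolls back on error
            connection.execute("DROP TABLE IF EXISTS documents")
            connection.execute(f"CREATE TABLE documents ({columns})")
            connection.executemany(
                f"INSERT INTO documents VALUES ({marks})", reader
            )
    count = connection.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
    connection.close()
    return count
```

Calling unzip_and_flatten() and then load_metadata() returns the number of JSON files moved and the number of metadata rows loaded, which is enough raw material for a very short narrative report.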
The next steps are the creation of two scripts:
- Given an SQL SELECT statement, return a list of keys, and use them to initialize a Distant Reader study carrel
- Given a JSON file, output a more human-readable version of the same
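As a starting point, the two planned scripts might look something like the sketch below. It is an illustration only, not the eventual implementation. It assumes an SQLite database whose SELECT statement returns the key in the first column, and it assumes each JSON file keeps its title under metadata.title and its paragraphs under body_text[].text; those field names should be checked against the actual files. The function names are hypothetical, and initializing a study carrel from the keys is left out.

```python
import json
import sqlite3


def select_keys(db_path, query):
    """Run a SELECT statement and return the first column of each row."""
    connection = sqlite3.connect(str(db_path))
    try:
        return [row[0] for row in connection.execute(query)]
    finally:
        connection.close()


def render_article(json_path):
    """Return the title and paragraphs of one JSON file as plain text."""
    with open(json_path, encoding="utf-8") as handle:
        article = json.load(handle)
    # Assumed field names; missing fields fall back to harmless defaults.
    title = article.get("metadata", {}).get("title") or "(untitled)"
    paragraphs = [block.get("text", "") for block in article.get("body_text", [])]
    return "\n\n".join([title] + paragraphs)
```

The list returned by select_keys() could then be used to pick the matching JSON files, and render_article() gives a quick way to read any one of them.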
Wish us luck.
Eric Lease Morgan <[email protected]>
May 14, 2020