Scripts to scrape and post-process the PDFs of the Parlament de Catalunya
This repository consists of a set of tools to process the plenary sessions of the Parlament de Catalunya.
The scripts can:
- Facilitate metadata retrieval from the website of the Parlament de Catalunya
- Convert the PDFs of the parliamentary sessions into XML and structure them into dictionaries
- Match the session metadata from the audiovisual recordings with the structured data extracted from the PDF text,
in order to output a set of JSON files, each containing the matched session, speaker, text and media URL.
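
To illustrate the matching step, here is a minimal Python sketch. It is not the repository's actual code: the function names (`normalize`, `match_session`), the field names (`speaker`, `text`, `media_url`), the example URLs and the match-by-speaker-in-order strategy are all assumptions made for illustration.

```python
import json


def normalize(name):
    """Lower-case and collapse whitespace so speaker names compare loosely."""
    return " ".join(name.lower().split())


def match_session(session_id, clips, interventions):
    """Pair PDF interventions with media clips by speaker, preserving order.

    Both inputs are hypothetical structures:
      clips:         [{"speaker": str, "media_url": str}, ...]  (website metadata)
      interventions: [{"speaker": str, "text": str}, ...]       (PDF text)
    """
    matched = []
    clip_index = 0
    for intervention in interventions:
        # Look ahead for the next unused clip by the same speaker.
        for i in range(clip_index, len(clips)):
            if normalize(clips[i]["speaker"]) == normalize(intervention["speaker"]):
                matched.append({
                    "session": session_id,
                    "speaker": intervention["speaker"],
                    "text": intervention["text"],
                    "media_url": clips[i]["media_url"],
                })
                clip_index = i + 1
                break
        # Interventions without a matching clip are skipped.
    return matched


# Made-up example data; real sessions would come from the scraped metadata and PDFs.
clips = [
    {"speaker": "Speaker A", "media_url": "https://example.org/clip1.mp4"},
    {"speaker": "Speaker B", "media_url": "https://example.org/clip2.mp4"},
]
interventions = [
    {"speaker": "speaker a", "text": "First intervention."},
    {"speaker": "Speaker C", "text": "No clip for this one."},
    {"speaker": "Speaker B", "text": "Second intervention."},
]

records = match_session("session-001", clips, interventions)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

In this sketch, an intervention with no corresponding clip is dropped rather than emitted with an empty media URL, so every output record carries all four fields.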
Detailed documentation for using the scripts has not yet been prepared.
These scripts were developed to create the ParlamentParla corpus, which was made possible by the support of the Culture Department of the Catalan autonomous government.