This CLI tool automatically generates PDF bookmarks (also known as an 'outline' or a 'table of contents') for computer-generated PDF documents.
You can install it globally via pip:
pip install --user pdf_scout
pdf_scout ./my_document.pdf
pip uninstall pdf_scout
This project is a work in progress and will likely only generate suitable bookmarks for documents that conform to the following requirements:
- Single column of text (not multiple columns)
- Font size of header text > font size of body text
- Header text is justified or left-aligned
- Paragraph spacing for headers > body text paragraph spacing
- Consistent left margins on every page
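The requirements above amount to a simple layout heuristic. The following is a minimal sketch of that idea, not pdf_scout's actual implementation: the `Line` class, the `guess_headers` function and the `margin_tolerance` value are all hypothetical names chosen for illustration. It assumes each line of text has already been extracted with its font size, left edge and the gap above it.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical representation of one extracted line of text."""
    text: str
    font_size: float   # in points
    left: float        # x-coordinate of the line's left edge
    gap_above: float   # vertical space between this line and the previous one


def guess_headers(lines: List[Line], margin_tolerance: float = 1.0) -> List[Line]:
    """Return the lines that look like headers under the requirements above."""
    if not lines:
        return []
    # Treat the most common font size and gap as the body text's values.
    body_size = Counter(round(l.font_size, 1) for l in lines).most_common(1)[0][0]
    body_gap = Counter(round(l.gap_above, 1) for l in lines).most_common(1)[0][0]
    # With consistent margins, the smallest left edge is the left margin.
    left_margin = min(l.left for l in lines)

    return [
        l for l in lines
        if l.font_size > body_size                          # larger font than body
        and l.gap_above > body_gap                          # more spacing than body
        and abs(l.left - left_margin) <= margin_tolerance   # starts at the margin
    ]


sample = [
    Line("Introduction", 14.0, 72.0, 24.0),
    Line("This is body text.", 11.0, 72.0, 12.0),
    Line("It continues here.", 11.0, 72.0, 12.0),
    Line("Background", 14.0, 72.0, 24.0),
    Line("More body text.", 11.0, 72.0, 12.0),
    Line("And still more.", 11.0, 72.0, 12.0),
]
print([l.text for l in guess_headers(sample)])
```

A document that breaks any of these assumptions (for example, headers set in the same size as the body text, or a two-column layout where the left edge varies) would defeat a heuristic of this kind, which is why the requirements matter.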
pdf_scout expressly seeks to support the following classes of documents:
- Singapore State Court and Supreme Court Judgments (unreported)
- Singapore Law Reports
- OpenDoc-generated PDFs, such as the State Court Practice Directions 2021 and the Supreme Court Practice Directions 2021 (OpenDoc has been deprecated by GovTech)
It may support other types of documents as well. If a particular class of document isn't supported or does not work well, please open an issue and I will consider adding support for it.
This project manages its dependencies using poetry and supports only Python ^3.9 (that is, 3.9 or later, but below 4.0). After installing poetry and entering the project folder, run the following to install the dependencies:
poetry install
To open a virtualenv in the project folder with the dependencies, run:
poetry shell
To run a script directly, run:
poetry run python ./pdf_scout/app.py <INPUT_FILE_PATH>
To debug using VSCode, start the script under debugpy, which waits for a debugger client to attach on port 5678:
python -m debugpy --listen 0.0.0.0:5678 --wait-for-client ./pdf_scout/app.py
There are snapshot tests. Input PDFs are not provided at the moment, so you will have to populate the /pdf folder manually using the relevant sources (you may want to consider using Clerkent to download the unreported versions of judgments):
poetry run pytest
poetry run pytest --snapshot-update
poetry run mypy pdf_scout/app.py
- Processing a large PDF can take some time, so to iterate faster when debugging a particular behaviour, extract the problematic part of the PDF as a separate file and run the tool on that file instead.
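One possible way to extract a page range is sketched below. It assumes the third-party pypdf library is installed separately (`pip install pypdf`); pypdf is not stated to be a dependency of this project, and the function names `page_indices` and `extract_pages` are hypothetical.

```python
from typing import List


def page_indices(first_page: int, last_page: int, total_pages: int) -> List[int]:
    """Convert an inclusive, 1-based page range into 0-based page indices."""
    if not 1 <= first_page <= last_page <= total_pages:
        raise ValueError(
            f"invalid range {first_page}-{last_page} for {total_pages} pages"
        )
    return list(range(first_page - 1, last_page))


def extract_pages(src: str, dest: str, first_page: int, last_page: int) -> None:
    """Write pages first_page..last_page (inclusive, 1-based) of src to dest."""
    # Assumption: pypdf is available; it is imported here so that the
    # helper above can be used without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in page_indices(first_page, last_page, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dest, "wb") as handle:
        writer.write(handle)
```

For example, `extract_pages("./my_document.pdf", "./excerpt.pdf", 10, 15)` would write pages 10 to 15 to `./excerpt.pdf` (an illustrative file name), which can then be passed to the script as `<INPUT_FILE_PATH>`.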