Prevent error loading pdf: Added pymupdf/fitz open to get page count #225

Merged on Jan 22, 2024 (1 commit)

ekcomputer
Contributor

When adding some PDFs to the Docs class, I often hit the error below (in case it matters, the PDFs were returned by zotero.iterate). On closer inspection, pypdf was failing to parse the PDF: pypdf.PdfReader returned a reader whose pages could not be counted with len, which utils.count_pdf_pages requires.

Although paper-qa reads PDF text with fitz (PyMuPDF) by default, utils.count_pdf_pages is written only for pypdf. This pull request simply changes it to use fitz as well.
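
For readers who want to see the shape of the change, here is a minimal sketch of a fitz-based page counter. It is not the merged diff. The `open_document` parameter is a hypothetical hook added only so the function can be exercised without a real PDF; it is not part of paper-qa. The sketch assumes PyMuPDF is installed and importable as `fitz`.

```python
from pathlib import Path
from typing import Callable, Optional, Union


def count_pdf_pages(
    file_path: Union[str, Path],
    open_document: Optional[Callable] = None,
) -> int:
    """Return the number of pages in a PDF, using PyMuPDF by default.

    `open_document` is a hypothetical hook for this sketch (not paper-qa API):
    any callable that takes a path and returns a context manager exposing
    `page_count`.
    """
    if open_document is None:
        # Imported lazily so this module still loads without PyMuPDF installed.
        import fitz  # PyMuPDF

        open_document = fitz.open
    # PyMuPDF documents can be used as context managers and expose `page_count`.
    with open_document(str(file_path)) as doc:
        return doc.page_count
```

With PyMuPDF installed, `count_pdf_pages("paper.pdf")` opens the file through `fitz.open`, so the pypdf code path that raised `KeyError: '/Root'` is not used. Whether fitz can open every PDF that pypdf rejects is not something this sketch guarantees.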

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[15], line 3
      1 ## Try again with only the Zotero part
      2 zotero = ZoteroDB(library_type="user")  # "group" if group library
----> 3 for item in zotero.iterate(collection_name='Modeling', limit=50):
      4     # print(f'Error parsing page count for: {item.title}')
      5     try:
      6         print(f' >    {item.title}')

File ~/mambaforge/envs/paperqa/lib/python3.11/site-packages/paperqa/contrib/zotero.py:259, in ZoteroDB.iterate(self, limit, start, q, qmode, since, tag, sort, direction, collection_name)
    253 title = item["data"]["title"] if "title" in item["data"] else ""
    254 if len(items) >= start:
    255     yield ZoteroPaper(
    256         key=_get_citation_key(item),
    257         title=title,
    258         pdf=pdf,
--> 259         num_pages=count_pdf_pages(pdf),
    260         details=item,
    261         zotero_key=item["key"],
    262     )
    263     actual_i += 1
    265 items.append(item)

File ~/mambaforge/envs/paperqa/lib/python3.11/site-packages/paperqa/utils.py:68, in count_pdf_pages(file_path)
     66 with open(file_path, "rb") as pdf_file:
     67     pdf_reader = pypdf.PdfReader(pdf_file)
---> 68     num_pages = len(pdf_reader.pages)
     69 return num_pages

File ~/mambaforge/envs/paperqa/lib/python3.11/site-packages/pypdf/_page.py:2493, in _VirtualList.__len__(self)
   2492 def __len__(self) -> int:
-> 2493     return self.length_function()

File ~/mambaforge/envs/paperqa/lib/python3.11/site-packages/pypdf/_reader.py:462, in PdfReader._get_num_pages(self)
    460 else:
    461     if self.flattened_pages is None:
--> 462         self._flatten()
    463     return len(self.flattened_pages)

File ~/mambaforge/envs/paperqa/lib/python3.11/site-packages/pypdf/_reader.py:1228, in PdfReader._flatten(self, pages, inherit, indirect_reference)
   1224     inherit = {}
   1225 if pages is None:
   1226     # Fix issue 327: set flattened_pages attribute only for
   1227     # decrypted file
-> 1228     catalog = self.trailer[TK.ROOT].get_object()
   1229     pages = catalog["/Pages"].get_object()  # type: ignore
   1230     self.flattened_pages = []

File ~/mambaforge/envs/paperqa/lib/python3.11/site-packages/pypdf/generic/_data_structures.py:333, in DictionaryObject.__getitem__(self, key)
    332 def __getitem__(self, key: Any) -> PdfObject:
--> 333     return dict.__getitem__(self, key).get_object()

KeyError: '/Root'

@whitead
Collaborator

whitead commented Jan 22, 2024

Awesome! Great find, thanks!

@whitead whitead merged commit 68e92e8 into Future-House:main Jan 22, 2024
1 check failed