Best pdf extractor I have seen, but still not accurate enough #170

Closed
Crestina2001 opened this issue Jun 4, 2024 · 2 comments

Comments

@Crestina2001

Thanks for your great work! It still has some problems, though. I have a PDF that is not scanned (you can select the words in the file). When I use your method, it recognizes 'benefit' as 'benets'. Strangely, Foxit PDF Editor makes the same mistake, but pymupdf works fine, so the cause may be an issue in some specific package.

In addition, there are still some issues with tables: after running the pipeline, you still need to adjust the tables manually in the Markdown to make sure they are correct. I don't have any ideas for how this could be improved. Even deciding where to put the bounding box for table extraction is intimidating for me.
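To narrow down misreadings like 'benefit' becoming 'benets', one option is to compare the words in the pipeline's output against a second extraction of the same page (for example, text obtained with pymupdf). The sketch below assumes both texts are already available as plain strings; the function name `words_missing_from` and the sample strings are made up for illustration and are not part of any library.

```python
import re
from collections import Counter


def words_missing_from(candidate_text, reference_text):
    """Return words that occur more often in reference_text than in candidate_text."""
    def tokenize(text):
        # Lowercase and keep runs of letters only (digits and punctuation are dropped).
        return re.findall(r"[^\W\d_]+", text.lower())

    # Counter subtraction keeps only words with a positive surplus in the reference.
    surplus = Counter(tokenize(reference_text)) - Counter(tokenize(candidate_text))
    return sorted(surplus)


# Hypothetical sample strings standing in for the two extractions.
reference = "The benefit of the plan"
candidate = "The benets of the plan"
print(words_missing_from(candidate, reference))
```

Words that show up only in the reference extraction are good candidates for characters the other tool dropped, which helps decide whether forcing OCR is worth trying.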

@VikParuchuri
Owner

VikParuchuri commented Jun 17, 2024

You can set OCR_ALL_PAGES=true to force OCR. Some PDFs have already had OCR run on them and the resulting text added (which is why you can select it), and that text can be bad if the OCR engine was not good.
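A minimal sketch of applying that setting from Python, assuming OCR_ALL_PAGES is read from the process environment as the comment suggests. The helper names `build_forced_ocr_env` and `run_with_forced_ocr` are hypothetical, and the command you pass in is whatever conversion command you already use; nothing here is taken from the project's own code.

```python
import os
import subprocess


def build_forced_ocr_env(base_env=None):
    """Copy an environment mapping and set OCR_ALL_PAGES=true in the copy."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: the tool picks this setting up from an environment variable.
    env["OCR_ALL_PAGES"] = "true"
    return env


def run_with_forced_ocr(command):
    """Run a conversion command (a list of arguments) with forced OCR enabled."""
    return subprocess.run(command, env=build_forced_ocr_env(), check=True)
```

Building the environment in a separate function keeps the current process's environment untouched, so forced OCR applies only to the one run that needs it.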

@VikParuchuri
Owner

Tables should be much better now.
