Best pdf extractor I have seen, but still not accurate enough #170

Crestina2001 · 2024-06-04T09:07:21Z

Thanks for your great work! It still has some problems, though. I have a PDF that is not scanned (you can select the words in the file). When I use your method, it recognizes 'benefit' as 'benets'. Strangely, Foxit PDF Editor does the same thing, but pymupdf works fine, so the cause may lie in some specific packages.
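One plausible cause of 'benefit' turning into 'benets', which this thread does not confirm, is that the PDF stores "fi" as a single ligature glyph (U+FB01) and some extractors drop or mis-map it. The sketch below is a minimal, standard-library-only check under that assumption: it reports ligature characters in already-extracted text and expands them. The helper names `find_ligatures` and `expand_ligatures` are illustrative and do not belong to any library mentioned here.

```python
import unicodedata

# Latin ligature code points from the Unicode "Alphabetic Presentation Forms" block.
LIGATURES = {"\ufb00", "\ufb01", "\ufb02", "\ufb03", "\ufb04", "\ufb05", "\ufb06"}


def find_ligatures(text: str) -> list[str]:
    """Return the distinct ligature characters present in text, sorted."""
    return sorted(set(text) & LIGATURES)


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters such as U+FB01 into plain letters.

    NFKC also rewrites other compatibility characters (for example
    superscript digits), so apply it only where that is acceptable.
    """
    return unicodedata.normalize("NFKC", text)


# "benefits" written with the single-character "fi" ligature (assumed input).
sample = "bene\ufb01ts"
print(find_ligatures(sample))
print(expand_ligatures(sample))
```

This only helps when the ligature character survives in the extracted text. If the extractor has already dropped it, leaving 'benets', the text layer itself has to be re-read, for example with a different extractor or by forcing OCR.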

In addition, there are still some issues with tables. After running the pipeline, you still need to adjust the tables manually in the markdown to make sure they are correct. I don't have any ideas for how this could be improved; just deciding where to put the bounding box for table extraction is intimidating for me.

VikParuchuri · 2024-06-17T16:16:14Z

You can set OCR_ALL_PAGES=true to force OCR. Some PDFs have already had OCR run on them and the resulting text added (which is why you can select it), and that text can be bad if the OCR engine was not good.
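As a rough sketch of the suggestion above, assuming OCR_ALL_PAGES is read from the process environment, the setting can be prepared from Python before launching a conversion. The helper name `env_with_forced_ocr` is hypothetical, and the conversion command itself is deliberately left out because the thread does not state it.

```python
import os


def env_with_forced_ocr(base=None):
    """Return a copy of the environment with OCR_ALL_PAGES set to "true".

    The input mapping is copied, so neither `base` nor os.environ is modified.
    """
    env = dict(os.environ if base is None else base)
    env["OCR_ALL_PAGES"] = "true"
    return env


env = env_with_forced_ocr()
print(env["OCR_ALL_PAGES"])

# Assumed usage: pass `env` to subprocess.run(<your conversion command>, env=env)
# so that only the child process sees the setting.
```

Copying the environment keeps the setting scoped to a single run, which makes it easy to compare output with and without forced OCR on the same file.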

VikParuchuri · 2024-10-18T15:57:25Z

Tables should be much better now.

VikParuchuri closed this as completed Oct 18, 2024
