Thanks for your great work! But it still has some problems. I have a PDF that is not scanned (you can select the words in the file). When I use your method, it recognizes 'benefit' as 'benets'. Strangely, Foxit PDF Editor does the same thing, but pymupdf extracts the text fine. So the problem may come from some specific packages.
In addition, there are still some issues with tables. After running the pipeline, you still need to adjust the tables manually in the markdown to make sure they are correct. I don't have any ideas for how this could be improved; just deciding where to put the bounding box for table extraction is intimidating for me.
You can set OCR_ALL_PAGES=true to force OCR. Some PDFs have already had OCR run on them and the resulting text added as a layer (which is why you can select it), and that text can be bad if the OCR engine was not good.
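A quick way to check whether the embedded text is the culprit: 'benefit' turning into 'benets' looks like a dropped "fi" ligature, though that is only a guess from the symptom and not something confirmed for this PDF. The sketch below is plain Python and not part of this project; `find_ligatures` and `expand_ligatures` are hypothetical helper names. It assumes you already have the extracted text as a string, for example from pymupdf.

```python
import unicodedata

# Latin ligatures in the Unicode "Alphabetic Presentation Forms" block:
# U+FB00..U+FB06 (ff, fi, fl, ffi, ffl, long s-t, st).
LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)}


def find_ligatures(text: str) -> list[str]:
    """Return the distinct ligature characters present in text, sorted."""
    return sorted(set(text) & LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace each ligature character with its plain-letter expansion.

    Only the ligature characters are normalized (NFKC), so other
    characters in the text are left exactly as they were.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if ch in LIGATURES else ch
        for ch in text
    )


if __name__ == "__main__":
    sample = "bene\ufb01t"  # "benefit" written with the single "fi" ligature character
    print(find_ligatures(sample) != [])  # True if any ligature is present
    print(expand_ligatures(sample))
```

If `find_ligatures` reports characters for text that one extractor returns while another tool shows the letters missing, the text layer itself is probably fine and the loss happens in the tool that drops those characters. If the text layer is already wrong, forcing OCR as described above is the more likely fix.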