Add image label to output MD file. #52

Open: wants to merge 6 commits into base: master

Conversation

tungsten106

  • Uses the PyMuPDF package to extract image bounding boxes, sorts them by y-position, and adds a Markdown-formatted image label as text to the output Markdown file;
  • Image data is saved in the metadata.json file under the key "image" as a Dict with the format {img_path: img_byte_content}; each image can then be written to its path by convert_single.py.
  • Not all pictures in a PDF can be identified (such as the image on page 2 of Multi-column CNN), as noted by @yachty66. Technically that is not a picture but an image formed from text boxes, arrows, etc. I am unsure how to resolve this at the moment.
    Hope this helps :)
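
For readers who want a concrete picture of the approach described above, here is a minimal sketch, not the code from this PR. It assumes PyMuPDF is installed for the extraction step (imported lazily as `fitz`); the names `ImageEntry`, `extract_entries`, `sort_reading_order`, `markdown_label`, and `save_images`, as well as the `images/...` path scheme, are hypothetical and chosen only for illustration.

```python
from pathlib import Path
from typing import NamedTuple


class ImageEntry(NamedTuple):
    page: int    # 0-based page index
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates
    path: str    # hypothetical relative output path
    data: bytes  # raw image bytes


def extract_entries(pdf_path: str) -> list[ImageEntry]:
    """Collect embedded images and their bounding boxes with PyMuPDF."""
    import fitz  # PyMuPDF; imported here so the helpers below work without it

    entries = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for info in page.get_image_info(xrefs=True):
                xref = info["xref"]
                if xref == 0:  # inline image without an xref; skipped in this sketch
                    continue
                img = doc.extract_image(xref)
                # Assumed naming scheme, not the one used by the PR.
                path = f"images/page{page.number}_xref{xref}.{img['ext']}"
                entries.append(
                    ImageEntry(page.number, tuple(info["bbox"]), path, img["image"])
                )
    return entries


def sort_reading_order(entries: list[ImageEntry]) -> list[ImageEntry]:
    """Sort by page, then by top edge (y0), then by left edge (x0)."""
    return sorted(entries, key=lambda e: (e.page, e.bbox[1], e.bbox[0]))


def markdown_label(entry: ImageEntry) -> str:
    """Build a Markdown image label that points at the saved file."""
    return f"![{Path(entry.path).name}]({entry.path})"


def save_images(images: dict[str, bytes], out_dir: str) -> list[Path]:
    """Write a {img_path: img_byte_content} mapping to disk under out_dir."""
    written = []
    for rel_path, data in images.items():
        target = Path(out_dir) / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        written.append(target)
    return written
```

Sorting on the top edge alone is a simplification: in a multi-column layout, an image in the right column can sort ahead of text that precedes it in reading order, so a real implementation would probably need column information as well.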

@VikParuchuri
Owner

@tungsten106 Thanks so much for this! It was on my list of functionality to add soon. I'll take a look next week (after the holiday).

@VikParuchuri
Owner

@tungsten106 I'd love to review this, but the diffs seem to have issues (entire file is shown as deleted, with all the lines also shown as added). I'm having a hard time seeing what was changed. Do you know why this is happening with the diffs?

@tungsten106
Author

> @tungsten106 I'd love to review this, but the diffs seem to have issues (entire file is shown as deleted, with all the lines also shown as added). I'm having a hard time seeing what was changed. Do you know why this is happening with the diffs?

This is probably caused by the end-of-line sequence setting in VS Code on Windows. I have changed it from CRLF back to LF, so the diff should work now.
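
To illustrate why a CRLF/LF mismatch makes a whole file show up as deleted and re-added: every line then differs by a trailing `\r` byte, so a line-based diff treats all of them as changed. The following is a small sketch for checking and normalizing line endings; the function names are hypothetical, and this is not what was done in the PR (there the editor setting was changed).

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF and bare LF line endings in raw file content."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # every CRLF also contains one LF
    return {"crlf": crlf, "lf": lf}


def to_lf(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving existing LF endings alone."""
    return data.replace(b"\r\n", b"\n")
```

Working on bytes rather than text avoids Python's newline translation, which would otherwise hide the difference when the file is read in text mode.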

@OmriNach

OmriNach commented Jan 4, 2024

Following to know when this is implemented. With GPT-4V out, the focus is on multimodal retrieval systems. Since marker outperforms most PDF readers, adding images would make it very valuable as a general-purpose PDF loader for such systems.

@morizin

morizin commented Jan 24, 2024

> Not all pictures in a PDF can be identified (such as the image on page 2 of Multi-column CNN), as noted by @yachty66. Technically that is not a picture but an image formed from text boxes, arrows, etc. I am unsure how to resolve this at the moment.

Why can't we do something like get the box and screenshot that part and add
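
A rough sketch of what "screenshot that part" could look like with PyMuPDF, assuming the bounding box of the diagram is already known. Finding that box for a figure built from text boxes and arrows is the hard part and is not addressed here. The names `padded_clip` and `screenshot_region`, the margin, and the DPI value are hypothetical choices for illustration.

```python
def padded_clip(bbox, page_bounds, margin=4.0):
    """Grow bbox by margin on every side, clamped to the page bounds."""
    x0, y0, x1, y1 = bbox
    px0, py0, px1, py1 = page_bounds
    return (
        max(px0, x0 - margin),
        max(py0, y0 - margin),
        min(px1, x1 + margin),
        min(py1, y1 + margin),
    )


def screenshot_region(pdf_path, page_number, bbox, out_path, dpi=150):
    """Render only the given region of a page to an image file."""
    import fitz  # PyMuPDF; imported lazily so padded_clip works without it

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        clip = fitz.Rect(*padded_clip(bbox, tuple(page.rect)))
        pix = page.get_pixmap(clip=clip, dpi=dpi)  # rasterize the clipped area
        pix.save(out_path)
    return out_path
```

Rendering the region captures vector drawings and text as pixels, so it would also cover figures that are not stored as embedded images, at the cost of a fixed resolution.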

@CBIhalsen

After adding image support, please also consider adding a translation function to the project, plus the ability to right-click an image and have GPT-4-vision answer questions about it. That would make a great essay tool.

@catalystK

catalystK commented Mar 12, 2024

Is the image extraction feature included in the latest code? As of today, I cloned the master branch (as there is no release) and ran it, but I couldn't get the images in the output .md file. I thought the MD file would have the images embedded in it, but I didn't find any.
Should I set a variable to extract images and embed them into the output MD file?

Is this feature upcoming?

Also, is there any way I can run this on Hugging Face and deploy it there? Could you create something similar, some remote solution?

6 participants