[BUG] When file_type is set to all_files, only pdfminer is used #1065

Open
e7217 opened this issue Dec 20, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@e7217
Contributor

e7217 commented Dec 20, 2024

Describe the bug
Hello,

It seems that when the parser module is configured with file_type: all_files, only pdfminer is applied. I tried langchain_parse with upstagedocumentparse, and also llamaparse, and both appear to use pdfminer exclusively. Even with output_format set to html, pdfminer still seems to be used. Am I mistaken about something?

Below is the YAML file I configured:

- module_type: langchain_parse
  parse_method: upstagedocumentparse
  split: page
  file_type: all_files
  output_format: html

or

- module_type: llamaparse
  result_type: markdown
  file_type: all_files
  language: ko

And here is the result:
[Image: result screenshot]

I would appreciate your help. Thank you.

e7217 added the bug label Dec 20, 2024
@e7217
Contributor Author

e7217 commented Dec 20, 2024

I found a way to enable all_files.
To use it, pass all_files=True to start_parsing():

parser.start_parsing(parse_config, all_files=True)

Which of the following is the correct way to enable all_files?

  • start_parsing(... , all_files=True)
  • set file_type to all_files in parser yaml

Additionally, to apply various parse configurations, should I modify the configuration file each time or create a new configuration for each module?

@bwook00
Contributor

bwook00 commented Dec 21, 2024

First, sorry for the late comment (I was on vacation this week🫣)

You're right: to use file_type: all_files, you need to call start_parsing(..., all_files=True)!
I forgot to write this in the Docs, sorry for the confusion!
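A minimal sketch of combining both settings, in case it helps readers of this thread. The `autorag.parser.Parser` import path, its constructor arguments, the top-level `modules:` key, and the helper names `write_parse_config` and `run_parsing` are assumptions for illustration; only `start_parsing(..., all_files=True)` and the module settings come from the comments above.

```python
from pathlib import Path

# Parser YAML mirroring the first configuration from this issue.
# The top-level `modules:` key is an assumption about the full file layout.
PARSE_CONFIG = """\
modules:
  - module_type: langchain_parse
    parse_method: upstagedocumentparse
    split: page
    file_type: all_files
    output_format: html
"""


def write_parse_config(path: str) -> Path:
    """Write the parser YAML to disk and return its path (hypothetical helper)."""
    config_path = Path(path)
    config_path.write_text(PARSE_CONFIG, encoding="utf-8")
    return config_path


def run_parsing(data_glob: str, project_dir: str, config_path: Path) -> None:
    """Run parsing with all_files set in both the YAML and the call."""
    # Imported lazily so the config helper works without the library installed.
    from autorag.parser import Parser  # assumed import path

    # Constructor arguments are assumed, not confirmed in this thread.
    parser = Parser(data_path_glob=data_glob, project_dir=project_dir)
    # all_files=True is the flag the maintainer confirms is required.
    parser.start_parsing(str(config_path), all_files=True)
```

The point of the sketch is that the YAML setting alone is not enough: per the comment above, the flag on `start_parsing` has to be passed as well.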

I designed it so that if you don't put all_files=True, it will parse with pdfminer, which is the default parse method for PDF.
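The fallback described above can be modeled roughly as follows. This is a toy illustration of the stated behavior, not the library's actual code; `effective_parse_method` and `DEFAULT_PDF_METHOD` are hypothetical names.

```python
DEFAULT_PDF_METHOD = "pdfminer"  # default PDF parse method named in the comment above


def effective_parse_method(module: dict, all_files: bool = False) -> str:
    """Return the parse method that would apply under the described behavior.

    Hypothetical model for illustration only.
    """
    if module.get("file_type") == "all_files" and not all_files:
        # Without all_files=True, the all_files module is not applied
        # and PDFs fall back to the default method.
        return DEFAULT_PDF_METHOD
    # Otherwise the configured module is used; modules without an explicit
    # parse_method (such as llamaparse) are identified by module_type.
    return module.get("parse_method", module["module_type"])


module = {
    "module_type": "langchain_parse",
    "parse_method": "upstagedocumentparse",
    "file_type": "all_files",
}
print(effective_parse_method(module))                  # pdfminer
print(effective_parse_method(module, all_files=True))  # upstagedocumentparse
```

This matches what was reported in the issue: with only the YAML setting, every run looked like pdfminer output regardless of the configured module.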

I'll add it to Docs right away.

Finally, sorry for the confusion, and thanks for your hard work on this contribution!!!
