[BUG] When `file_type` is set to `all_files`, only `pdfminer` is used #1065

e7217 · 2024-12-20T03:22:11Z

Describe the bug
Hello,

It seems that when a parser module is configured with file_type: all_files, only pdfminer is applied. I have tried langchain_parse with upstagedocumentparse as well as llamaparse, and both appear to use pdfminer exclusively. Even when I set output_format to html, pdfminer still seems to be used. Am I mistaken about something?

Below is the YAML file I configured:

- module_type: langchain_parse
  parse_method: upstagedocumentparse
  split: page
  file_type: all_files
  output_format: html

or

- module_type: llamaparse
  result_type: markdown
  file_type: all_files
  language: ko

And here is the result:

I would appreciate your help. Thank you.


e7217 · 2024-12-20T06:06:31Z

I found a way to enable all_files.
To use it, all_files must be set to True in start_parsing():

parser.start_parsing(parse_config, all_files=True)
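For reference, here is a minimal sketch of that call wrapped in a small helper. The helper name parse_all_files is hypothetical, and the Parser setup shown in the trailing comment (import path, constructor arguments, file paths) is an assumption to adapt to your project. The only call taken from this thread is start_parsing(..., all_files=True).

```python
def parse_all_files(parser, parse_config):
    """Run parsing with all_files=True so file_type: all_files modules apply.

    `parser` is any object exposing start_parsing(config, all_files=...),
    such as the parser instance used in this issue.
    `parse_all_files` itself is a hypothetical helper, not a library function.
    """
    # Passing all_files=True is the step that was missing in the original report.
    return parser.start_parsing(parse_config, all_files=True)


# Usage (assumed setup; adjust names and paths to your project):
#   from autorag.parser import Parser
#   parser = Parser(data_path_glob="data/*")
#   parse_all_files(parser, "parse_config.yaml")
```

Keeping the flag in one helper makes it harder to forget when several parse configurations are run in turn.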

Which of the following is the correct way to enable all_files?

1. Pass all_files=True: start_parsing(... , all_files=True)
2. Set file_type to all_files in the parser YAML

Additionally, to apply various parse configurations, should I modify the configuration file each time or create a new configuration for each module?

bwook00 · 2024-12-21T01:39:43Z

First, sorry for the late comment (I was on vacation this week🫣)

You're right, you need to call start_parsing(... , all_files=True) to use file_type: all_files!
I forgot to write that in the docs, sorry for the confusion!

I designed it so that if you don't pass all_files=True, it will parse with pdfminer, which is the default parse method for PDF.
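A simplified model of that fallback, for illustration only. This is not the library's actual implementation: the function name effective_parse_method and the dictionary layout are assumptions, and it only mirrors the behavior described in this thread (an all_files module takes effect only when all_files=True is passed, otherwise pdfminer is used).

```python
DEFAULT_PDF_METHOD = "pdfminer"  # default PDF parse method named in this thread


def effective_parse_method(module: dict, all_files: bool = False) -> str:
    """Return the parse method a run would use for one configured module.

    Illustrative model only; names and structure are hypothetical.
    """
    if module.get("file_type") == "all_files" and not all_files:
        # The all_files module is not applied, so the PDF default is used.
        return DEFAULT_PDF_METHOD
    # langchain_parse modules name their method in parse_method;
    # other modules (e.g. llamaparse) are identified by module_type.
    return module.get("parse_method", module["module_type"])


upstage = {
    "module_type": "langchain_parse",
    "parse_method": "upstagedocumentparse",
    "file_type": "all_files",
}
print(effective_parse_method(upstage))                  # pdfminer
print(effective_parse_method(upstage, all_files=True))  # upstagedocumentparse
```

The first call reproduces the symptom in the report: the YAML asks for upstagedocumentparse, but without the flag the run ends up on pdfminer.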

I'll add it to the docs right away.

Finally, sorry again for the confusion, and thanks for your contribution!!!

e7217 added the bug label Dec 20, 2024
