Allow read_pdf to accept a file-like object #103

Open
Lnk2past opened this issue Dec 6, 2019 · 12 comments

Comments

@Lnk2past
Lnk2past commented Dec 6, 2019

In our use case, PDF data is streamed into memory from an external service. To process it with camelot, we have to save that data to a file and then pass the filename over. It would be great to be able to pass a file-like object through the interface instead, as this would save us from writing temporary files only to read them back in. I do not think there is a workaround for this at the moment, but if there is, any information would be greatly appreciated.

I do not know if I will have time to work on a PR in the near future, but does this sound like a reasonable feature to add?
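
For reference, the temporary-file round trip described above can be wrapped once so that calling code only deals with bytes or a file-like object. This is a minimal sketch, not part of camelot: the helper name `as_pdf_path` is hypothetical, and it still writes to disk, so it only hides the step rather than removing it.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def as_pdf_path(source):
    """Yield a temporary file path holding `source`.

    `source` may be bytes or a binary file-like object (hypothetical
    helper; camelot itself only receives the resulting path).
    """
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as out:
            if isinstance(source, (bytes, bytearray)):
                out.write(source)
            else:
                # Stream in chunks so large PDFs are not copied twice in memory.
                shutil.copyfileobj(source, out)
        yield path
    finally:
        # Always clean up, even if the caller raises inside the with-block.
        os.remove(path)


# Intended use (requires camelot to be installed):
# with as_pdf_path(stream) as path:
#     tables = camelot.read_pdf(path)
```

The file is closed before the path is yielded, so the same code should also work on platforms that refuse to reopen a file that is still open.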

@yeus
yeus commented Oct 5, 2020

In this other repository (https://github.com/atlanhq/camelot), which I assume is the original one, there are already two merge requests pending and waiting to be accepted for this issue:

Maybe we can do this quickly with that ;). I think this is really a feature that a lot of people would like to have ...

@vinayak-mehta
Member

Thanks for pointing that out! Right now #13 is taking up a lot of my time, but I will try to get to this over the weekend.

@yeus
yeus commented Oct 5, 2020

For people whose main concern is keeping the file "in memory", for example as a spooled temporary file, a short workaround could be the following:

Use this library: https://github.com/mbello/memory-tempfile to create a file on a tmpfs in memory. This solution only works on Linux, though. Additionally, it's difficult to do this in Docker images or on Kubernetes.
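
A rough sketch of the same idea using only the standard library, as a stand-in for memory-tempfile rather than a demonstration of its API. It assumes that `/dev/shm` or `/run/shm` is a tmpfs mount, which is common on Linux but not guaranteed (containers may not expose it). The function names and candidate paths are illustrative assumptions.

```python
import os
import tempfile


def memory_backed_dir(candidates=("/dev/shm", "/run/shm")):
    """Return the first writable candidate directory, or None.

    The candidates are assumed to be tmpfs mounts; this function does not
    verify the filesystem type.
    """
    for candidate in candidates:
        if os.path.isdir(candidate) and os.access(candidate, os.W_OK):
            return candidate
    return None


def write_pdf_to_memory(pdf_bytes):
    """Write bytes to a named temp file, preferring a memory-backed dir.

    Falls back to the default temp directory (dir=None) when no candidate
    exists. The caller is responsible for deleting the returned path.
    """
    tmp = tempfile.NamedTemporaryFile(
        suffix=".pdf", dir=memory_backed_dir(), delete=False
    )
    with tmp:
        tmp.write(pdf_bytes)
    return tmp.name
```

The returned path can then be passed to `camelot.read_pdf` like any other filename and removed with `os.remove` afterwards.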

@yeus
yeus commented Oct 5, 2020

@vinayak-mehta just saw your comment. Looking forward to this! If you need any help (testing, review...) just contact me ;) although I am not that deep into the library ...

@vinayak-mehta
Member

Thanks for the suggestion, and for offering your help! I will try to get to the PRs by the weekend and will definitely comment here if I need help :)

@pilotjoe
pilotjoe commented Oct 8, 2020

I mentioned another use case for this in atlanhq/camelot#189: reading from a file-like object would come in handy when a website requires more advanced authentication (e.g. SharePoint), so that the object has to be pulled using a library like requests.

@vinayak-mehta
Member

@pilotjoe Thank you for your comment describing the use-case.

Last week, I ended up spending a lot of time on #13. Will get to this soon.

@yash12392
yash12392 commented Mar 9, 2021

Hey @vinayak-mehta, just checking in: did you get around to doing this?

@HeskethGD

I would love to see this feature implemented. Our use case is an AWS Lambda function that reads a PDF from S3 and uses regex to find the relevant pages. We then want to pass those pages as bytes to a table extraction package, ideally without having to write them to a file and read them back within the Lambda.

@Vesalon
Vesalon commented Jun 23, 2023

I want to add to the comments that this would be a very useful feature to have. Writing to and reading from disk can be quite expensive.

@yg-smile

This would be a very useful feature. I would greatly appreciate any update.

@bosd
bosd commented Sep 1, 2024

This would be a very useful feature. I would greatly appreciate any update.

I was working on a forward port over here:
py-pdf#16
Feel free to help and contribute over there.

Niremizov pushed a commit to omkod/camelot that referenced this issue Sep 2, 2024

9 participants