add OCR / translation #24

Open
behrica wants to merge 26 commits into base: ocr

behrica commented Apr 4, 2021

New PR, as requested.

R/keyword_search.r Outdated
R/azure.R Outdated
Comment on lines +94 to +95
dir.create(output_dir_name,showWarnings = F)
writeLines(x,file(paste0(dirname(path),"/outputs/",basename(path),".txt")) )
Owner

@behrica I'm a bit wary of this code: it creates a directory and writes lines into it without the user directly specifying the directory or opting in to saving the results. I saw similar code earlier on too. Is this piece needed for the OCR?

Author

The reason for this is traceability.
My users like that the magic of the OCR + translation becomes visible in the form of files,
so they can archive the whole folder of original PDFs + OCR results + translation results.

This may later help explain why certain things were not found, which is important in our context.

Maybe there is a better way to do this in R.

Author

We could have an option to enable or disable this.

The same is true for progress logging.

Since the potentially large PDFs get uploaded to Azure, a "search" over lots of files could take hours.
(I implemented some caching, so a second search will not repeat that work.)

Do you have a suggestion for this?
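
One possible shape for the opt-in discussed in this thread, as a minimal sketch only: the function name `save_ocr_text` and the arguments `output_dir` and `verbose` are hypothetical and not taken from the PR. The idea is that nothing is written unless the caller passes a directory, and progress messages appear only on request.

```r
# Hypothetical helper (not part of the PR): writes OCR/translation text to
# disk only when the caller explicitly supplies an output directory.
save_ocr_text <- function(x, path, output_dir = NULL, verbose = FALSE) {
  # Default behaviour: no side effects on the file system.
  if (is.null(output_dir)) {
    return(invisible(NULL))
  }
  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  out_file <- file.path(output_dir, paste0(basename(path), ".txt"))
  # Passing a file name lets writeLines() open and close the connection itself.
  writeLines(x, out_file)
  if (verbose) {
    message("Wrote OCR text for ", basename(path), " to ", out_file)
  }
  invisible(out_file)
}

# Usage: opt in explicitly, here with a temporary directory.
out_dir <- file.path(tempdir(), "outputs")
written <- save_ocr_text(c("line 1", "line 2"), "docs/report.pdf",
                         output_dir = out_dir, verbose = TRUE)
# Without output_dir nothing is written and NULL is returned.
skipped <- save_ocr_text("ignored", "docs/report.pdf")
```

Using `message()` for progress keeps the output easy to silence with `suppressMessages()`, and users who want the archive of OCR and translation files can still get it by passing a directory next to their PDFs.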

README.md Outdated
R/directory_search.r Outdated
R/keyword_search.r Outdated
R/keyword_search.r Outdated
behrica and others added 2 commits April 7, 2021 21:54
 cleaned up some commented code
 replaced T -> TRUE, F -> FALSE

behrica commented Apr 7, 2021

I cleaned up the code, used argument lists, and replaced T -> TRUE and F -> FALSE.

behrica commented Apr 7, 2021

I am aware that I am not a good R programmer, more of an R user,
but I hope we can get the code into an acceptable state.

In this sense, my PR is more a "proof of concept" than "ready-made code";
it demonstrates the usefulness and feasibility of automatic OCR and translation.

For my colleagues this is a real time saver.

lebebr01 linked an issue on Nov 16, 2021 that may be closed by this pull request

Successfully merging this pull request may close these issues.

enhancing this package with "OCR" and "translation"
2 participants