add OCR / translation #24

behrica · 2021-04-04T15:26:44Z

new PR as requested

fixed docu

lebebr01 · 2021-04-06T17:08:37Z

R/keyword_search.r

+    dir.create(output_dir_name,showWarnings = F)
+    writeLines(x,file(paste0(dirname(path),"/outputs/",basename(path),".txt")) )


@behrica I'm a bit wary of this piece of code, which creates the directory and writes the lines to it without the user directly specifying the directory or opting in to saving the results. I saw similar code earlier on too. Is this piece needed for the OCR?

The reason for this is traceability.
My users like the fact that the magic of the OCR + translation becomes visible in the form of files,
so they can archive the whole folder of original PDFs + OCR result + translation result.

This can later help explain why certain things were not found, which is important in our context.

Maybe there is a better way to do this in R.

We could have an option to enable or disable this.
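One way such an option could look is sketched below. This is only an illustration, not the code in this PR: the helper name `save_text_output` and the arguments `save_output` and `output_dir` are hypothetical. It keeps the `outputs` folder next to the source file as the default location, but writes nothing unless the caller opts in.

```r
# Hypothetical helper (not part of the package): write OCR/translation text
# to disk only when the caller explicitly opts in with save_output = TRUE.
save_text_output <- function(x, path, save_output = FALSE,
                             output_dir = file.path(dirname(path), "outputs")) {
  if (!isTRUE(save_output)) {
    # Default: leave the file system untouched.
    return(invisible(NULL))
  }
  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  out_file <- file.path(output_dir, paste0(basename(path), ".txt"))
  # Passing a file name (rather than file(...)) lets writeLines() open and
  # close the connection itself.
  writeLines(x, out_file)
  invisible(out_file)
}
```

A call such as `save_text_output(text, "docs/report.pdf", save_output = TRUE)` would then write `docs/outputs/report.pdf.txt`, while the default call writes nothing.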

The same is true for progress logging.

As the potentially large PDFs get uploaded to Azure, a "search" over lots of files could take hours.
(I implemented some caching, so a second search will not repeat that work.)

Do you have a suggestion for this?
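For the progress logging, one possible shape is a `verbose` argument that reports through `message()`, which callers can also silence with `suppressMessages()`. The sketch below is an assumption-laden illustration, not the caching implemented in this PR: `process_files`, `ocr_fun`, `cache_dir` and `verbose` are hypothetical names, and the cache is keyed by file base name only, which is a simplification.

```r
# Hypothetical wrapper: `ocr_fun` stands in for the slow OCR/translation call
# and is assumed to return a non-NULL value (e.g. a character vector).
process_files <- function(paths, ocr_fun, cache_dir = tempdir(), verbose = TRUE) {
  results <- vector("list", length(paths))
  names(results) <- paths
  for (i in seq_along(paths)) {
    # Simplification: the cache key is the base name, so two files with the
    # same name in different folders would share one cache entry.
    cache_file <- file.path(cache_dir, paste0(basename(paths[i]), ".rds"))
    if (file.exists(cache_file)) {
      if (verbose) message(sprintf("[%d/%d] cached: %s", i, length(paths), paths[i]))
      results[[i]] <- readRDS(cache_file)
    } else {
      if (verbose) message(sprintf("[%d/%d] processing: %s", i, length(paths), paths[i]))
      results[[i]] <- ocr_fun(paths[i])
      saveRDS(results[[i]], cache_file)
    }
  }
  results
}
```

With `verbose = FALSE` the function runs silently, so both the progress output and the output files could be controlled by explicit arguments.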

Co-authored-by: Brandon LeBeau <[email protected]>

cleaned up some commented code replaced T -> TRUE, F -> FALSE

behrica · 2021-04-07T20:57:12Z

I cleaned up the code, used argument lists and replaced T -> TRUE, F -> FALSE.

behrica · 2021-04-07T21:01:59Z

I am aware that I am not a good R programmer, more an R user,
but I hope we can get the code into an acceptable state.

In this sense, my PR is more a "proof of concept" than "ready-made code",
which demonstrates the usefulness and feasibility of automatic OCR and translation.

For my colleagues this is a real time saver.

behrica and others added 23 commits November 20, 2020 13:15

implemened first OCR

e994e13

added translation

2b82958

bump version and fixes

3fe8bb0

pass use_azure

1b7823f

fixed docu

refactored

5270b8c

fixed indentity function

09a7a85

fixed syntax error

a38b7c2

fixed functions

137207d

fixed NAMESPACE

8c57fa3

text chunking - WIP

19420cb

added path to translation

84afde9

fixed chunking and all

f01e1c7

removed unused code

6e1cfd4

use all ocr-ed content

e3da261

removed unused code

dca5526

re-implemened spli

64a3da5

fixed splitting

f41acea

fixed imports

02096a8

increased threshold

8aedce1

saving of pdf_text result

3185921

added dplyr

47d23da

fixed output folder creation

2e00d92

Update README.md

5b597c0