text_extractor

This script takes rich-format files (e.g. Word, PDF) from a folder and extracts their text using Apache Tika.

The extracted text is saved as a separate .json file for each original file.
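
As a rough illustration of that flow (not the actual contents of text_extract.py), the sketch below walks an input folder, extracts the text of each file, and writes one .json file per input. It assumes the tika-python package (`pip install tika`, which needs Java) for the default extractor. The function names, the output naming scheme, and the JSON keys `source` and `text` are hypothetical.

```python
import json
from pathlib import Path


def tika_extract(path):
    """Return the plain text of a file using the tika-python package."""
    # Imported lazily so the rest of the sketch works without Tika installed.
    from tika import parser

    parsed = parser.from_file(str(path))
    # "content" can be None when Tika finds no text in the file.
    return parsed.get("content") or ""


def extract_folder(src="original_files", dst="json_files", extract=tika_extract):
    """Write one JSON file to dst for every regular file in src."""
    src_dir, dst_dir = Path(src), Path(dst)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src_dir.iterdir()):
        if not path.is_file():
            continue  # skip sub-folders
        # Hypothetical record layout; the real script may use other keys.
        record = {"source": path.name, "text": extract(path)}
        out_path = dst_dir / (path.name + ".json")
        out_path.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
        written.append(out_path)
    return written
```

Calling `extract_folder()` with no arguments uses the folder names from this repository. Passing a different `extract` callable makes it possible to try the loop without a running Tika server.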

Instructions

1 - Obtain a new Ubuntu server [e.g. c9.io (free), VirtualBox, AWS, GoDaddy cloud]
2 - Copy the installer script to the server:
$ wget https://raw.githubusercontent.com/jmmnn/text_extractor/master/server_install.py
3 - Run the installer, answering yes when prompted:
$ python3 server_install.py
(On Ubuntu 14 you can use plain python, but on Ubuntu 16 only python3 is installed by default.)

At this point you have all you need!

If you want to test it:
4 - Change directory to text_extractor:
$ cd text_extractor
5 - Then run:
$ python text_extract.py
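
To check that the test run produced output, a small script like the following can list what ended up in the json_files folder. It is only a sketch: the helper name `summarize_outputs` is hypothetical, and because the structure of the generated JSON is not documented here, it reports the top-level type and size of each file rather than assuming specific keys.

```python
import json
from pathlib import Path


def summarize_outputs(folder="json_files"):
    """Map each .json file name in folder to (top-level type name, item count)."""
    summary = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # len() applies to dicts, lists and strings; other JSON values get None.
        size = len(data) if isinstance(data, (dict, list, str)) else None
        summary[path.name] = (type(data).__name__, size)
    return summary
```

Run from inside the text_extractor directory, `print(summarize_outputs())` shows one entry per generated file. An empty result suggests the extraction step wrote nothing.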

To process your own files, place them in the "original_files" folder and run the command above again. (You can copy them to your server over sftp, or download them with wget.)
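
If the files are reachable by URL, the download can also be done from Python instead of wget. The sketch below uses only the standard library. The function name `fetch_into_folder` and the example URL in the note are hypothetical, and the local file name is simply taken from the last segment of the URL path.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlparse


def fetch_into_folder(url, folder="original_files"):
    """Download url into folder and return the path of the saved file."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Use the last path segment of the URL as the local file name.
    name = Path(unquote(urlparse(url).path)).name
    if not name:
        raise ValueError(f"cannot derive a file name from {url!r}")
    target = target_dir / name
    # Stream the response to disk so large files are not held in memory.
    with urllib.request.urlopen(url) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

For example, `fetch_into_folder("https://example.com/report.pdf")` would save report.pdf into original_files, after which the extraction command can be run again.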

