Commit
Showing 9 changed files with 423 additions and 87 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
include requirements.txt |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,145 @@ | ||
tika-app-python | ||
=============== | ||
|
||
Overview | ||
-------- | ||
|
||
tika-app-python is a wrapper for `Apache Tika App`_. | ||
|
||
Apache 2 Open Source License | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
tika-app-python can be downloaded, used, and modified free of charge. It | ||
is available under the Apache 2 license. | ||
|
||
Authors | ||
------- | ||
|
||
Main Author | ||
~~~~~~~~~~~ | ||
|
||
Fedele Mantuano (**Twitter**: | ||
`@fedelemantuano <https://twitter.com/fedelemantuano>`_) | ||
|
||
Installation | ||
------------ | ||
|
||
Clone the repository | ||
|
||
:: | ||
|
||
git clone https://github.com/fedelemantuano/tika-app-python.git | ||
|
||
and install tika-app-python with ``setup.py``: | ||
|
||
:: | ||
|
||
cd tika-app-python | ||
|
||
python setup.py install | ||
|
||
or use ``pip``: | ||
|
||
:: | ||
|
||
pip install tika-app | ||
|
||
Usage | ||
----- | ||
|
||
Import the ``TikaApp`` class and create a client: | ||
|
||
:: | ||
|
||
from tikapp import TikaApp | ||
|
||
tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.13.jar") | ||
|
||
To get the **content type**: | ||
|
||
:: | ||
|
||
tika_client.detect_content_type("your_file") | ||
|
||
To detect the **language**: | ||
|
||
:: | ||
|
||
tika_client.detect_language("your_file") | ||
|
||
To extract **all metadata and content**: | ||
|
||
:: | ||
|
||
tika_client.extract_all_content("your_file") | ||
|
||
To extract **only the content**: | ||
|
||
:: | ||
|
||
tika_client.extract_only_content("your_file") | ||
|
||
To submit a base64-encoded payload instead of a file path, call the same | ||
methods with the ``payload`` argument: | ||
|
||
:: | ||
|
||
tika_client.detect_content_type(payload="base64_payload") | ||
tika_client.detect_language(payload="base64_payload") | ||
tika_client.extract_all_content(payload="base64_payload") | ||
tika_client.extract_only_content(payload="base64_payload") | ||
|
||
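The ``payload`` argument takes a base64 string. As a minimal sketch, the standard library can produce one from raw bytes; the helper name `to_base64_payload` is hypothetical and not part of tika-app-python, and the final call is left commented out because it needs a local Tika JAR.

```python
import base64


def to_base64_payload(data):
    """Encode raw bytes as a base64 text string (hypothetical helper)."""
    return base64.b64encode(data).decode("ascii")


# In practice the bytes would come from open(path, "rb").read().
payload = to_base64_payload(b"Hello, Tika!")
print(payload)

# Assumed usage, following the README calls above (requires the Tika JAR):
# tika_client.detect_content_type(payload=payload)
```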
Usage from command-line | ||
----------------------- | ||
|
||
If you installed tika-app-python with ``pip`` or ``setup.py``, you can | ||
use it from the command line. tikapp needs the path to the Apache Tika | ||
app JAR. You can: | ||
|
||
- leave the default value: ``/opt/tika/tika-app-1.13.jar`` | ||
- set the environment variable ``TIKA_APP_JAR`` | ||
- use the ``--jar`` switch | ||
|
||
The last one overrides all the others. | ||
|
||
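The lookup order for the JAR path can be sketched as a small function. This is an illustration only, not the project's actual code: the name `resolve_jar` is hypothetical, and it assumes the environment variable takes precedence over the default, which the README implies but does not state outright.

```python
import os

# Default path quoted in the README.
DEFAULT_JAR = "/opt/tika/tika-app-1.13.jar"


def resolve_jar(cli_jar=None, environ=None):
    """Pick the JAR path: --jar value, then TIKA_APP_JAR, then the default."""
    environ = os.environ if environ is None else environ
    if cli_jar:
        # The --jar switch overrides everything else.
        return cli_jar
    # Assumption: the environment variable beats the built-in default.
    return environ.get("TIKA_APP_JAR", DEFAULT_JAR)


print(resolve_jar(environ={}))
print(resolve_jar(environ={"TIKA_APP_JAR": "/tmp/tika.jar"}))
print(resolve_jar("/srv/tika.jar", {"TIKA_APP_JAR": "/tmp/tika.jar"}))
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.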
These are all the switches: | ||
|
||
:: | ||
|
||
usage: tikapp [-h] (-f FILE | -p PAYLOAD) [-j JAR] [-d] [-t] [-l] [-a] | ||
[-v] | ||
|
||
Wrapper for Apache Tika App. | ||
|
||
optional arguments: | ||
-h, --help show this help message and exit | ||
-f FILE, --file FILE File to submit (default: None) | ||
-p PAYLOAD, --payload PAYLOAD | ||
Base64 payload to submit (default: None) | ||
-j JAR, --jar JAR Apache Tika app JAR (default: None) | ||
-d, --detect Detect document type (default: False) | ||
-t, --text Output plain text content (default: False) | ||
-l, --language Output only language (default: False) | ||
-a, --all Output metadata and content from all embedded files | ||
(default: False) | ||
-v, --version show program's version number and exit | ||
|
||
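The help text above can be approximated with ``argparse``. The following is a sketch reconstructed from that output, not the project's own parser: `build_parser` is a hypothetical name, and the ``-v`` switch is omitted because the version string is not shown here.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the switches listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="tikapp",
        description="Wrapper for Apache Tika App.",
        # Appends "(default: ...)" to each help string.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    # Exactly one of --file / --payload must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-f", "--file", help="File to submit")
    source.add_argument("-p", "--payload", help="Base64 payload to submit")
    parser.add_argument("-j", "--jar", help="Apache Tika app JAR")
    parser.add_argument("-d", "--detect", action="store_true",
                        help="Detect document type")
    parser.add_argument("-t", "--text", action="store_true",
                        help="Output plain text content")
    parser.add_argument("-l", "--language", action="store_true",
                        help="Output only language")
    parser.add_argument("-a", "--all", action="store_true",
                        help="Output metadata and content from all embedded files")
    return parser


args = build_parser().parse_args(["-f", "example_file", "-a"])
print(args.file, args.all)
```

The mutually exclusive group is what produces the ``(-f FILE | -p PAYLOAD)`` part of the usage line.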
Example: | ||
|
||
.. code:: shell | ||
|
||
$ tikapp -f example_file -a | ||
|
||
Performance tests | ||
----------------- | ||
|
||
These are the results of the performance tests in the `tests`_ folder: | ||
|
||
:: | ||
|
||
tika_content_type() 0.708108 sec | ||
tika_detect_language() 1.748900 sec | ||
magic_content_type() 0.000215 sec | ||
tika_extract_all_content() 0.849755 sec | ||
tika_extract_only_content() 0.791735 sec | ||
|
||
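Timings like the ones above can be collected with the standard ``timeit`` module. This is a rough sketch of one way to do it, not the script in the tests folder: `time_call` and `stand_in` are hypothetical names, and `stand_in` only replaces a real Tika call so the example runs without a JAR.

```python
import timeit


def time_call(func, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` single calls."""
    return min(timeit.repeat(func, number=1, repeat=repeats))


def stand_in():
    """Placeholder workload; swap in a real call such as a Tika extraction."""
    return sum(range(10000))


elapsed = time_call(stand_in)
print("stand_in() %f sec" % elapsed)
```

Taking the minimum over several runs reduces noise from other processes, which matters when each Tika call starts a JVM.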
.. _Apache Tika App: https://tika.apache.org/ | ||
.. _tests: https://github.com/fedelemantuano/tika-app-python/tree/develop/tests |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,3 @@ | ||
chainmap==1.0.2 | ||
python-magic==0.4.12 | ||
simplejson==3.8.2 | ||
simplejson==3.10.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,18 +1,56 @@ | ||
#!/usr/bin/env python | ||
# -*- coding: utf-8 -*- | ||
|
||
""" | ||
Copyright 2016 Fedele Mantuano (https://twitter.com/fedelemantuano) | ||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
http://www.apache.org/licenses/LICENSE-2.0 | ||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
""" | ||
|
||
from os.path import join, dirname | ||
from distutils.core import setup | ||
from tikapp import __versionstr__ | ||
|
||
|
||
long_description = open(join(dirname(__file__), 'README')).read().strip() | ||
requires = open(join(dirname(__file__), | ||
'requirements.txt')).read().splitlines() | ||
|
||
|
||
setup( | ||
name='tika-app', | ||
version='0.4', | ||
description='Python client for Apache Tika App', | ||
license="Apache License, Version 2.0", | ||
url='https://github.com/fedelemantuano/tika-app-python', | ||
long_description=long_description, | ||
version=__versionstr__, | ||
author='Fedele Mantuano', | ||
author_email='[email protected]', | ||
maintainer='Fedele Mantuano', | ||
maintainer_email='[email protected]', | ||
url='https://github.com/fedelemantuano/tika-app-python', | ||
keywords=['tika', 'apache', 'toolkit'], | ||
requires=['simplejson'], | ||
license="Apache License, Version 2.0", | ||
packages=['tikapp'], | ||
platforms=["Linux", ], | ||
keywords=['tika', 'apache', 'toolkit'], | ||
classifiers=[ | ||
"License :: OSI Approved :: Apache Software License", | ||
"Intended Audience :: Developers", | ||
"Operating System :: OS Independent", | ||
"Programming Language :: Python", | ||
"Programming Language :: Python :: 2", | ||
"Programming Language :: Python :: 2.6", | ||
"Programming Language :: Python :: 2.7", | ||
], | ||
install_requires=requires, | ||
entry_points={'console_scripts': [ | ||
'tikapp = tikapp.__main__:main']}, | ||
) |
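The setup script above feeds every line of ``requirements.txt`` to ``install_requires``. A slightly more defensive variant, shown here only as a sketch, skips blank lines and comments; `parse_requirements` is a hypothetical helper and the sample text reuses the pins from the requirements diff.

```python
def parse_requirements(text):
    """Return requirement specifiers, ignoring blank lines and comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


sample = "chainmap==1.0.2\npython-magic==0.4.12\n\n# pinned\nsimplejson==3.10.0\n"
print(parse_requirements(sample))
```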