Merge branch 'release/0.5'
fedelemantuano committed Nov 10, 2016
2 parents 95468a6 + 3627115 commit 81bd94c
Showing 9 changed files with 423 additions and 87 deletions.
5 changes: 0 additions & 5 deletions MANIFEST

This file was deleted.

1 change: 1 addition & 0 deletions MANIFEST.in
@@ -0,0 +1 @@
include requirements.txt
145 changes: 145 additions & 0 deletions README
@@ -0,0 +1,145 @@
tika-app-python
===============

Overview
--------

tika-app-python is a wrapper for `Apache Tika App`_.

Apache 2 Open Source License
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

tika-app-python can be downloaded, used, and modified free of charge. It
is available under the Apache 2 license.

Authors
-------

Main Author
~~~~~~~~~~~

Fedele Mantuano (**Twitter**:
`@fedelemantuano <https://twitter.com/fedelemantuano>`_)

Installation
------------

Clone the repository:

::

    git clone https://github.com/fedelemantuano/tika-app-python.git

and install tika-app-python with ``setup.py``:

::

    cd tika-app-python

    python setup.py install

or use ``pip``:

::

    pip install tika-app

Usage
-----

Import the ``TikaApp`` class and create a client:

::

    from tikapp import TikaApp

    tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.13.jar")

To get the **content type**:

::

    tika_client.detect_content_type("your_file")

To detect the **language**:

::

    tika_client.detect_language("your_file")

To extract **all metadata and content**:

::

    tika_client.extract_all_content("your_file")

To extract **only the content**:

::

    tika_client.extract_only_content("your_file")

If you want to pass a base64-encoded payload instead of a file path, use
the same methods with the ``payload`` argument:

::

    tika_client.detect_content_type(payload="base64_payload")
    tika_client.detect_language(payload="base64_payload")
    tika_client.extract_all_content(payload="base64_payload")
    tika_client.extract_only_content(payload="base64_payload")

Usage from command-line
-----------------------

If you installed tika-app-python with ``pip`` or ``setup.py``, you can
use it from the command line. To use tika-app-python you must provide
the Apache Tika app JAR. You can:

-  leave the default value: ``/opt/tika/tika-app-1.13.jar``
-  set the environment variable ``TIKA_APP_JAR``
-  use the ``--jar`` switch

The last one overrides all the others.

These are all the switches:

::

    usage: tikapp [-h] (-f FILE | -p PAYLOAD) [-j JAR] [-d] [-t] [-l] [-a]
                  [-v]

    Wrapper for Apache Tika App.

    optional arguments:
      -h, --help            show this help message and exit
      -f FILE, --file FILE  File to submit (default: None)
      -p PAYLOAD, --payload PAYLOAD
                            Base64 payload to submit (default: None)
      -j JAR, --jar JAR     Apache Tika app JAR (default: None)
      -d, --detect          Detect document type (default: False)
      -t, --text            Output plain text content (default: False)
      -l, --language        Output only language (default: False)
      -a, --all             Output metadata and content from all embedded files
                            (default: False)
      -v, --version         show program's version number and exit

Example:

.. code:: shell

    $ tikapp -f example_file -a

Performance tests
-----------------

These are the results of the performance tests in the `tests`_ folder:

::

    tika_content_type() 0.708108 sec
    tika_detect_language() 1.748900 sec
    magic_content_type() 0.000215 sec
    tika_extract_all_content() 0.849755 sec
    tika_extract_only_content() 0.791735 sec

.. _Apache Tika App: https://tika.apache.org/
.. _tests: https://github.com/fedelemantuano/tika-app-python/tree/develop/tests
42 changes: 40 additions & 2 deletions README.md
@@ -36,7 +36,7 @@ or use `pip`:
pip install tika-app
```

## Usage
## Usage in a project

Import `TikaApp` class:

@@ -79,9 +79,47 @@ tika_client.extract_all_content(payload="base64_payload")
tika_client.extract_only_content(payload="base64_payload")
```
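
The `payload` argument expects the file content already encoded as base64. A minimal sketch of producing such a string with the standard library is shown here; the helper name `to_base64_payload` is hypothetical and not part of tika-app-python, and the sketch assumes the payload is passed as a text string.

```python
import base64


def to_base64_payload(raw_bytes):
    """Encode raw bytes as an ASCII base64 string (hypothetical helper)."""
    return base64.b64encode(raw_bytes).decode("ascii")


# Encode in-memory bytes; for a real file, read it in binary mode first.
payload = to_base64_payload(b"hello tika")

# The resulting string could then be handed to the client, for example:
# tika_client.detect_content_type(payload=payload)
```

Reading the file in binary mode (`open(path, "rb")`) before encoding avoids newline or encoding translation changing the bytes that Tika receives.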

## Usage from command-line

If you installed tika-app-python with `pip` or `setup.py`, you can use it from the command line.
To use tika-app-python you must provide the Apache Tika app JAR. You can:
- leave the default value: `/opt/tika/tika-app-1.13.jar`
- set the environment variable `TIKA_APP_JAR`
- use the `--jar` switch

The last one overrides all the others.
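
The lookup order described above (default value, then `TIKA_APP_JAR`, then `--jar`) can be sketched as follows. This only illustrates the precedence and is not the project's actual code; `resolve_jar`, `DEFAULT_JAR`, and the parameters are hypothetical names.

```python
import os

# Default location taken from the text above.
DEFAULT_JAR = "/opt/tika/tika-app-1.13.jar"


def resolve_jar(cli_jar=None, environ=None):
    """Return the JAR path: --jar wins, then TIKA_APP_JAR, then the default."""
    environ = os.environ if environ is None else environ
    if cli_jar:
        return cli_jar
    return environ.get("TIKA_APP_JAR") or DEFAULT_JAR


# No switch and no environment variable: the default is used.
print(resolve_jar(environ={}))
# Environment variable set: it replaces the default.
print(resolve_jar(environ={"TIKA_APP_JAR": "/tmp/env.jar"}))
# --jar given: it overrides both.
print(resolve_jar(cli_jar="/tmp/cli.jar", environ={"TIKA_APP_JAR": "/tmp/env.jar"}))
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.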

These are all the switches:

```
usage: tikapp [-h] (-f FILE | -p PAYLOAD) [-j JAR] [-d] [-t] [-l] [-a]
              [-v]

Wrapper for Apache Tika App.

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  File to submit (default: None)
  -p PAYLOAD, --payload PAYLOAD
                        Base64 payload to submit (default: None)
  -j JAR, --jar JAR     Apache Tika app JAR (default: None)
  -d, --detect          Detect document type (default: False)
  -t, --text            Output plain text content (default: False)
  -l, --language        Output only language (default: False)
  -a, --all             Output metadata and content from all embedded files
                        (default: False)
  -v, --version         show program's version number and exit
```
```

Example:

```shell
$ tikapp -f example_file -a
```
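
The usage text shows that exactly one of `-f` or `-p` is required. A rough sketch of how such a constraint is typically expressed with `argparse` follows; it covers only three of the switches, `build_parser` is a hypothetical name, and the real `tikapp` entry point may be written differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring part of the usage text above."""
    parser = argparse.ArgumentParser(
        prog="tikapp",
        description="Wrapper for Apache Tika App.",
        # Assumption: this formatter is what yields "(default: ...)" in help.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    # Exactly one of --file / --payload must be supplied.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-f", "--file", dest="file", help="File to submit")
    source.add_argument("-p", "--payload", dest="payload",
                        help="Base64 payload to submit")
    parser.add_argument("-a", "--all", dest="all", action="store_true",
                        help="Output metadata and content from all embedded files")
    return parser


# Parse the same arguments as the shell example above.
args = build_parser().parse_args(["-f", "example_file", "-a"])
print(args.file, args.all)
```

With this grouping, passing both `-f` and `-p`, or neither, makes `argparse` exit with a usage error before any Tika call is attempted.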

## Performance tests

These are the results of performance tests in [profiling](https://github.com/fedelemantuano/tika-app-python/tree/develop/profiling) folder:
These are the results of performance tests in [tests](https://github.com/fedelemantuano/tika-app-python/tree/develop/tests) folder:

```
tika_content_type() 0.708108 sec
3 changes: 2 additions & 1 deletion requirements.txt
@@ -1,2 +1,3 @@
+chainmap==1.0.2
 python-magic==0.4.12
-simplejson==3.8.2
+simplejson==3.10.0
48 changes: 43 additions & 5 deletions setup.py
@@ -1,18 +1,56 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
Copyright 2016 Fedele Mantuano (https://twitter.com/fedelemantuano)
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

from os.path import join, dirname
from distutils.core import setup
from tikapp import __versionstr__


long_description = open(join(dirname(__file__), 'README')).read().strip()
requires = open(join(dirname(__file__),
'requirements.txt')).read().splitlines()


setup(
name='tika-app',
version='0.4',
description='Python client for Apache Tika App',
license="Apache License, Version 2.0",
url='https://github.com/fedelemantuano/tika-app-python',
long_description=long_description,
version=__versionstr__,
author='Fedele Mantuano',
author_email='[email protected]',
maintainer='Fedele Mantuano',
maintainer_email='[email protected]',
url='https://github.com/fedelemantuano/tika-app-python',
keywords=['tika', 'apache', 'toolkit'],
requires=['simplejson'],
license="Apache License, Version 2.0",
packages=['tikapp'],
platforms=["Linux", ],
keywords=['tika', 'apache', 'toolkit'],
classifiers=[
"License :: OSI Approved :: Apache Software License",
"Intended Audience :: Developers",
"Operating System :: OS Independent",
"Programming Language :: Python",
"Programming Language :: Python :: 2",
"Programming Language :: Python :: 2.6",
"Programming Language :: Python :: 2.7",
],
install_requires=requires,
entry_points={'console_scripts': [
'tikapp = tikapp.__main__:main']},
)