Merge branch 'release/1.4.0'
fedelemantuano committed Jul 24, 2018
2 parents 6c9636c + 24d0596 commit c008f84
Showing 14 changed files with 330 additions and 180 deletions.
2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -6,4 +6,4 @@ MANIFEST
build/
dist/
tika_app.egg-info/
venv/
venv*/
14 changes: 12 additions & 2 deletions .travis.yml
Expand Up @@ -9,7 +9,7 @@ python:
- "3.6"

env:
- TIKA_VER="1.16"
- TIKA_VER="1.18"
TIKA_APP_JAR=/tmp/tika-app-${TIKA_VER}.jar

before_script:
Expand Down Expand Up @@ -37,11 +37,21 @@ script:
- python -m tikapp -v
- python -m tikapp -h
- python -m tikapp -a -f tests/files/test.zip
- python -m tikapp -a -k < tests/files/test.zip

deploy:
provider: pypi
user: fmantuano
password:
secure: "p7l2yeLecvW/j1zs1XvxUbqT8f1ATIuao8TmQ4vJMCKJkhsRComHHkoHs5gYSdhj5QsuTSgcrsxV0PXfdEtyCAndYaZGrySNxlNBIShX0WArDfJCTcvnMWgv+4KoPbGwH27oxmo+icqCiCML4y07aRS8IKK1L/YxodbDTnCGK2dWiYm7VZS1oMFBqZiMxvRde3nE82nqlP6U3lvTE8HFNLlUIUa4hPAOXOuL/tX3L3alpStiBqwQMcHLOgdJuU47MXugfNNa3/u/mYeq4FGY85qcYOi/nXLzD01yAl6saeQ8FXtKDIIgnDVotstyMP31t1MU2yC7fxAj3XgHMbyRC9mCRJbgveHIEbnXnka5xl2mhEKa5e+mea9d3w8XWbi9ftx661w8x1V25v6+RH40WDCszbZ4K3cneWNIC5lRLlMLZ7JUB+L/G72dsBOO4BvCfQeo04WyO7GD0klnWxTNHo28ryOQ0e1Z/v1ocnkF/3ZwbgFjp9/I+rPjwNlHb/tzgr9hyD8BshA1nUE4ZOi+EnZNcuBotysia9tJ9EncjWXv0inUj9VenNqYROrF+xaDnKQRjAQr51CTz4uLA5FqauwNNmtgWoKSZVBwjCdSBnWGLYx059bAzkdhgP4sfxvfzhNxMVDhAucBGdXPeecxrzzfHBVpHkuDJXWuBPT7vwQ="
on:
tags: true
branch: master

after_success:
coveralls

notifications:
email: false
slack:
secure: gLkkwrBjb0jCuiqMCRM5hXPzYH+LNA5UpMcvjULKtvywVnFVren0UYVyp6h81eQhsvIZecghESTitwRwL7Ttl9/TEmiJ1fuD/pjUbRtUDQourm3zpPAlxppVaj61Hfnln6MyPW+1QIlfXpOJRl+k7RrLKTm0FxXP6TS9t9t+p97ZIVz/iOVGwYWNgeSIdxy6sdzISkMx3i3bn7tr/ILruc18wqbaBs7GzQpgjaFl0S+7PDv4vBmj/9dTYxku6G+nSuzz+Do+BXAMCdcSEn4O4HT+tYyxmkCgRqn7zM8KtAXQwNkuSdjTOZo3Pn917jZibrEj8SaqbfXW1Q2BSN30zTk6p0Y22DF26qbYcy1XX6VDL52oy9GCGth1vNGNkLh/rFQHhZfCXvaSz1jws9vrtbEtHPXaYqfA+p/Xi01N1ewLMaL8yWA9NnCNF9r178bjiKtb7TAWu7B2o2I1j3FqfvtXsQsCQvNGTUc0LFP6i3geS056J8jrtz21IXvUmAJLHY5Qx9j88/lwA2HnhJquY7pFUztjXTgl2JBsBeGgzyoERnhq75iWQATteqbBkW1t5jkivw8g5QNbwln10PQij0SvvV9Cr02W7yX2nXt77/YeHR7ddwxxTNK8xUcjJJGUdB3AAGq1R92G/rEs4WfTniS8wG0CksOqyZ2BF1Q0gdk=
secure: "gLkkwrBjb0jCuiqMCRM5hXPzYH+LNA5UpMcvjULKtvywVnFVren0UYVyp6h81eQhsvIZecghESTitwRwL7Ttl9/TEmiJ1fuD/pjUbRtUDQourm3zpPAlxppVaj61Hfnln6MyPW+1QIlfXpOJRl+k7RrLKTm0FxXP6TS9t9t+p97ZIVz/iOVGwYWNgeSIdxy6sdzISkMx3i3bn7tr/ILruc18wqbaBs7GzQpgjaFl0S+7PDv4vBmj/9dTYxku6G+nSuzz+Do+BXAMCdcSEn4O4HT+tYyxmkCgRqn7zM8KtAXQwNkuSdjTOZo3Pn917jZibrEj8SaqbfXW1Q2BSN30zTk6p0Y22DF26qbYcy1XX6VDL52oy9GCGth1vNGNkLh/rFQHhZfCXvaSz1jws9vrtbEtHPXaYqfA+p/Xi01N1ewLMaL8yWA9NnCNF9r178bjiKtb7TAWu7B2o2I1j3FqfvtXsQsCQvNGTUc0LFP6i3geS056J8jrtz21IXvUmAJLHY5Qx9j88/lwA2HnhJquY7pFUztjXTgl2JBsBeGgzyoERnhq75iWQATteqbBkW1t5jkivw8g5QNbwln10PQij0SvvV9Cr02W7yX2nXt77/YeHR7ddwxxTNK8xUcjJJGUdB3AAGq1R92G/rEs4WfTniS8wG0CksOqyZ2BF1Q0gdk="
31 changes: 26 additions & 5 deletions README.md
Expand Up @@ -8,6 +8,12 @@
## Overview

tika-app-python is a wrapper for [Apache Tika App](https://tika.apache.org/).
With this library you can analyze:
- file on disk
- payload in base64
- file object (like standard input)

To analyze file objects, use Apache Tika version 1.17 or later.

### Apache 2 Open Source License
tika-app-python can be downloaded, used, and modified free of charge. It is available under the Apache 2 license.
Expand Down Expand Up @@ -48,7 +54,7 @@ Import `TikaApp` class:
```
from tikapp import TikaApp
tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.15.jar")
tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.18.jar")
```

For get **content type**:
Expand All @@ -75,7 +81,7 @@ For detect **only content**:
tika_client.extract_only_content("your_file")
```

If you want to use payload in base64, you can use the same methods with `payload` argument:
You can analyze payload in base64 with the same methods, but passing `payload` argument:

```
tika_client.detect_content_type(payload="base64_payload")
Expand All @@ -84,6 +90,14 @@ tika_client.extract_all_content(payload="base64_payload")
tika_client.extract_only_content(payload="base64_payload")
```
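
A minimal sketch of how a base64 payload might be prepared before calling the methods above. The helper name `to_base64_payload` is hypothetical, and whether `payload` expects `str` or `bytes` is an assumption here; the README only shows a placeholder string.

```python
import base64


def to_base64_payload(data):
    """Encode raw bytes as an ASCII base64 string (hypothetical helper)."""
    return base64.b64encode(data).decode("ascii")


payload = to_base64_payload(b"Lorem ipsum")
print(payload)

# With a configured client (not run here, it needs Java and a Tika JAR):
# tika_client.extract_only_content(payload=payload)
```

In practice the bytes would likely come from a file opened in binary mode rather than from a literal.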

or you can analyze file object (like standard input) with the same methods, but passing `objectInput` argument:

```
tika_client.detect_language(objectInput="objectInput")
tika_client.extract_all_content(objectInput="objectInput")
tika_client.extract_only_content(objectInput="objectInput")
```
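
The string `"objectInput"` in the snippet above is a placeholder for a real file object. A hedged sketch of what passing one could look like follows; `FakeClient` is a stand-in invented for this example so it runs without Java or a Tika JAR, and `extract_from_stream` is a hypothetical helper. Only the `objectInput` keyword comes from the README, and the use of a binary stream is an assumption.

```python
import io


class FakeClient(object):
    """Stand-in for TikaApp (assumption): echoes the stream content back."""

    def extract_only_content(self, path=None, payload=None, objectInput=None):
        return objectInput.read().decode("utf-8")


def extract_from_stream(client, stream):
    """Pass an open file object to the client via the objectInput keyword."""
    return client.extract_only_content(objectInput=stream)


# io.BytesIO behaves like a file opened with open(path, "rb").
result = extract_from_stream(FakeClient(), io.BytesIO(b"hello from a stream"))
print(result)
```

With the real library, `FakeClient()` would be replaced by the `tika_client` instance created earlier, and the stream by something like `sys.stdin` or an opened file.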

## Usage from command-line

If you installed tika-app-python with `pip` or `setup.py` you can use it with command-line.
Expand All @@ -97,8 +111,8 @@ The last one overwrite all the others.
These are all switches:

```
usage: tikapp [-h] (-f FILE | -p PAYLOAD) [-j JAR] [-d] [-t] [-l] [-a]
[-v]
usage: tikapp [-h] (-f FILE | -p PAYLOAD | -k) [-j JAR] [-d] [-t] [-l]
[-a] [-v]
Wrapper for Apache Tika App.
Expand All @@ -107,6 +121,7 @@ optional arguments:
-f FILE, --file FILE File to submit (default: None)
-p PAYLOAD, --payload PAYLOAD
Base64 payload to submit (default: None)
-k, --stdin Enable parsing from stdin (default: False)
-j JAR, --jar JAR Apache Tika app JAR (default: None)
-d, --detect Detect document type (default: False)
-t, --text Output plain text content (default: False)
Expand All @@ -116,12 +131,18 @@ optional arguments:
-v, --version show program's version number and exit
```
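
The `(-f FILE | -p PAYLOAD | -k)` part of the usage line means exactly one input source must be given. A rough sketch of how such a rule can be expressed with `argparse` is shown below; it is an illustration under that assumption, not the project's actual parser, and it covers only a subset of the switches.

```python
import argparse


def build_parser():
    """Build a parser where -f, -p and -k are mutually exclusive and required."""
    parser = argparse.ArgumentParser(
        prog="tikapp", description="Wrapper for Apache Tika App.")

    # Exactly one of the three input sources must be supplied.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-f", "--file", dest="file", help="File to submit")
    source.add_argument(
        "-p", "--payload", dest="payload", help="Base64 payload to submit")
    source.add_argument(
        "-k", "--stdin", dest="stdin", action="store_true",
        help="Enable parsing from stdin")

    parser.add_argument(
        "-a", "--all", dest="all", action="store_true",
        help="Output metadata and content from all embedded files")
    return parser


args = build_parser().parse_args(["-a", "-k"])
print(args.stdin, args.file, args.all)
```

Passing two sources at once, for example `-f example_file -k`, makes `argparse` print an error and exit.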

Example:
Example from file on disk:

```shell
$ tikapp -f example_file -a
```

Example from standard input:

```shell
$ tikapp -a -k < example_file
```
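
The same redirection can be driven from Python by feeding bytes to a child process on standard input. The sketch below assumes Python 3.5+ for `subprocess.run`; `run_with_stdin` is a hypothetical helper, and a small Python one-liner stands in for `tikapp` so the example runs without Java or a Tika JAR.

```python
import subprocess
import sys


def run_with_stdin(command, data):
    """Run command, send data to its stdin, and return its stdout as bytes."""
    completed = subprocess.run(
        command, input=data, stdout=subprocess.PIPE, check=True)
    return completed.stdout


# Stand-in child process (assumption): upper-cases whatever arrives on stdin.
stand_in = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
output = run_with_stdin(stand_in, b"hello")
print(output)

# With tika-app-python installed, the command would presumably be:
# run_with_stdin(["tikapp", "-a", "-k"], open("example_file", "rb").read())
```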

## Performance tests

These are the results of performance tests in [tests](https://github.com/fedelemantuano/tika-app-python/tree/develop/tests) folder:
126 changes: 70 additions & 56 deletions README.rst
@@ -1,4 +1,8 @@
|PyPI version| |Build Status| |Coverage Status| |BCH compliance|
`PyPI version <https://badge.fury.io/py/tika-app>`__ `Build
Status <https://travis-ci.org/fedelemantuano/tika-app-python>`__
`Coverage
Status <https://coveralls.io/github/fedelemantuano/tika-app-python?branch=master>`__
`BCH compliance <https://bettercodehub.com/>`__

tika-app-python
===============
Expand All @@ -7,7 +11,10 @@ Overview
--------

tika-app-python is a wrapper for `Apache Tika
App <https://tika.apache.org/>`__.
App <https://tika.apache.org/>`__. With this library you can analyze: -
file on disk - payload in base64 - file object (like standard input)

To analyze file objects, use Apache Tika version 1.17 or later.

Apache 2 Open Source License
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expand All @@ -31,21 +38,21 @@ Clone repository

::

git clone https://github.com/fedelemantuano/tika-app-python.git
git clone https://github.com/fedelemantuano/tika-app-python.git

and install tika-app-python with ``setup.py``:

::

cd tika-app-python
cd tika-app-python

python setup.py install
python setup.py install

or use ``pip``:

::

pip install tika-app
pip install tika-app

Usage in a project
------------------
Expand All @@ -54,43 +61,52 @@ Import ``TikaApp`` class:

::

from tikapp import TikaApp
from tikapp import TikaApp

tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.15.jar")
tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.18.jar")

For get **content type**:

::

tika_client.detect_content_type("your_file")
tika_client.detect_content_type("your_file")

For detect **language**:

::

tika_client.detect_language("your_file")
tika_client.detect_language("your_file")

For detect **all metadata and content**:

::

tika_client.extract_all_content("your_file")
tika_client.extract_all_content("your_file")

For detect **only content**:

::

tika_client.extract_only_content("your_file")
tika_client.extract_only_content("your_file")

If you want to use payload in base64, you can use the same methods with
You can analyze payload in base64 with the same methods, but passing
``payload`` argument:

::

tika_client.detect_content_type(payload="base64_payload")
tika_client.detect_language(payload="base64_payload")
tika_client.extract_all_content(payload="base64_payload")
tika_client.extract_only_content(payload="base64_payload")
tika_client.detect_content_type(payload="base64_payload")
tika_client.detect_language(payload="base64_payload")
tika_client.extract_all_content(payload="base64_payload")
tika_client.extract_only_content(payload="base64_payload")

or you can analyze file object (like standard input) with the same
methods, but passing ``objectInput`` argument:

::

tika_client.detect_language(objectInput="objectInput")
tika_client.extract_all_content(objectInput="objectInput")
tika_client.extract_only_content(objectInput="objectInput")

Usage from command-line
-----------------------
Expand All @@ -107,29 +123,36 @@ These are all swithes:

::

usage: tikapp [-h] (-f FILE | -p PAYLOAD) [-j JAR] [-d] [-t] [-l] [-a]
[-v]
usage: tikapp [-h] (-f FILE | -p PAYLOAD | -k) [-j JAR] [-d] [-t] [-l]
[-a] [-v]

Wrapper for Apache Tika App.
Wrapper for Apache Tika App.

optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE File to submit (default: None)
-p PAYLOAD, --payload PAYLOAD
Base64 payload to submit (default: None)
-k, --stdin Enable parsing from stdin (default: False)
-j JAR, --jar JAR Apache Tika app JAR (default: None)
-d, --detect Detect document type (default: False)
-t, --text Output plain text content (default: False)
-l, --language Output only language (default: False)
-a, --all Output metadata and content from all embedded files
(default: False)
-v, --version show program's version number and exit

Example from file on disk:

.. code:: shell
optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE File to submit (default: None)
-p PAYLOAD, --payload PAYLOAD
Base64 payload to submit (default: None)
-j JAR, --jar JAR Apache Tika app JAR (default: None)
-d, --detect Detect document type (default: False)
-t, --text Output plain text content (default: False)
-l, --language Output only language (default: False)
-a, --all Output metadata and content from all embedded files
(default: False)
-v, --version show program's version number and exit
$ tikapp -f example_file -a
Example:
Example from standard input

.. code:: shell
$ tikapp -f example_file -a
$ tikapp -a -k < example_file
Performance tests
-----------------
Expand All @@ -140,25 +163,16 @@ folder:

::

(Python 2)
tika_content_type() 0.704840 sec
tika_detect_language() 1.592066 sec
magic_content_type() 0.000215 sec
tika_extract_all_content() 0.816366 sec
tika_extract_only_content() 0.788667 sec

(Python 3)
tika_content_type() 0.698357 sec
tika_detect_language() 1.593452 sec
magic_content_type() 0.000226 sec
tika_extract_all_content() 0.785915 sec
tika_extract_only_content() 0.766517 sec

.. |PyPI version| image:: https://badge.fury.io/py/tika-app.svg
:target: https://badge.fury.io/py/tika-app
.. |Build Status| image:: https://travis-ci.org/fedelemantuano/tika-app-python.svg?branch=master
:target: https://travis-ci.org/fedelemantuano/tika-app-python
.. |Coverage Status| image:: https://coveralls.io/repos/github/fedelemantuano/tika-app-python/badge.svg?branch=master
:target: https://coveralls.io/github/fedelemantuano/tika-app-python?branch=master
.. |BCH compliance| image:: https://bettercodehub.com/edge/badge/fedelemantuano/tika-app-python?branch=develop
:target: https://bettercodehub.com/
(Python 2)
tika_content_type() 0.704840 sec
tika_detect_language() 1.592066 sec
magic_content_type() 0.000215 sec
tika_extract_all_content() 0.816366 sec
tika_extract_only_content() 0.788667 sec

(Python 3)
tika_content_type() 0.698357 sec
tika_detect_language() 1.593452 sec
magic_content_type() 0.000226 sec
tika_extract_all_content() 0.785915 sec
tika_extract_only_content() 0.766517 sec
1 change: 0 additions & 1 deletion requirements.txt
@@ -1,4 +1,3 @@
chainmap
mail-parser>=3
python-magic
simplejson
9 changes: 9 additions & 0 deletions tests/context.py
@@ -0,0 +1,9 @@
# -*- coding: utf-8 -*-

import sys
import os
sys.path.insert(0, os.path.abspath(
os.path.join(os.path.dirname(__file__), '..')))

from tikapp import TikaApp
from tikapp.exceptions import *
Binary file added tests/files/pdf1.pdf
Binary file not shown.
9 changes: 3 additions & 6 deletions tests/performance.py
Expand Up @@ -20,21 +20,18 @@
from __future__ import unicode_literals
import magic
import os
import sys
import timeit

profiling_path = os.path.realpath(os.path.dirname(__file__))
root = os.path.join(profiling_path, '..')
sys.path.append(root)
from tikapp import TikaApp
from context import TikaApp

profiling_path = os.path.realpath(os.path.dirname(__file__))
test_zip = os.path.join(profiling_path, "files", "lorem_ipsum.txt.zip")
test_txt = os.path.join(profiling_path, "files", "lorem_ipsum.txt")

try:
TIKA_APP_JAR = os.environ["TIKA_APP_JAR"]
except KeyError:
TIKA_APP_JAR = "/opt/tika/tika-app-1.15.jar"
TIKA_APP_JAR = "/opt/tika/tika-app-1.18.jar"


def tika_content_type():