Henkei 変形

Henkei is a library for extracting text and metadata from files and documents using the Apache Tika content analysis toolkit.

Henkei was forked from Yomu, which is no longer maintained.

Here are some of the formats supported:

Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
Apple iWork Formats
Rich Text Format (.rtf)
Portable Document Format (.pdf)

For the complete list of supported formats, please visit the Apache Tika Supported Document Formats page.

Upgrading from v1.x to v2.x

Apache Tika v2.x brings some changes. One key change is that the Tika client and server applications have been split up. To keep the gem size down, Henkei only includes the client app. As a result, each call to Henkei starts a new Java process, which runs your command and then terminates.
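
Because every call launches a fresh Java process, extracting the same content repeatedly can be costly. One possible mitigation is to cache results on the Ruby side. The following is a minimal sketch, not part of Henkei: ExtractionCache is a hypothetical class, and the extractor is passed in as a block so the caching logic does not depend on the gem itself.

```ruby
require 'digest'

# Hypothetical helper (not part of Henkei): caches extraction results by
# result type and content digest, so identical data triggers one extraction.
class ExtractionCache
  # The block receives (type, data) and returns the extracted value,
  # e.g. ExtractionCache.new { |type, data| Henkei.read(type, data) }
  def initialize(&extractor)
    @extractor = extractor
    @store = {}
  end

  def read(type, data)
    key = [type, Digest::SHA256.hexdigest(data)]
    # Only call the extractor when this (type, digest) pair is not cached yet.
    @store.fetch(key) { @store[key] = @extractor.call(type, data) }
  end
end
```

With this in place, a second cache.read(:text, data) for the same bytes returns the stored value without invoking the block again. The cache lives in memory for the lifetime of the object; whether that trade-off suits your application is a judgment call.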

Another change is to the metadata keys. Many duplicate keys have been removed in favour of a more standards-based approach. A list of the old vs new key names can be found in the Apache Tika 2.x migration documentation.

Usage

Text, metadata and MIME type information can be extracted by calling Henkei.read directly:

require 'henkei'

data = File.read 'sample.pages'
text = Henkei.read :text, data
metadata = Henkei.read :metadata, data
mimetype = Henkei.read :mimetype, data

Henkei is backward compatible with Yomu:

text = Yomu.read :text, data

Reading text from a given filename

Create a new instance of Henkei and pass a filename.

henkei = Henkei.new 'sample.pages'
text = henkei.text

Reading text from a given URL

This is useful for reading remote files, like documents hosted on Amazon S3.

henkei = Henkei.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
text = henkei.text

Reading text from a stream

Henkei can also read from a stream or any object that responds to read, including file uploads from Ruby on Rails or Sinatra.

post '/:name/:filename' do
  henkei = Henkei.new params[:data][:tempfile]
  henkei.text
end

Reading text from inside images (OCR)

You can enable OCR by passing the optional include_ocr: true argument to the text or html instance methods, or to the read class method. Note that the Tika documentation warns that this will greatly increase processing time.

henkei = Henkei.new 'sample.pages'
text_with_ocr = henkei.text(include_ocr: true)
html_with_ocr = henkei.html(include_ocr: true)

data = File.read 'sample.pages'
text_with_ocr = Henkei.read :text, data, include_ocr: true
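
Since OCR can add considerable processing time, it may be worth measuring the difference on your own documents before enabling it everywhere. The sketch below uses only Ruby's standard Benchmark module; compare_timings is a hypothetical helper, not something Henkei provides, and it times whatever callables you hand it.

```ruby
require 'benchmark'

# Hypothetical helper: runs each labelled callable once and returns a hash
# of label => elapsed wall-clock seconds.
def compare_timings(candidates)
  candidates.transform_values { |job| Benchmark.realtime { job.call } }
end

# Possible usage with Henkei (assumes the gem and a JRE are installed).
# A fresh instance per measurement avoids reusing any earlier result:
#
#   compare_timings(
#     plain: -> { Henkei.new('sample.pages').text },
#     ocr:   -> { Henkei.new('sample.pages').text(include_ocr: true) }
#   )
```

A single run is only a rough indication, so repeating the measurement a few times is advisable.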

Reading metadata

Metadata is returned as a hash.

henkei = Henkei.new 'sample.pages'
henkei.metadata['Content-Type'] #=> "application/vnd.apple.pages"
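
Because metadata key names changed between Tika v1.x and v2.x, code that must cope with either naming can look a value up under several candidate keys. The sketch below works on any plain hash; first_metadata_value is a hypothetical helper, and the 'dc:title' and 'title' key names are assumptions used for illustration, so check the keys your own files actually produce.

```ruby
# Hypothetical helper: returns the value stored under the first of the given
# keys that is present in the metadata hash, or nil if none is present.
def first_metadata_value(metadata, *keys)
  key = keys.find { |k| metadata.key?(k) }
  key && metadata[key]
end

# Illustrative hash only; real keys depend on the file and the Tika version.
sample = {
  'Content-Type' => 'application/vnd.apple.pages',
  'dc:title' => 'Report'
}

first_metadata_value(sample, 'dc:title', 'title') #=> "Report"
first_metadata_value(sample, 'Author')            #=> nil
```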

Reading MIME types

The MIME type is returned as a MIME::Type object.

henkei = Henkei.new 'sample.docx'
henkei.mimetype.content_type #=> "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
henkei.mimetype.extensions #=> ['docx']

Output text in a specific character encoding

You can specify the output character encoding by passing the optional encoding argument to the text or html instance methods, or to the read class method.

henkei = Henkei.new 'sample.pages'
utf_8_text = henkei.text(encoding: 'UTF-8')
utf_16_html = henkei.html(encoding: 'UTF-16')

data = File.read 'sample.pages'
utf_32_text = Henkei.read :text, data, encoding: 'UTF-32'

Installation and Dependencies

Java Runtime

Henkei packages the Apache Tika application jar and requires a working JRE to run. Check that you either have the JAVA_HOME environment variable set, or that java is in your PATH.
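
To verify this from Ruby before calling Henkei, you can search the same two places for a java executable. The sketch below uses only the standard library; find_java is a hypothetical helper, and it does not necessarily mirror how Henkei itself locates Java.

```ruby
# Hypothetical helper: returns the path of the first executable `java` found
# under JAVA_HOME/bin or on PATH, or nil if none is found.
def find_java(env = ENV)
  dirs = []
  java_home = env['JAVA_HOME']
  dirs << File.join(java_home, 'bin') if java_home && !java_home.empty?
  dirs.concat(env.fetch('PATH', '').split(File::PATH_SEPARATOR))

  # PATHEXT lists executable extensions on Windows; it is unset elsewhere.
  exts = env.fetch('PATHEXT', '').split(';')
  exts = [''] if exts.empty?

  dirs.product(exts).each do |dir, ext|
    candidate = File.join(dir, "java#{ext}")
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Example: warn early instead of failing on the first extraction call.
#   abort 'No Java runtime found' unless find_java
```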

Gem

Add this line to your application's Gemfile:

gem 'henkei'

And then execute:

$ bundle

Or install it yourself as:

$ gem install henkei

Heroku

Add the JVM Buildpack to your Heroku project:

$ heroku buildpacks:add heroku/jvm --index 1 -a YOUR_APP_NAME

Contributing

Fork it
Create your feature branch ( git checkout -b my-new-feature )
Create tests and make them pass ( rake test )
Commit your changes ( git commit -am 'Added some feature' )
Push to the branch ( git push origin my-new-feature )
Create a new Pull Request
