
Plugin crashes when search returns docs containing invalid UTF-8 byte sequences. #101

Open
yaauie opened this issue Mar 6, 2019 · 3 comments

Comments

@yaauie
Contributor

yaauie commented Mar 6, 2019

This is a rephrasing of elastic/logstash#10516, opened by @matteogrolla on 2019-03-06.

I have a document in Elasticsearch that crashes the Logstash elasticsearch input plugin when it tries to read it; the document is reproduced at the end of this message along with the error log reported by Logstash.
I'm using Logstash to migrate documents from Elasticsearch to MongoDB, but when Logstash encounters the problematic document, the input plugin is restarted and starts over from the beginning.
I'd like at least to be able to skip the documents that can't be parsed, but I can't find a way to do so.
Can you help me?

P.S. If I create a new document in ES using curl and the textual representation of the problematic document given here, I don't get a parse error from Logstash on this new document.

-------Error log-------

```
[2019-03-06T12:43:47,696][ERROR][logstash.pipeline        ] A plugin had an unrecoverable error. Will restart this plugin.
  Pipeline_id:main
  Plugin: <LogStash::Inputs::Elasticsearch index=>"fulltextmg_33", id=>"3d2d80a0e02debd1b54d39b3e6b88b54a1ea45fe2c8ae8ddf2b0ec42e080ff61", hosts=>["pbauci01"], query=>"{ \"query\": { \"term\": { \"_id\": \"http://www.facebook.com/114701051917886_2073179089403396\"} } }", enable_metric=>true, codec=><LogStash::Codecs::JSON id=>"json_149580ae-80e8-4f8f-8728-66db3890cf1f", enable_metric=>true, charset=>"UTF-8">, size=>1000, scroll=>"1m", docinfo=>false, docinfo_target=>"@metadata", docinfo_fields=>["_index", "_type", "_id"], ssl=>false>
  Error: invalid byte sequence in UTF-8
  Exception: MultiJson::ParseError
  Stack: /opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/jrjackson-0.4.6-java/lib/jrjackson/jrjackson.rb:91:in `is_time_string?'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/jrjackson-0.4.6-java/lib/jrjackson/jrjackson.rb:36:in `load'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/multi_json-1.13.1/lib/multi_json/adapters/jr_jackson.rb:11:in `load'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/multi_json-1.13.1/lib/multi_json/adapter.rb:21:in `load'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/multi_json-1.13.1/lib/multi_json.rb:122:in `load'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/transport/serializer/multi_json.rb:24:in `load'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/transport/base.rb:322:in `perform_request'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/client.rb:131:in `perform_request'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/elasticsearch-api-5.0.5/lib/elasticsearch/api/actions/search.rb:183:in `search'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/logstash-input-elasticsearch-4.2.1/lib/logstash/inputs/elasticsearch.rb:200:in `do_run'
/opt/logstash-6.6.1/vendor/bundle/jruby/2.3.0/gems/logstash-input-elasticsearch-4.2.1/lib/logstash/inputs/elasticsearch.rb:188:in `run'
/opt/logstash-6.6.1/logstash-core/lib/logstash/pipeline.rb:426:in `inputworker'
/opt/logstash-6.6.1/logstash-core/lib/logstash/pipeline.rb:420:in `block in start_input'

[...]
```
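The top stack frame is jrjackson's `is_time_string?`, which matches a regexp against each decoded string. In Ruby, matching a regexp against a string whose bytes are not valid UTF-8 raises exactly this error, which suggests why one bad document takes down the whole read. A minimal standalone reproduction of that failure mode (this is illustrative plain Ruby, not the plugin's code):

```ruby
# Illustrative reproduction of the "invalid byte sequence in UTF-8"
# failure mode: a regexp match against a string containing an invalid
# UTF-8 byte raises ArgumentError in Ruby.
raw = "caf\xC3"                    # truncated multi-byte UTF-8 sequence
puts raw.valid_encoding?           # false

begin
  raw =~ /\d{4}-\d{2}-\d{2}/       # a time-string-shaped check
rescue ArgumentError => e
  puts e.message                   # invalid byte sequence in UTF-8
end
```

Any code path that applies a regexp to the raw document strings before (or while) parsing will hit this, regardless of whether the JSON structure itself is intact.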

Unfortunately, the document pasted into the original bug report is valid UTF-8, likely because the GitHub UI's form auto-coerced the pasted text to UTF-8.

@matteogrolla would you be able to paste the response into a file, and upload the file without any character-encoding conversion?


Potentially related:

@matteogrolla

Hi Ry,
I don't understand why you stripped my workaround when you moved the issue. At a minimum, it clearly shows where the problem comes from.
The workaround isn't the proper solution, since it modifies jrjackson, but it works and could help those who need an urgent fix.
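For anyone needing a stopgap in the meantime, one hypothetical approach (a sketch only, not the workaround referenced above, and it does not require modifying jrjackson) is to scrub invalid bytes out of the raw response body before it reaches the JSON parser:

```ruby
require 'json'

# Hypothetical stopgap (a sketch, not the original workaround): replace
# invalid bytes with U+FFFD before parsing, so a bad document is
# degraded instead of crashing the whole read.
def safe_json_load(body)
  body = body.scrub("\uFFFD") unless body.valid_encoding?
  JSON.parse(body)
end

doc = %Q({"msg":"caf\xC3"})        # payload containing an invalid byte
puts safe_json_load(doc)["msg"]    # the bad byte becomes U+FFFD
```

The trade-off is silent data alteration: the replaced bytes are lost, so this is only acceptable when skipping or degrading unparseable documents is preferable to halting the migration.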

@yaauie
Contributor Author

yaauie commented Apr 3, 2019

@matteogrolla there was no malicious intent on my part; the issue was initially filed in the wrong place and I attempted to move and link to it in the places where it would be better addressed, but failed to also copy along the commentary.

We are still waiting on a follow-up from you with a document that exhibits the symptom:

Unfortunately, the document pasted into the original bug report is valid UTF-8, likely because the GitHub UI's form auto-coerced the pasted text to UTF-8.

@matteogrolla would you be able to paste the response into a file, and upload the file without any character-encoding conversion?

@matteogrolla

I've downloaded the content with

```
curl -X POST 'http://pbauci01:9200/fulltext_33/_search' -H 'cache-control: no-cache' -d '{
  "query": {
    "term": { "url": "http://www.facebook.com/114701051917886_2073179089403396"}
  }
}' > logstash_problematic_doc.json
```

and edited the file to keep only the _source field value

logstash_problematic_doc.txt
