Add `CBORGenerator.Feature.LENIENT_UTF_ENCODING` for lenient handling of Unicode surrogate pairs on writing #222

guillaumebort · 2020-09-29T09:30:06Z

If enabled, the generator outputs the Unicode Replacement Character (U+FFFD) for invalid Unicode sequences (invalid surrogate chars in the Java String) instead of failing with an IllegalArgumentException.

This PR also removes the code duplication between _shortUTF8Encode2 and _encode2.
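
To make the lenient behavior described above concrete, here is a minimal, stdlib-only sketch of UTF-8 encoding with an optional fallback for unpaired surrogates. It is not the code from this PR: the class `LenientUtf8Sketch`, its `encode` method, and the `lenient` flag are hypothetical names chosen for illustration, and the sketch ignores the output buffering that CBORGenerator performs.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only; not part of jackson-dataformat-cbor.
final class LenientUtf8Sketch {

    /**
     * Encodes text as UTF-8. An unpaired surrogate is replaced by U+FFFD
     * when lenient is true, and rejected with IllegalArgumentException otherwise.
     */
    static byte[] encode(CharSequence text, boolean lenient) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(text.length() * 3);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int cp = c;
            if (Character.isSurrogate(c)) {
                boolean paired = Character.isHighSurrogate(c)
                        && i + 1 < text.length()
                        && Character.isLowSurrogate(text.charAt(i + 1));
                if (paired) {
                    // Valid pair: combine both chars into one code point.
                    cp = Character.toCodePoint(c, text.charAt(++i));
                } else if (lenient) {
                    // Lone or misordered surrogate: substitute the replacement character.
                    cp = 0xFFFD;
                } else {
                    throw new IllegalArgumentException(
                            "Invalid surrogate 0x" + Integer.toHexString(c) + " at index " + i);
                }
            }
            writeCodePoint(out, cp);
        }
        return out.toByteArray();
    }

    // Standard UTF-8 layout: 1 to 4 bytes depending on the code point range.
    private static void writeCodePoint(ByteArrayOutputStream out, int cp) {
        if (cp < 0x80) {
            out.write(cp);
        } else if (cp < 0x800) {
            out.write(0xC0 | (cp >> 6));
            out.write(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out.write(0xE0 | (cp >> 12));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        } else {
            out.write(0xF0 | (cp >> 18));
            out.write(0x80 | ((cp >> 12) & 0x3F));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        }
    }
}
```

In this sketch, encoding a string that ends in a lone low surrogate with `lenient` set to true produces the original bytes followed by EF BF BD (the UTF-8 form of U+FFFD), while the same call with `lenient` set to false throws IllegalArgumentException.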

cbor/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java

cowtowncoder · 2020-09-29T16:51:34Z

Ok, first of all, thank you for the contribution! I am a bit concerned about the fact that some code is trying to output invalid Unicode content, essentially, but as long as that is documented and the user has to explicitly enable said feature, I think that is fine.

I added some small notes in the PR itself, but there are 2 bigger questions:

1. Since "master" is for 3.0, which might be far out, you may want to make the PR against 2.12 instead. I can handle merging it forward to master (wrt API changes).
2. I'll need to have a look, but my main concern is with performance: since this is an edge case, it should have no measurable effect on the good case (i.e. content that does not have invalid surrogate characters). I can probably test it myself but thought I'd mention it -- I think the changes do affect the main loops.
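
One common way to keep the good case cheap is to let a tight loop handle ASCII and to place the surrogate check only on the branch taken for non-ASCII chars. The following is a sketch of that structure under stated assumptions, not the CBORGenerator loop and not a measurement: `AsciiFastPathSketch`, `encode`, `writeMultiByte`, and the `lenient` parameter are hypothetical names, and the caller is assumed to provide room for 3 bytes per input char.

```java
// Hypothetical illustration only; not part of jackson-dataformat-cbor.
final class AsciiFastPathSketch {

    /**
     * Encodes src[start, end) as UTF-8 into dst starting at outPtr and
     * returns the new write position. dst must have room for 3 bytes per char.
     */
    static int encode(char[] src, int start, int end,
                      byte[] dst, int outPtr, boolean lenient) {
        int i = start;
        while (i < end) {
            char c = src[i++];
            // Hot path: ASCII costs one comparison and one store per char.
            if (c < 0x80) {
                dst[outPtr++] = (byte) c;
                continue;
            }
            // Slow path: only non-ASCII input ever reaches the surrogate checks.
            int cp = c;
            if (Character.isSurrogate(c)) {
                if (Character.isHighSurrogate(c) && i < end
                        && Character.isLowSurrogate(src[i])) {
                    cp = Character.toCodePoint(c, src[i++]);
                } else if (lenient) {
                    cp = 0xFFFD; // replacement character for a broken pair
                } else {
                    throw new IllegalArgumentException(
                            "Invalid surrogate 0x" + Integer.toHexString(c));
                }
            }
            outPtr = writeMultiByte(dst, outPtr, cp);
        }
        return outPtr;
    }

    // Writes a non-ASCII code point using 2, 3, or 4 bytes.
    private static int writeMultiByte(byte[] dst, int outPtr, int cp) {
        if (cp < 0x800) {
            dst[outPtr++] = (byte) (0xC0 | (cp >> 6));
        } else if (cp < 0x10000) {
            dst[outPtr++] = (byte) (0xE0 | (cp >> 12));
            dst[outPtr++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
        } else {
            dst[outPtr++] = (byte) (0xF0 | (cp >> 18));
            dst[outPtr++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
            dst[outPtr++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
        }
        dst[outPtr++] = (byte) (0x80 | (cp & 0x3F));
        return outPtr;
    }
}
```

In this layout the `lenient` flag is read only after a surrogate has already been found to be unpaired, so well-formed content never touches it. Whether that holds for the real generator would still need a benchmark.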

cowtowncoder · 2020-09-29T16:52:43Z

cbor/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java

-     * the output buffer, but not all characters are single-byte (ASCII)
-     * characters.
-     */
-    private final int _shortUTF8Encode2(char[] str, int i, int end,


Curious as to why this was removed? Or did it just get moved and the diff is confused?

guillaumebort · 2020-09-30

It has been removed because this code was duplicated: _shortUTF8Encode2 and _encode2 were basically the same, the only difference being that one takes a String as an argument and the other takes a char[]. Since this code contains some really non-trivial logic, I thought it would be better not to duplicate it.
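
As a generic illustration of the duplication issue (not the approach this PR takes, which is not shown in the thread), one way to serve both String and char[] callers from a single copy of the logic is to write it against CharSequence and adapt the array with `CharBuffer.wrap`. The class `SharedUtf8Sketch` and its `utf8Length` methods are hypothetical; counting bytes stands in for the real encoding work to keep the example short.

```java
import java.nio.CharBuffer;

// Hypothetical illustration only; not part of jackson-dataformat-cbor.
final class SharedUtf8Sketch {

    /**
     * Single copy of the surrogate-aware logic: returns the number of UTF-8
     * bytes needed for text, counting an unpaired surrogate as U+FFFD (3 bytes).
     * A String can be passed directly because it implements CharSequence.
     */
    static int utf8Length(CharSequence text) {
        int bytes = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                bytes += 1;
            } else if (c < 0x800) {
                bytes += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                bytes += 4; // valid pair encodes one supplementary code point
                i++;
            } else {
                bytes += 3; // other BMP char, or lone surrogate replaced by U+FFFD
            }
        }
        return bytes;
    }

    /** char[] entry point: wraps the slice without copying and reuses the logic above. */
    static int utf8Length(char[] buf, int offset, int len) {
        return utf8Length(CharBuffer.wrap(buf, offset, len));
    }
}
```

The trade-off in this sketch is that `charAt` calls through an interface may be slower than direct array indexing in a hot loop, which is the kind of cost the performance question earlier in the thread is concerned with.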

guillaumebort · 2020-09-30T08:09:43Z

> Since "master" is for 3.0, which might be far out, you may want to instead make PR against 2.12. I can handle merging it forward to master (wrt API changes)

Ok, fair enough, I will base the PR on 2.12.

cowtowncoder · 2020-10-15T03:42:32Z

@guillaumebort Apologies for the slow follow-up here: now back to getting this merged for 2.12.
One thing I'd need, if I hadn't yet asked, would be a CLA. It's here:

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

and usually easiest to print, fill & sign, scan/photo, email to info at fasterxml dot com.
Only needs to be done once before the first contribution. Apologies if we already got one and I somehow missed it.

guillaumebort · 2020-10-16T16:07:12Z

> One thing I'd need, if I hadn't yet asked, would be a CLA. It's here:

Thanks! I need to handle that with the legal team at my company, and I will do it ASAP.

cowtowncoder · 2020-10-16T22:47:20Z

Sounds good -- we have both the individual CLA that I linked earlier (used by most contributors) and a Corporate CLA (CCLA), at

https://github.com/FasterXML/jackson/blob/master/contributor-agreement-corporate.txt

if that makes more sense.

guillaumebort · 2020-10-27T16:11:50Z

Should be ok now, my company (Datadog) sent a signed copy of the Corporate CLA over.

cowtowncoder · 2020-10-28T01:09:31Z

@guillaumebort For some reason I don't see one yet? I assume it'd be sent to [email protected]?

cowtowncoder · 2020-10-28T20:11:56Z

Received the CLA.

cowtowncoder · 2020-10-29T04:47:46Z

Ended up merging this manually, hence closing the PR. The feature and its tests are in and will be included in 2.12.0-rc2.
Thank you again for contributing this! It might make sense to add similar support for Smile, and perhaps other backends too.

guillaumebort force-pushed the lenient-unicode branch 2 times, most recently from fc5aaa5 to df54c36 (September 29, 2020 09:44)

cowtowncoder reviewed Sep 29, 2020

cbor/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java

cowtowncoder reviewed Sep 29, 2020

cbor/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java

cowtowncoder reviewed Sep 29, 2020

cbor/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java (outdated)

cowtowncoder added 2.12 cbor labels Sep 29, 2020

cowtowncoder reviewed Sep 29, 2020

cbor/src/main/java/com/fasterxml/jackson/dataformat/cbor/CBORGenerator.java

guillaumebort force-pushed the lenient-unicode branch from e73a98e to 39109c6 (September 30, 2020 08:28)

guillaumebort changed the base branch from master to 2.12 September 30, 2020 08:29

guillaumebort force-pushed the lenient-unicode branch from 39109c6 to b35e164 (September 30, 2020 08:32)

guillaumebort added 2 commits September 30, 2020 10:33

Add a CBORGenerator feature for lenient unicode encoding

02a2cbc

If enabled, the generator will output the Unicode Replacement Character for invalid unicode sequence (invalid surrogate chars in the Java String) instead of failing with an IllegalArgumentException

Address review comments

5760c70

guillaumebort force-pushed the lenient-unicode branch from b35e164 to 5760c70 (September 30, 2020 08:34)

cowtowncoder changed the title ~~Add a CBORGenerator feature for lenient unicode encoding~~ Add CBORGenerator.Feature.LENIENT_UTF_ENCODING for lenient handling of Unicode surrogate pairs on writing Oct 29, 2020

cowtowncoder added this to the 2.12.0-rc2 milestone Oct 29, 2020

cowtowncoder added a commit that referenced this pull request Oct 29, 2020

Manually merged #222

314bd30

cowtowncoder closed this Oct 29, 2020

cowtowncoder mentioned this pull request Jun 26, 2021

Add SmileGenerator.Feature.LENIENT_UTF_ENCODING for lenient handling of broken Unicode surrogate pairs on writing #276

Closed
