Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CBORGenerator.Feature.LENIENT_UTF_ENCODING for lenient handling of Unicode surrogate pairs on writing #222

Closed
wants to merge 2 commits into from

Conversation

guillaumebort
Copy link

If enabled, the generator will output the Unicode Replacement Character for invalid unicode sequence (invalid surrogate chars in the Java String) instead of failing with an IllegalArgumentException.

Also this PR remove the code duplication between _shortUTF8Encode2 and _encode2.

@guillaumebort guillaumebort force-pushed the lenient-unicode branch 2 times, most recently from fc5aaa5 to df54c36 Compare September 29, 2020 09:44
@cowtowncoder
Copy link
Member

Ok, first of all, thank you for the contribution! I am bit concerned about the fact that some code is trying to output invalid Unicode content, essentially, but as long as that is documented and user has to explicitly enable said feature I think that is fine.

I added some small notes in PR itself, but 2 bigger questions:

  1. Since "master" is for 3.0, which might be far out, you may want to instead make PR against 2.12. I can handle merging it forward to master (wrt API changes)
  2. I'll need to have a look but my main concern is with performance: since this is an edge case, it should have no measurable effect on good case (i.e. content does not have invalid surrogate characters). I can probably test it myself but thought I'll mention it -- I think changes do affect main loops.

* the output buffer, but not all characters are single-byte (ASCII)
* characters.
*/
private final int _shortUTF8Encode2(char[] str, int i, int end,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious as to why this was removed? Or did it just get moved and diff is confused.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has been removed because this code was duplicated: _shortUTF8Encode2 and _encode2 were basically the same with the difference that one was taking a String as an argument and the other one was taking a char[]. Since this code contains some really non trivial logic I thought it would be better to not duplicate it.

@guillaumebort
Copy link
Author

  1. Since "master" is for 3.0, which might be far out, you may want to instead make PR against 2.12. I can handle merging it forward to master (wrt API changes)

Ok fair enough I will base the PR on 2.12

@guillaumebort guillaumebort changed the base branch from master to 2.12 September 30, 2020 08:29
If enabled, the generator will output the Unicode Replacement Character
for invalid unicode sequence (invalid surrogate chars in the Java
String) instead of failing with an IllegalArgumentException
@cowtowncoder
Copy link
Member

@guillaumebort Apologies for slow follow-up here: now back to getting this merge for 2.12.
One thing I'd need, if I hadn't yet asked would be CLA. It's here:

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

and usually easiest to print, fill & sign, scan/photo, email to info at fasterxml dot com.
Only needs to be done once before the first contribution. Apologies if we already got one and I somehow missed it.

@guillaumebort
Copy link
Author

One thing I'd need, if I hadn't yet asked would be CLA. It's here:

Thanks! I need to handle that with the legal team at my company and I do it ASAP.

@cowtowncoder
Copy link
Member

Sounds good -- we have both individual CLA that I linked earlier (and used by most contributors), as well as Corporate CLA (CCLA), at

https://github.com/FasterXML/jackson/blob/master/contributor-agreement-corporate.txt

if that makes more sense.

@guillaumebort
Copy link
Author

Should be ok now, my company (Datadog) sent a a signed copy of the Corporate CLA over.

@cowtowncoder
Copy link
Member

@guillaumebort For some reason I don't see one yet? I assume it'd be sent to [email protected]?

@cowtowncoder
Copy link
Member

Received the CLA.

@cowtowncoder cowtowncoder changed the title Add a CBORGenerator feature for lenient unicode encoding Add CBORGenerator.Feature.LENIENT_UTF_ENCODING for lenient handling of Unicode surrogate pairs on writing Oct 29, 2020
@cowtowncoder cowtowncoder added this to the 2.12.0-rc2 milestone Oct 29, 2020
cowtowncoder added a commit that referenced this pull request Oct 29, 2020
@cowtowncoder
Copy link
Member

Ended up merging this manually, hence closing PR, feature is in, tests, will be included in 2.12.0-rc2.
Thank you again for contributing this! Might make sense to add similar support for Smile, and perhaps other backends too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants