Is this UTF8-friendly? #1
Comments
It depends on what you are trying to achieve. The two most common Unicode-related tasks are probably (1) making a given sequence string-safe by removing the quote characters and (2) encoding binary data as a Unicode string. To be efficient, the latter requires an encoding specifically designed for Unicode, I believe, which the escapeless encodings are not. For the first case, some generalization of the algorithm would be needed to make sure the quote characters are not mapped to bytes that would result in invalid UTF-8 sequences.
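To illustrate the second concern, here is a minimal sketch of how a naive byte substitution can break UTF-8. It is not the escapeless algorithm; `replace_byte` and `is_valid_utf8` are hypothetical helpers written only for this illustration, and the choice of `0xFF` as a replacement byte is an assumption picked because that byte never occurs in valid UTF-8.

```python
def replace_byte(data: bytes, old: int, new: int) -> bytes:
    """Naively substitute every occurrence of one byte value with another."""
    return bytes(new if b == old else b for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = 'say "héllo"'
encoded = text.encode("utf-8")
quote = ord('"')

# Mapping the quote byte to 0xFF removes the quotes but yields invalid UTF-8.
broken = replace_byte(encoded, quote, 0xFF)

# Mapping it to another ASCII byte keeps the sequence valid.
safe = replace_byte(encoded, quote, ord("'"))

print(is_valid_utf8(broken))  # False
print(is_valid_utf8(safe))    # True
```

Any byte-level scheme that wants to stay UTF-8-friendly would have to restrict its replacement bytes in a similar way, which is the generalization mentioned above.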
@kosarev So escapeless is not a Python/JS affair then, if it is that Unicode-unsafe?
Yes, compared to the Unicode-specific encodings mentioned, escapeless is a different animal. It is most efficient when you need to strip certain characters/bytes from a stream at the cost of a low fixed-size overhead.
@kosarev So is it possible to create a compatibility format that can be converted from a Unicode-safe "alt-format" to escapeless, without the need for a Python-like bytes format?
If I understand the idea correctly, sure, there should be no problem using escapeless in the middle of a chain of Unicode-specific encodings. As for the representation of binary data, I guess you mean JS, in which case an array of bytes sounds like a good replacement for Python's byte strings, likely with no changes to the algorithms themselves.
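A rough sketch of such a chain, under stated assumptions: the text is first turned into bytes with UTF-8, a purely binary stage runs in the middle, and Base64 turns the result back into text. `binary_stage_encode` and `binary_stage_decode` are hypothetical identity stand-ins for the binary step, since the actual escapeless API is not shown in this thread; Base64 is just one possible choice for the outer text encoding.

```python
import base64


def binary_stage_encode(data: bytes) -> bytes:
    # Hypothetical stand-in for the binary-only stage (where escapeless
    # would sit); here it is the identity function.
    return data


def binary_stage_decode(data: bytes) -> bytes:
    # Hypothetical inverse of the stand-in above.
    return data


def encode_chain(text: str) -> str:
    raw = text.encode("utf-8")                          # Unicode -> bytes
    processed = binary_stage_encode(raw)                # bytes -> bytes
    return base64.b64encode(processed).decode("ascii")  # bytes -> text


def decode_chain(encoded: str) -> str:
    processed = base64.b64decode(encoded)               # text -> bytes
    raw = binary_stage_decode(processed)                # bytes -> bytes
    return raw.decode("utf-8")                          # bytes -> Unicode


original = 'naïve "quoted" text'
wire = encode_chain(original)
restored = decode_chain(wire)
print(restored == original)  # True
```

In JS the middle stage would presumably operate on an array of bytes in the same position of the chain.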
@kosarev so basically |
Yes, given by |
@kosarev I mean the JSON spec does allow certain "special characters" to slip through, right?
Well, escapeless wouldn't allow you to exclude those special characters, if that's what you mean, because it has to sit in the middle of the encoding chain. That is, it processes purely binary data and so has to be surrounded by Unicode-specific encodings at both ends of the chain. By removing certain bytes from the binary data in the middle, we can't generally control which characters appear in the encoded JSON string, as that depends on the Unicode-specific encoding used.
@kosarev But escapeless can have down to 225 characters, so surely some of the forbidden code space can be stripped off, right?
It can strip off even more characters; it just won't be efficient compared to other approaches. To answer your question, the thing is that removing certain characters from binary data doesn't mean these or some other characters will disappear from the Unicode-encoded version, because most likely there will be no 1-to-1 correspondence.
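A small illustration of the missing 1-to-1 correspondence, using Base64 as an assumed example of the outer text encoding (any other binary-to-text encoding would have its own mapping). The helper `to_text` is hypothetical and exists only for this sketch.

```python
import base64


def to_text(data: bytes) -> str:
    """Encode binary data as ASCII text using standard Base64."""
    return base64.b64encode(data).decode("ascii")


no_plus_byte = b"\xfb"   # contains no 0x2B ('+') byte
has_quote_byte = b'"'    # consists of the 0x22 ('"') byte

text_a = to_text(no_plus_byte)
text_b = to_text(has_quote_byte)

# A '+' appears in the text even though the binary data had no '+' byte.
print(text_a)  # +w==

# The quote byte is present in the data but no quote appears in the text.
print(text_b)  # Ig==
```

So which bytes are absent from the binary data says little about which characters show up after the text encoding is applied.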
I was wondering whether this will work with UTF-8 (and other Unicode encodings).