Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for iso-8859-1, utf8 "sentinel" and numeric entities #268

Merged
merged 12 commits into from
Aug 10, 2018

Conversation

papandreou
Copy link
Contributor

Context: expressjs/body-parser#194

I've rebased the branch now, and the tests pass.

Looking at the documentation now I learn that support for custom encoders and decoders has landed in the mean time. I'm not quite sure if I should try to change this PR to use that, or if it's OK to have iso-8859-1 as a "first class" charset.

A point in favor of having built-in support is that all the heavy lifting is done by the arcane unescape function so it doesn't require iconv-lite or iconv.

I don't think it would be possible to implement the charset switching via the utf8 sentinel within that, as it requires state keeping between the parsing of the individual params.

…so-8859-1, eg. &#9786 => ☺.

This is what browsers send when the user tries to submit characters that aren't available in the charset being used to submit the form.
Copy link
Owner

@ljharb ljharb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add a utf8Sentinel option to stringify that can produce the utf8 param.

lib/parse.js Outdated
}
if (has.call(obj, key)) {
obj[key] = [].concat(obj[key]).concat(val);
if (key === 'utf8' && options.utf8Sentinel) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the utf-8 key would only matter with the utf8Sentinel option?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the feature is a bit opinionated, so I figured that we'd treat utf8=... as a regular parameter when it's not turned on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the user be able to customize the name of this parameter or is this following some specific standard?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not an actual standard, but utf8=✓ seemed to be a de facto standard back when I made this originally. Back when the internet was still in the process of switching to utf-8 and iso-8859-1 was more prevalent :)

I don't think it'd be relevant to let the user customize the parameter name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. Can you add this to the readme somewhere? For example how will folks know how to use it, especially if it's not an actual standard they may have learned about elsewhere?

Copy link
Owner

@ljharb ljharb Jul 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to make true mean "utf8" and a check mark, but make an object able to override either or both? (we'd still want it in the readme)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will get back to this soon and add docs etc. Still not sure I see the value of customizing the parameter name etc. Feels like a "when in doubt, leave it out" scenario until someone actually requests it.

I just noticed that github is also using utf8=✓. If you go to "Files changed" in this PR, then change the diff settings, you'll end up at a url like this:

https://github.com/ljharb/qs/pull/268/files?utf8=%E2%9C%93&diff=unified

... so RoR probably still does it by default.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that making it maximally configurable right out of the gate, but also pleasant for the common defaults, is a good idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a bit of docs in 452008d.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure I get why maximum configurability for the utf8Sentinel parameter would be a good thing when we don't know that anyone will ever need it. That makes it easy to get the configuration parameters wrong, but at that point there'd be no going back. Also it'd add complexity because we'd need to dynamically compute the misinterpreted-as-iso-8859-1 versions of it.

lib/parse.js Outdated
if (has.call(obj, key)) {
obj[key] = [].concat(obj[key]).concat(val);
if (key === 'utf8' && options.utf8Sentinel) {
if (val === '✓' || val === '\xe2\x9c\x93') {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is '\xe2\x9c\x93'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a string representation of the 3 octets in the utf-8 encoding of the character.

decodeURIComponent('%e2%9c%93')
'✓'

Checking for it here makes sure that we will switch into utf-8 mode when the parameter was parsed as iso-8859-1 because of defaultCharset. It's what makes this test pass.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a standard for using this character? Should it be customizable, especially if there lacks a standard?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

lib/parse.js Outdated
if (key === 'utf8' && options.utf8Sentinel) {
if (val === '✓' || val === '\xe2\x9c\x93') {
charset = 'utf-8';
} else if (val === '✓') {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the "numeric encoding" of .

This is what browsers will submit in an application/x-www-form-urlencoded body when the encoding of the page containing the form is iso-8859-1, or when the submitted form has an accept-charset attribute of iso-8859-1. Presumably also with other charsets that does no contain the character, such as us-ascii.

Thus we can use it as an indicator that the rest of the query string or application/x-www-form-urlencoded is iso-8859-1.

This behavior is covered by these two tests:

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we store these kinds of values in named variables, so that they're self-documenting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, 7aeb51e

lib/utils.js Outdated
var decode = function (str, decoder, charset) {
var strWithoutPlus = str.replace(/\+/g, ' ');
if (charset === 'iso-8859-1') {
return unescape(strWithoutPlus); // Cannot throw
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you enumerate how unescape differs from decodeURIComponent with this charset? when would the latter throw and the former do the correct thing?

Copy link
Contributor

@dougwilson dougwilson Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think unescape will decode the format %uXXXX in addition to the standard %XX format. This seems like an undeserved behavior, and at best just weird that the %uXXXX format will work if you use latain 1 but not if you use utf8. It should probably accept %uXXXX in both or neither.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unescape is a legacy global function that converts string => string, interpreting percent-encoded octets to be in iso-8859-1. Percent not followed by two hex chars are left as-is, and I have never succeeded in getting it to throw for any string value, so that's why it doesn't seem necessary to put it in a try...catch.

decodeURIComponent is similar, but interprets the percent-encoded octets as utf-8. It'll throw a URIError if percent is not followed by two hex chars, or if sequences of percent-encoded octets don't form a valid utf-8 character.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Percent not followed by two hex chars are left as-is

This is not correct, see my comment above yours about %uXXXX . Can you make it not decode that format and only touch the %XX ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right. I had forgotten about that. Fixed in bd9bc06.

test/parse.js Outdated
});

t.test('prefers an utf-8 charset specified by the utf8 sentinel to a default charset of iso-8859-1', function (st) {
st.deepEqual(qs.parse('utf8=%E2%9C%93&%C3%B8=%C3%B8', { utf8Sentinel: true, charset: 'iso-8859-1' }), { ø: 'ø' });
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it'd be great if we could use named variables to explain what these values are, and why they're significant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, attempted in papandreou@afb4a16

@papandreou
Copy link
Contributor Author

You're right, I didn't think of stringify at all. I hope noone is in a situation where they need to talk to a legacy iso-8859-1-only service like this, but it would be nice for symmetry, if nothing else. Also a charset: 'iso-8859-1' mode, I guess?

papandreou added a commit to papandreou/qs that referenced this pull request Jul 25, 2018
@papandreou
Copy link
Contributor Author

papandreou commented Jul 25, 2018

Do you think adding charset and utf8Sentinel support to stringify is a requirement for this PR, or is it something we can add later?

@dougwilson
Copy link
Contributor

I never realized that Github used this, that is neat. I was just playing around and the only difference it seems from this implementation and Github is that this requires that utf8 be the first parameter but Github doesn't seem to have that restriction.

Ex https://github.com/ljharb/qs/search?q=👌&utf8=✓&type=

@dougwilson
Copy link
Contributor

But I didn't try very much yet besides some url tweaks in Chrome.

Means that it doesn't have to be the first parameter to affect
the parsing of the entire query string.

ljharb#268 (comment)
papandreou added a commit to papandreou/qs that referenced this pull request Jul 26, 2018
Means that it doesn't have to be the first parameter to affect
the parsing of the entire query string.

ljharb#268 (comment)
@papandreou
Copy link
Contributor Author

@dougwilson, you're right, I also thought about that the other day. It's something that wasn't necessary for my original implementation as I had complete control of the <form>s being posted :)

Fixed in 603c49a.

@papandreou
Copy link
Contributor Author

I implemented support for charset and utf8Sentinel in qs.stringify here: papandreou/qs@feature/iso8859-1...papandreou:feature/stringifyIso

lib/parse.js Outdated
} else if (parts[i] === isoSentinel) {
charset = 'iso-8859-1';
}
parts.splice(i, 1);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oof, can splice be avoided?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess we can store a skipIndex in a variable and use it to ignore the sentinel argument in the second loop? Does that sound better?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

splice mutates, is O(n), and is the singularly most complex API method in JavaScript, so if it's not an imposition, yes that would be great :-D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4e65517.

lib/parse.js Outdated
// the form is iso-8859-1, or when the submitted form has an accept-charset
// attribute of iso-8859-1. Presumably also with other charsets that do not contain
// the ✓ character, such as us-ascii.
var isoSentinel = 'utf8=' + encodeURIComponent('&#10003;'); // %26%2310003%3B
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this is fine, but i'd also be ok with the literal hardcoded value and the commented part being the code that would generate it at runtime)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 228bf8e.

lib/parse.js Outdated
if (options.utf8Sentinel) {
for (i = 0; i < parts.length; ++i) {
if (parts[i].indexOf('utf8=') === 0) {
if (parts[i] === utf8Sentinel) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic seems like it could be if parts[i] === utf8Sentinel, else if parts[i] === isoSentinel, else if parts[i].indexOf('utf8=') === 0 as well (i might be prematurely optimizing, or maybe it would be cleaner, i'm not sure)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we'd need to introduce a foundSentinel variable that gets set in each of the three cases so that we can break out of the loop? Or put the skipIndex=i; i=parts.length; bit in all three cases?

Copy link
Contributor Author

@papandreou papandreou Jul 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... We could also define that a utf8=... parameter we don't explicitly recognize as a checkmark in any character set should just be parsed as a regular parameter? I'm in doubt myself.

var parseValues = function parseQueryStringValues(str, options) {
var obj = {};
var cleanStr = options.ignoreQueryPrefix ? str.replace(/^\?/, '') : str;
var limit = options.parameterLimit === Infinity ? undefined : options.parameterLimit;
var parts = cleanStr.split(options.delimiter, limit);
var charset = options.charset;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we throw an exception here if this value is anything but undefined, iso-8859-1, or utf-8?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Fixed in 8deaab7.

lib/parse.js Outdated
for (i = 0; i < parts.length; ++i) {
if (parts[i].indexOf('utf8=') === 0) {
if (parts[i] === utf8Sentinel) {
charset = 'utf-8';
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if the user passes in iso-8859-1 and this logic sets it to utf-8? should that be allowed, or should this logic be skipped when charset is passed, or should it throw when they're different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's exactly the intention -- that the character set inferred from the sentinel can should take precedence. When you specify both a charset and utf8Sentinel:true, the charset acts as more of a default charset.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. when an explicit charset is provided, it seems to me that the user is saying "i don't care what you think it is, treat it like it's this" - if it's a default charset, i'd expect it to be called defaultCharset when used along with the utf8 sentinel option. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yeah, I see the point, even though adding another option also contributes to the complexity.

Then qs.parse(..., { utf8Sentinel: true, charset: 'iso-8859-1' }) would throw because that combination of options don't make sense?

Another concern is the symmetry with qs.stringify. Here qs.stringify(..., { utf8Sentinel: true, charset: 'iso-8859-1'}) does make sense. It'd be a bit weird to force defaultCharset for parse, but not stringify.

On balance I think it'd be best to keep it as charset in all cases, but very explicitly point out how the charset+utf8Sentinel combo is interpreted in the utf8Sentinel documentation. WDYT?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. consistency between parse and stringify is important, for sure.

I think maybe parse should indeed throw on { utf8Sentinel: true, charset: 'anything that's not utf-8' }, although that makes sense for stringify (where it should work) - and then we can keep the option called "charset".

Copy link
Contributor Author

@papandreou papandreou Jul 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would prohibit my original use case for these additions: A service compatible with the ancient FormMail.pl script (originally written in the nineties). It has to assume iso-8859-1 because of all the ancient HTML snippets on iso-8859-1 pages pointed at it (with no sentinel).
However, more “modern” pages can explicitly switch to utf-8 mode by including the sentinel.
This requires the utf8Sentinel: true, charset: 'iso-8859-1' combo.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahhh that makes sense.

alrighty, thanks for bearing with me on the discussion. let's keep what you've got.

papandreou added a commit to papandreou/qs that referenced this pull request Jul 27, 2018
papandreou added a commit to papandreou/qs that referenced this pull request Jul 27, 2018
@papandreou
Copy link
Contributor Author

I've addressed or replied to all the feedback, looking forward to further discussions :)

Copy link
Owner

@ljharb ljharb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to rebase this down a bit and take another pass locally, but I think your work is done :-) thanks for bearing with me on the back-and-forth!

@ljharb
Copy link
Owner

ljharb commented Jul 31, 2018

@papandreou it would be nice to pull in the changes here - want to do that in this PR, or in a separate one?

@papandreou
Copy link
Contributor Author

Both done now. Looking forward to see this merged so I can move on to body-parser ;)

@ljharb @dougwilson, thanks for the thorough reviews!

@dougwilson
Copy link
Contributor

so I can move on to body-parser ;)

Body-parser only accepts options that work in both qs and querystring equally.

@papandreou
Copy link
Contributor Author

@dougwilson, hmm, that's bad news for completing the work laid out in expressjs/body-parser#194 :(

@dougwilson
Copy link
Contributor

I'm not sure what the implementation would look like but I'm not sure it's impossible to do at all. What I think you implemented here was a two pass solution, right? First look through the parameters for sentinal and second decode. I believe you should be able to do those things already with querystring, yes? It supports a custom decoding function.

@dougwilson
Copy link
Contributor

Basically unless I'm missing something here, the current implementation of the parse changes in this PR could be implemented by a user without ever have modifying the qs module at all.

@papandreou
Copy link
Contributor Author

@dougwilson, damn, I think you're right. That must have been added after I did the original work. I should probably rethink the whole thing then.

@dougwilson
Copy link
Contributor

Yea, I'm only thinking of the current implementation, of course. I think once it transformed into a two pass is when it changed.

Now, don't get me wrong, it's still useful being part of qs directly if it lands as folks can use it directly (especially for query strings -- body parser only operates on the request entity not the url).

It could also just pass to the native impl for qs and use a wrapper for querystring. I believe body parser is doing that currently for at least one of the options (wrapping one of the impls that is different).

@papandreou
Copy link
Contributor Author

@dougwilson, seems like that worked, I've opened a PR here: expressjs/body-parser#326

Copy link
Owner

@ljharb ljharb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I should have this merged and released by this weekend.

README.md Outdated
var sentinel = qs.stringify({ a: '☺' }, { utf8Sentinel: true });
assert.equal(sentinel, 'utf8=%E2%9C%93&a=%E2%98%BA');

var isoSentinel = qs.stringify({ a: 'æ' }, { utf8Sentinel: true, charset: 'iso-8859-1' });
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, this seems a little strange - why would you want the sentinel if you were encoding in “not utf-8”?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Sorry about my earlier comment, I thought this was in qs.parse :)

It might be a slim use case, but including a sentinel that explicitly says that the query string is in iso-8859-1 could be useful in some scenarios. Obviously it's much more useful in the inverse case, because all of this has to do with legacy services that just assume it's iso-8859-1.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, but the option is called utf8Sentinel - why would that appear when it's not utf-8?

Copy link
Contributor Author

@papandreou papandreou Aug 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only because the parameter is called utf8. I agree that the name of the option isn’t ideal, but couldn’t come up with anything else. I’ve tried googling it to see if there’s an established name for the concept, but nothing turned up.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

charsetSentinel? encodingSentinel?

When you say "could be useful", is this a use case you've ever seen anywhere? Or are we just planning for something that might happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like charsetSentinel better, let's go with that instead, fixed in 2efe555.

Well, the test suite for that FormMail.pl-compatible service could have used the ability to encode a query string and an application/x-www-form-urlencoded body with an iso-8859-1 sentinel, as that's one of the cases that explicitly had to be supported.

In other words, it's useful whenever you want to emulate what a browser sends when a form containing utf8=✓ is submitted. So mostly in test suites for web servers that cannot assume that everything is utf-8.

lib/utils.js Outdated
@@ -124,6 +130,12 @@ var encode = function encode(str) {

var string = typeof str === 'string' ? str : String(str);

if (charset === 'iso-8859-1') {
return escape(string).replace(/%u[0-9a-f]{4}/gi, function ($0) {
return '%26%23' + parseInt($0.substr(2), 16) + '%3B';
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should use slice; substr is nonstandard or deprecated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good shout, fixed in 6ff0e39

papandreou added a commit to papandreou/qs that referenced this pull request Aug 2, 2018
papandreou added a commit to papandreou/body-parser that referenced this pull request Aug 2, 2018
ljharb pushed a commit to papandreou/qs that referenced this pull request Aug 10, 2018
ljharb pushed a commit to papandreou/qs that referenced this pull request Aug 10, 2018
Means that it doesn't have to be the first parameter to affect
the parsing of the entire query string.

ljharb#268 (comment)
ljharb pushed a commit to papandreou/qs that referenced this pull request Aug 10, 2018
ljharb pushed a commit to papandreou/qs that referenced this pull request Aug 10, 2018
ljharb pushed a commit to papandreou/qs that referenced this pull request Aug 10, 2018
Means that it doesn't have to be the first parameter to affect
the parsing of the entire query string.

ljharb#268 (comment)
ljharb pushed a commit to papandreou/qs that referenced this pull request Aug 10, 2018
@ljharb ljharb merged commit 286c4bd into ljharb:master Aug 10, 2018
wesleytodd pushed a commit to expressjs/body-parser that referenced this pull request Jul 30, 2024
…326)

* urlencoded, in extended mode: Support iso-8859-1 encoded requests, and also accept iso-8859-1 as a default encoding.

* urlencoded: Support an utf8 sentinel to detect the charset.

* Pass the interpretNumericEntities option through to qs.

* Fix lint

* Support charsets, sentinels etc. via custom decoders

Works in both extended and simple mode.

* Simplify

* Fix empty parameter issue with utf8Sentinel in simple mode

* Run all the charset/sentinel tests in both extended and simple modes

* utf8Sentinel => charsetSentinel

ljharb/qs#268 (comment)

* Update qs to 6.9.1

* Always use the qs module, even in simple mode

#326 (comment)

* Create the simple and extended parser with the same function, reducing duplication

* Don't mention the querystring module in the README

* Fix lint

* Update qs to 6.9.4

* Consistently call it "utf8 sentinel"

* Simplify by relying on the qs module's support for detecting the charset
#326 (comment)

* Simplify further

* Put back debug option

* Document that defaultCharset defaults to utf-8
#326 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants