Cheerio encodes HTML entities too eagerly #4045

Open
atjn opened this issue Aug 25, 2024 · 3 comments

atjn commented Aug 25, 2024

Take this simple HTML link:

<a href="https://example.com/?foo=1&bar=1">link</a>

Now run it through this basic script:

import * as cheerio from "cheerio";

const $ = cheerio.load(`<a href="https://example.com/?foo=1&bar=1">link</a>`);

console.log($.html());

The output is: (I manually removed the extra <body> and <html> tags)

<a href="https://example.com/?foo=1&amp;bar=1">link</a>

Notice that the link in the output is incorrect because the & has been replaced with &amp;. If you try to use the output link, it will not set the same query parameters as the original link did.
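To make the concern concrete, here is a minimal sketch of what happens if the serialized attribute text is used verbatim as a URL, without HTML-decoding it first. It uses only the standard `URL` class available in Node.js and browsers; the variable names are illustrative. It does not show what a browser does when it parses the HTML, only what happens when the raw text is pasted somewhere as-is.

```javascript
// The URL as written in the original document.
const original = new URL("https://example.com/?foo=1&bar=1");

// The attribute text from the serialized output, taken literally
// (assumption: no HTML entity decoding is applied before use).
const verbatim = new URL("https://example.com/?foo=1&amp;bar=1");

console.log([...original.searchParams.keys()]); // [ 'foo', 'bar' ]
console.log([...verbatim.searchParams.keys()]); // [ 'foo', 'amp;bar' ]
console.log(verbatim.searchParams.get("bar"));  // null
```

The second parameter ends up named `amp;bar` rather than `bar`, which is the mismatch described above.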

I think we can all agree that when you load an html document, and then immediately render it without making any changes, the output should be identical and not suddenly have broken links.

I am not sure what needs to be different to support this, but something in the dom-serializer package needs to change. Maybe it should ignore string content, or maybe it shouldn't encode HTML entities by default?

atjn commented Aug 25, 2024

According to #4029 this is "works as intended". I do still want to keep this issue open though, because I still think it would be useful if this kind of automatic escaping did not happen. Is there any chance that we could have something like that? Maybe just as an option?

@nwalters512

Note that I'm not actually a maintainer of Cheerio, so I don't speak for them, I'm just trying to be helpful.

Cheerio is not producing a broken link. What Cheerio produces is 100% valid HTML that will be understandable by any browser, parser, etc. that follows the HTML specification. The fact that attributes can contain character references is an inherent part of the HTML spec (https://html.spec.whatwg.org/multipage/syntax.html#attributes-2). If you have a raw HTML document and you try to use an attribute value verbatim without first parsing it per the HTML specification, you're going to have a bad time. As I noted on the other issue, Cheerio will happily give you the decoded value if you use .attr('href') or the like.
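As a rough illustration of the point about character references, the sketch below shows how an attribute value and its serialized form relate. The two helper functions are hypothetical and handle only `&` and `"`; they are not Cheerio's or its parser's actual code, just a simplified model of the escape and decode steps.

```javascript
// Hypothetical helper: escape a value for use inside a double-quoted HTML attribute.
function escapeAttributeValue(value) {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

// Hypothetical helper: decode the two character references produced above.
// A single pass avoids decoding the same text twice.
function decodeAttributeValue(text) {
  return text.replace(/&(amp|quot);/g, (_, name) => (name === "amp" ? "&" : '"'));
}

const href = "https://example.com/?foo=1&bar=1";
const serialized = escapeAttributeValue(href);

console.log(serialized);                                // https://example.com/?foo=1&amp;bar=1
console.log(decodeAttributeValue(serialized) === href); // true
```

Under this model the `&amp;` form is just the on-disk spelling of the same value: anything that parses the HTML recovers the original URL, which matches the decoded value described for `.attr('href')`.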

atjn commented Aug 27, 2024

@nwalters512 thanks for trying to help. I understand that what Cheerio does is technically compliant, but in my use case, it seems bad.

I want to use Cheerio to edit an existing HTML file which will later be touched by humans. If I am a developer working on the file that Cheerio spat out, I would be really confused to see escape characters in my URLs. Not only is it hard to mentally parse, but if I try to copy-paste the link into a browser, or if I have a fancy code editor where I can click to open the link, I will be taken to the wrong URL, because the browser doesn't perform HTML decoding on a URL that is provided directly by the user.
This would make me pretty frustrated and prompt me to manually change all the HTML encoded characters to their original counterparts. That would last a few hours until someone uses Cheerio to edit the file again.
