description tag should be in UTF-8 encoding but it is in ASCII-8BIT #15
Comments
Using l.description.to_s.encode('UTF-8', {:invalid => :replace, :undef => :replace, :replace => ''}) solves the issue, but we lose the original Unicode character that was in the source.
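A minimal sketch of why the workaround above drops the character, assuming the parser hands back correct UTF-8 bytes that are merely tagged ASCII-8BIT. The variable names are illustrative and not part of simple-rss.

```ruby
# Bytes as they might come back from the parser: valid UTF-8 content
# (a right single quotation mark, U+2019) but tagged ASCII-8BIT.
raw = "It\xE2\x80\x99s".b

# Transcoding from binary to UTF-8 treats every byte above 0x7F as
# undefined, so with :replace => '' the quotation mark is dropped.
lossy = raw.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')

# Relabeling a copy keeps the bytes and only changes the encoding tag,
# so the character survives when the bytes really are UTF-8.
relabeled = raw.dup.force_encoding('UTF-8')

puts lossy
puts relabeled
puts relabeled.valid_encoding?
```

Relabeling only helps when the underlying bytes are already UTF-8; a feed in another encoding would still need a real transcode from its declared encoding.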
Got the same issue.
There is content.force_encoding('binary') in the if condition:

    def unescape(content)
      if content.respond_to?(:force_encoding) && content.force_encoding("binary") =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/n then
        CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
      else
        content.gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
      end
    end

The force_encoding method changes the string's encoding in place, so every string returned by simple-rss will be tagged ASCII-8BIT. I'd rewrite it the following way, but I'm unsure about this 'if' as well, so I haven't made a pull request.

    def unescape(content)
      if content.respond_to?(:force_encoding) && encode_binary(content) =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/n then
        CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
      else
        content.gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
      end
    end

    # Returns a binary copy for the regex test; the caller's string is left untouched.
    def encode_binary(content)
      content.encode('binary', :invalid => :replace, :undef => :replace, :replace => '')
    end
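To illustrate the in-place mutation described above, here is a small sketch. The helper name looks_escaped? and the ESCAPED constant are hypothetical and not part of simple-rss. It uses String#b (Ruby 2.0+), which returns a binary copy, as one possible alternative to a separate encode_binary helper.

```ruby
# The pattern from unescape, copied as-is from the thread.
ESCAPED = /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/n

# Hypothetical helper: run the test on a binary copy so the caller's
# string keeps its original encoding.
def looks_escaped?(content)
  !(content.b =~ ESCAPED).nil?
end

# force_encoding mutates the receiver...
mutated = "caf\u00E9"
mutated.force_encoding('binary')
puts mutated.encoding

# ...whereas String#b leaves the original alone.
original = "caf\u00E9 %41"
flag = looks_escaped?(original)
puts flag
puts original.encoding
```

Whether the regex test is needed at all is left open, as in the comment above; the sketch only shows that the check can be made without side effects.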
Hi @evgeniynickolaev, can you please test it with a feed that has non-Latin characters? Meanwhile I will try to post a sample where it failed for me.
Yes, I've tested it with a feed containing the following Unicode byte sequence: \xE2\x80\x99.
Just as @evgeniynickolaev pointed out, the immediate source of the problem is the in-place force_encoding("binary") call in unescape. I'll throw in a fix which simply removes all the fiddling with encodings. I can't figure out any reason why there would be any need for that.
I ran into the same problem. This gem is not well maintained. I'm going with other gems.
@chengguangnan, what other gem have you found that is well maintained?
Hi @jeremyhaile, I switched to feedjira.
Scrubbing an ASCII-8BIT string isn't ever going to remove anything, because there's no code point that isn't valid 8-bit ASCII. Since we'd really prefer it if everything were UTF-8 anyway, we'll just assume, for now, that whatever comes out of SimpleRSS is probably UTF-8, and just nuke anything that isn't a valid UTF-8 codepoint. Of course, the *real* bug here is that SimpleRSS [unilaterally converts everything to ASCII-8BIT](cardmagic/simple-rss#15). It's presumably *far* too much to ask that it detects the encoding of the source RSS feed and marks the parsed strings with the correct encoding...
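The point about scrubbing can be sketched as follows, assuming Ruby 2.1+ for String#scrub and assuming, as the commit message does, that the bytes are probably UTF-8. The sample bytes are made up for illustration.

```ruby
# A binary-tagged string: a valid UTF-8 sequence (U+2019) plus a stray
# 0xFF byte that can never appear in UTF-8.
bytes = "ok\xE2\x80\x99 bad\xFF".b

# Every byte is valid in ASCII-8BIT, so scrub has nothing to remove.
unchanged = bytes.scrub('')

# Relabel a copy as UTF-8 first; scrub then drops only the invalid byte.
cleaned = bytes.dup.force_encoding('UTF-8').scrub('')

puts unchanged == bytes
puts cleaned
puts cleaned.valid_encoding?
```

This discards bytes that are not valid UTF-8 rather than converting them, so a feed in another encoding would lose its non-ASCII characters.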
Tried this also:
But still ending up with:
Uncaught exception: invalid byte sequence in UTF-8