Fix false multibyte character detection #261
Interesting, I did not know that this is part of the validation. I guess emoji are actually valid with SMTPUTF8, according to Wikipedia: https://en.wikipedia.org/wiki/Email_address#Invalid_email_addresses I haven't actually seen any non-ASCII addresses in the wild, though. Maybe split the validation into the display-name, local, and domain parts?
Thank you for reply! It seems that RFC 6531 permits the use of email addresses containing non-ASCII characters. However, I feel that the actual cases in which non-ASCII characters can be used are quite limited, such as requiring that the source/destination servers support SMTPUTF8.
That is a good policy. I think we should fix the regression first, and then take the time to consider the future policy afterwards.
I’m sorry for introducing this change that might’ve caused problems for you. I remember having some thoughts after the initial PR that introduced the validation that disallowed multibyte characters here: #94 (comment). I see two possible solutions;
What do you guys think?
I think that the examples here https://en.wikipedia.org/wiki/Email_address#valid_email_addresses should work as expected. On top of that, if it's simple enough to access the relevant patterns, one can change the validation easily.
@micke
I think this is a good idea. Basically, we believe that we should continue to NOT allow multibyte characters as we have in the past. However, the recent PR does allow some characters.
I agree. Maybe so.
@micke
Looks great! Could you just document the configuration?
lib/valid_email2/address.rb
Outdated
@@ -133,7 +140,9 @@ def mx_server_is_in?(domain_list)
def address_contain_emoticons?
  return false if @raw_address.nil?

  @raw_address.scan(Unicode::Emoji::REGEX).length >= 1
  @raw_address.each_char.any? do |char|
maybe i missed the conversation here, but why is scan not used any longer? i think this would be good as a comment here (or even better a test for the edgecase). also, if permitted_multibyte_characters_regex we should probably not do any of this, right?
maybe i missed the conversation here, but why is scan not used any longer?
This difference was introduced in this PR, so I simply put it back in: #257
I don't see any particular problem, since each_char also handles one character at a time.
raw_address = 'あいうえお@gmail.com'
raw_address.each_char do |char|
print char, ' '
end
#=> あ い う え お @ g m a i l . c o m
i think this would be good as a comment here (or even better a test for the edgecase). also, if permitted_multibyte_characters_regex we should probably not do any of this, right?
@phoet
I'm sorry, I didn't fully understand what you meant by your comment. Could you please explain it again?
a) i would expect to use a regex with scan instead of each_char as it's easier to read and i would guess also much faster. if we dont use it here, we should document why we need to do it differently.
b) if permitted_multibyte_characters_regex is nil i would expect that we do not run the loop and return instead, or skip the call to scan as it does not work with nil
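A minimal sketch of what points a) and b) might look like together, using only the Ruby standard library. The method name `contains_disallowed_multibyte?` and its `permitted_regex` parameter are hypothetical and chosen for illustration; this is not the gem's actual implementation.

```ruby
# Hypothetical helper for illustration only (not valid_email2's code).
# permitted_regex: optional Regexp describing multibyte characters that
# should be allowed; nil means no multibyte character is allowed.
def contains_disallowed_multibyte?(raw_address, permitted_regex = nil)
  return false if raw_address.nil?
  # Fast path: a pure-ASCII address cannot contain multibyte characters.
  return false if raw_address.ascii_only?

  # From here on the address has at least one non-ASCII character.
  # Point b): without a permitted pattern there is nothing to compare
  # against, so return early instead of matching against nil.
  return true if permitted_regex.nil?

  # Point a): collect the non-ASCII characters with scan, then check
  # whether any of them falls outside the permitted pattern.
  raw_address.scan(/[^[:ascii:]]/).any? do |char|
    !permitted_regex.match?(char)
  end
end
```

With a permitted pattern such as `/[åäö]/`, an address like `åäö@example.com` would pass, while `あいうえお@gmail.com` would still be flagged.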
Understood.
Thank you. I will correct it. 👍
cool, thanks!
94299bf to bb46da4
👍 LGTM
Co-authored-by: Micke Lisinge <[email protected]>
Thank you @sasata299 for your work and patience!
…byte characters configurable (#261)
This has been released in version 7.0.0! Thank you @phoet as well! :)
I understand the motivation behind this change, and I was not fond of the performance of the previous solution, but just to add a data point: we actually do see email addresses with non-ASCII chars in the wild, like Japanese ones, and the previous solution worked fairly well for us. Now, instead of disallowing emojis and allowing all other non-ASCII chars, the burden is on us to figure out all the possible multibyte chars we need to support.
I agree that the gem should work out of the box as most would expect. I think it would be great to have a proper real-life test harness with a couple of examples that should be supported. Maybe you could provide a couple of pseudonymized examples from your dataset.
#257
In this PR, changes were made to prevent Scandinavian characters from being judged as emoji.
However, that approach had a large impact on email address validation: for example, email addresses containing Japanese characters were erroneously judged VALID (the expected result is INVALID).
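The regression described above can be illustrated with a small sketch. It assumes a crude stand-in pattern for emoji (a single code point range, not `Unicode::Emoji::REGEX`) and assumes that "multibyte" means a character whose UTF-8 encoding is longer than one byte. Both helper names are hypothetical.

```ruby
# Crude stand-in for an emoji pattern, for illustration only.
# Assumption: this is NOT Unicode::Emoji::REGEX and covers far fewer cases.
def emoji_stand_in_regex
  /[\u{1F300}-\u{1FAFF}]/
end

# Emoji-only check: only characters matching the emoji pattern are flagged.
def contains_emoji?(address)
  address.match?(emoji_stand_in_regex)
end

# Multibyte check: any character encoded in more than one byte is flagged.
def contains_multibyte?(address)
  address.each_char.any? { |char| char.bytesize > 1 }
end

japanese = 'あいうえお@gmail.com'

# The emoji-only check lets the Japanese address through...
contains_emoji?(japanese)     # => false
# ...while the multibyte check flags it.
contains_multibyte?(japanese) # => true
```

Under these assumptions, an emoji-only check accepts any non-emoji multibyte character, which is why a Japanese address would be treated as valid.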
I made the following changes.