Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix false multibyte character detection #261

Merged
merged 8 commits into from
Nov 19, 2024

Conversation

sasata299
Copy link
Contributor

@sasata299 sasata299 commented Nov 11, 2024

#257
In this PR, changes are being made to prevent Scandinavian characters from being judged as emoji.

However, this method had a large impact on the validity of email addresses, and for example, email addresses containing Japanese characters were erroneously determined to be VALID.(The expected value is INVALID).

I made the following changes.

  • Basically, do not allow multibyte characters
  • If you want to allow multibyte characters, set it explicitly
ValidEmail2::Address.permitted_multibyte_characters_regex = /[ÆæØøÅåÄäÖöÞþÐð]/

@phoet
Copy link
Collaborator

phoet commented Nov 11, 2024

Interesting, did not know that this is part of the validation.

I guess emojii are actually valid according to wikipedia with SMTPUTF8 https://en.wikipedia.org/wiki/Email_address#Invalid_email_addresses

I havnt actually seen any none-ascii adresses in the wild though. maybe split the validation into the display-name, local and domain parts?

@sasata299
Copy link
Contributor Author

Thank you for reply!

It seems that RFC 6531 permits the use of email addresses containing non-ASCII characters.
https://datatracker.ietf.org/doc/html/rfc6531

However, I feel that the actual cases in which non-ASCII characters can be used are quite limited, such as requiring that the source/destination servers support SMTPUTF8.
(And I have never seen it actually used..)

maybe split the validation into the display-name, local and domain parts?

It is a good policy.
But there is the question of how many email addresses containing non-ascii characters can be used in the real world.

I think we should fix the degraded part first, and then take the time to consider the future policy afterwards.

@micke
Copy link
Owner

micke commented Nov 12, 2024

I’m sorry for introducing this change that might’ve caused problems for you.

I remember having some thoughts after the initial PR that introduced the validation that disallowed multibyte characters here #94 (comment)

I see two possible solutions;

  1. We use this PR but update it to make the allowed multibyte characters regex configurable by the developer.
  2. We try to follow the standards by completely remove the multibyte validation and let the developer implement their own validations to disallow certain characters.

what do you guys think?

@phoet
Copy link
Collaborator

phoet commented Nov 12, 2024

i think that

a) the examples here https://en.wikipedia.org/wiki/Email_address#valid_email_addresses should work as expected
b) validation of display-names is different, so that you can write "Peter Schröder [email protected]"

on top of that, if it's simple enough to access the relevant patterns, one can change the validation easily.

@sasata299
Copy link
Contributor Author

@micke
I am very glad to know that this was already discussed a few years ago. :)

We use this PR but update it to make the allowed multibyte characters regex configurable by the developer.

I think this is a good idea.

Basically, we believe that we should continue to NOT allow multibyte characters as we have in the past.
This would be the least confusing for people using multibyte characters as a language.
(I don't think most users would think of using emojis in their email addresses, and even those that use multibyte characters as a language, like us, rarely use emojis in their email addresses.)

However, the recent PR does allow some characters.
If we are going to be consistent with that, I think it would be very nice to have an option to specify if there are multibyte characters that the developer really wants to allow.

@sasata299
Copy link
Contributor Author

@phoet

validation of display-names is different, so that you can write "Peter Schröder [email protected]"

I agree. Maybe so.
It could be separated from this PR and handled separately.

@sasata299
Copy link
Contributor Author

sasata299 commented Nov 18, 2024

@micke
How about this one?

@sasata299
Copy link
Contributor Author

@micke @phoet

I made the following changes. 6a3e812

  • Basically, do not allow multibyte characters.
  • If you want to allow multibyte characters, set it explicitly

I would think this would cause less confusion since the default behavior is still the same as under ver 6.0.

@micke
Copy link
Owner

micke commented Nov 19, 2024

Looks great! Could you just document the configuration?

@@ -133,7 +140,9 @@ def mx_server_is_in?(domain_list)
def address_contain_emoticons?
return false if @raw_address.nil?

@raw_address.scan(Unicode::Emoji::REGEX).length >= 1
@raw_address.each_char.any? do |char|
Copy link
Collaborator

@phoet phoet Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe i missed the conversation here, but why is scan not used any longer? i think this would be good as a comment here (or even better a test for the edgecase). also, if permitted_multibyte_characters_regex we should probably not do any of this, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe i missed the conversation here, but why is scan not used any longer?

This was a difference that came in this PR, so I simply put it back in.
#257

I don't see any particular problem since each_char also handles one character at a time.

raw_address = 'あいうえお@gmail.com'
raw_address.each_char do |char|
  print char, ' '
end

#=> あ い う え お @ g m a i l . c o m

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this would be good as a comment here (or even better a test for the edgecase). also, if permitted_multibyte_characters_regex we should probably not do any of this, right?

@phoet
I'm sorry.. I didn't properly understand what you meant by your comment, could you please tell me again?

Copy link
Collaborator

@phoet phoet Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a) i would expect to use a regex with scan instead of each_char as it's easier to read and i would guess also much faster. if we dont use it here, we should document why we need to do it differently.

b) if permitted_multibyte_characters_regex is nil i would expect that we do not run the loop and return instead, or skip the call to scan as it does not work with nil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood.
Thank you. I will correct it. 👍

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, thanks!

@sasata299 sasata299 force-pushed the fix_to_invalid_japanese_addresses branch from 94299bf to bb46da4 Compare November 19, 2024 12:40
@phoet
Copy link
Collaborator

phoet commented Nov 19, 2024

👍 LGTM

lib/valid_email2/address.rb Outdated Show resolved Hide resolved
Co-authored-by: Micke Lisinge <[email protected]>
@micke micke merged commit 7baae98 into micke:main Nov 19, 2024
6 checks passed
@micke
Copy link
Owner

micke commented Nov 19, 2024

Thank you @sasata299 for your work and patience!

micke pushed a commit that referenced this pull request Nov 19, 2024
@micke
Copy link
Owner

micke commented Nov 19, 2024

This has been released in version 7.0.0! Thank you @phoet as well! :)

@sasata299 sasata299 deleted the fix_to_invalid_japanese_addresses branch November 20, 2024 02:10
@mscrivo
Copy link
Contributor

mscrivo commented Nov 20, 2024

I understand the motivation behind this change, and I was not fond of the performance of the previous solution, but just to add a data point: We actually do see email addresses with non-ascii chars in the wild, like Japanese ones, and the previous solution worked fairly well for us. Now instead of disallowing emoji's and allowing all other non-ascii chars, the burden is on us to figure out all the possible multibyte chars we need to support.

@phoet
Copy link
Collaborator

phoet commented Nov 20, 2024

I agree that the gem should work out of the box as most would expect. I think it would be great to have a proper real-life test harness with a couple of examples that should be supported. Maybe you could provide a couple of pseudonomized examples from your dataset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants