Fix false multibyte character detection #261

sasata299 · 2024-11-11T09:10:54Z

#257
In this PR, changes are being made to prevent Scandinavian characters from being judged as emoji.

However, this method had a large impact on the validity of email addresses, and for example, email addresses containing Japanese characters were erroneously determined to be VALID.(The expected value is INVALID).

I made the following changes.

Basically, do not allow multibyte characters
If you want to allow multibyte characters, set it explicitly

ValidEmail2::Address.permitted_multibyte_characters_regex = /[ÆæØøÅåÄäÖöÞþÐð]/

phoet · 2024-11-11T10:20:27Z

Interesting, did not know that this is part of the validation.

I guess emojii are actually valid according to wikipedia with SMTPUTF8 https://en.wikipedia.org/wiki/Email_address#Invalid_email_addresses

I havnt actually seen any none-ascii adresses in the wild though. maybe split the validation into the display-name, local and domain parts?

sasata299 · 2024-11-12T02:31:49Z

Thank you for reply!

It seems that RFC 6531 permits the use of email addresses containing non-ASCII characters.
https://datatracker.ietf.org/doc/html/rfc6531

However, I feel that the actual cases in which non-ASCII characters can be used are quite limited, such as requiring that the source/destination servers support SMTPUTF8.
(And I have never seen it actually used..)

maybe split the validation into the display-name, local and domain parts?

It is a good policy.
But there is the question of how many email addresses containing non-ascii characters can be used in the real world.

I think we should fix the degraded part first, and then take the time to consider the future policy afterwards.

micke · 2024-11-12T07:23:46Z

I’m sorry for introducing this change that might’ve caused problems for you.

I remember having some thoughts after the initial PR that introduced the validation that disallowed multibyte characters here #94 (comment)

I see two possible solutions;

We use this PR but update it to make the allowed multibyte characters regex configurable by the developer.
We try to follow the standards by completely remove the multibyte validation and let the developer implement their own validations to disallow certain characters.

what do you guys think?

phoet · 2024-11-12T13:20:01Z

i think that

a) the examples here https://en.wikipedia.org/wiki/Email_address#valid_email_addresses should work as expected
b) validation of display-names is different, so that you can write "Peter Schröder [email protected]"

on top of that, if it's simple enough to access the relevant patterns, one can change the validation easily.

sasata299 · 2024-11-13T02:20:20Z

@micke
I am very glad to know that this was already discussed a few years ago. :)

We use this PR but update it to make the allowed multibyte characters regex configurable by the developer.

I think this is a good idea.

Basically, we believe that we should continue to NOT allow multibyte characters as we have in the past.
This would be the least confusing for people using multibyte characters as a language.
(I don't think most users would think of using emojis in their email addresses, and even those that use multibyte characters as a language, like us, rarely use emojis in their email addresses.)

However, the recent PR does allow some characters.
If we are going to be consistent with that, I think it would be very nice to have an option to specify if there are multibyte characters that the developer really wants to allow.

sasata299 · 2024-11-13T02:22:36Z

@phoet

validation of display-names is different, so that you can write "Peter Schröder [email protected]"

I agree. Maybe so.
It could be separated from this PR and handled separately.

sasata299 · 2024-11-18T00:52:11Z

@micke
How about this one?

sasata299 · 2024-11-19T04:08:39Z

@micke @phoet

I made the following changes. 6a3e812

Basically, do not allow multibyte characters.
If you want to allow multibyte characters, set it explicitly

I would think this would cause less confusion since the default behavior is still the same as under ver 6.0.

micke · 2024-11-19T08:15:03Z

Looks great! Could you just document the configuration?

phoet · 2024-11-19T08:56:45Z

lib/valid_email2/address.rb

@@ -133,7 +140,9 @@ def mx_server_is_in?(domain_list)
    def address_contain_emoticons?
      return false if @raw_address.nil?

-      @raw_address.scan(Unicode::Emoji::REGEX).length >= 1
+      @raw_address.each_char.any? do |char|


maybe i missed the conversation here, but why is scan not used any longer? i think this would be good as a comment here (or even better a test for the edgecase). also, if permitted_multibyte_characters_regex we should probably not do any of this, right?

maybe i missed the conversation here, but why is scan not used any longer?

This was a difference that came in this PR, so I simply put it back in.
#257

I don't see any particular problem since each_char also handles one character at a time.

raw_address = 'あいうえお@gmail.com' raw_address.each_char do |char| print char, ' ' end #=> あいうえお @ g m a i l . c o m

i think this would be good as a comment here (or even better a test for the edgecase). also, if permitted_multibyte_characters_regex we should probably not do any of this, right?

@phoet
I'm sorry.. I didn't properly understand what you meant by your comment, could you please tell me again?

a) i would expect to use a regex with scan instead of each_char as it's easier to read and i would guess also much faster. if we dont use it here, we should document why we need to do it differently.

b) if permitted_multibyte_characters_regex is nil i would expect that we do not run the loop and return instead, or skip the call to scan as it does not work with nil

Understood.
Thank you. I will correct it. 👍

cool, thanks!

phoet · 2024-11-19T12:49:36Z

👍 LGTM

lib/valid_email2/address.rb

Co-authored-by: Micke Lisinge <[email protected]>

micke · 2024-11-19T22:45:08Z

Thank you @sasata299 for your work and patience!

…byte characters configurable (#261)

micke · 2024-11-19T22:57:39Z

This has been released in version 7.0.0! Thank you @phoet as well! :)

mscrivo · 2024-11-20T14:20:53Z

I understand the motivation behind this change, and I was not fond of the performance of the previous solution, but just to add a data point: We actually do see email addresses with non-ascii chars in the wild, like Japanese ones, and the previous solution worked fairly well for us. Now instead of disallowing emoji's and allowing all other non-ascii chars, the burden is on us to figure out all the possible multibyte chars we need to support.

phoet · 2024-11-20T14:26:02Z

I agree that the gem should work out of the box as most would expect. I think it would be great to have a proper real-life test harness with a couple of examples that should be supported. Maybe you could provide a couple of pseudonomized examples from your dataset.

sasata299 added 3 commits November 11, 2024 17:46

spec: add spec for address include Japanese characters

8415b7d

fix: allow only Scandinavian characters

726c202

stop using unicode/emoji gem

b2d17b5

Specify only allowed multibyte characters in the email address

6a3e812

phoet reviewed Nov 19, 2024

View reviewed changes

sasata299 added 2 commits November 19, 2024 18:48

Add readme for permitted_multibyte_characters_regex method

79d3da2

Make logic easy to understand

bb46da4

sasata299 force-pushed the fix_to_invalid_japanese_addresses branch from 94299bf to bb46da4 Compare November 19, 2024 12:40

Change the method name

267fac1

micke reviewed Nov 19, 2024

View reviewed changes

lib/valid_email2/address.rb Outdated Show resolved Hide resolved

Use ascii_only? method

cdf6717

Co-authored-by: Micke Lisinge <[email protected]>

micke merged commit 7baae98 into micke:main Nov 19, 2024
6 checks passed

micke pushed a commit that referenced this pull request Nov 19, 2024

fix!: revert allowing scandinavian characters by making allowed multi…

cf6a1e9

…byte characters configurable (#261)

github-actions bot mentioned this pull request Nov 19, 2024

chore(main): release 7.0.0 #262

Merged

sasata299 deleted the fix_to_invalid_japanese_addresses branch November 20, 2024 02:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix false multibyte character detection #261

Fix false multibyte character detection #261

sasata299 commented Nov 11, 2024 •

edited by micke

Loading

phoet commented Nov 11, 2024

sasata299 commented Nov 12, 2024

micke commented Nov 12, 2024

phoet commented Nov 12, 2024

sasata299 commented Nov 13, 2024

sasata299 commented Nov 13, 2024

sasata299 commented Nov 18, 2024 •

edited

Loading

sasata299 commented Nov 19, 2024

micke commented Nov 19, 2024

phoet Nov 19, 2024 •

edited

Loading

sasata299 Nov 19, 2024

sasata299 Nov 19, 2024

phoet Nov 19, 2024 •

edited

Loading

sasata299 Nov 19, 2024

phoet Nov 19, 2024

phoet commented Nov 19, 2024

micke commented Nov 19, 2024

micke commented Nov 19, 2024 •

edited

Loading

mscrivo commented Nov 20, 2024

phoet commented Nov 20, 2024

Fix false multibyte character detection #261

Fix false multibyte character detection #261

Conversation

sasata299 commented Nov 11, 2024 • edited by micke Loading

phoet commented Nov 11, 2024

sasata299 commented Nov 12, 2024

micke commented Nov 12, 2024

phoet commented Nov 12, 2024

sasata299 commented Nov 13, 2024

sasata299 commented Nov 13, 2024

sasata299 commented Nov 18, 2024 • edited Loading

sasata299 commented Nov 19, 2024

micke commented Nov 19, 2024

phoet Nov 19, 2024 • edited Loading

Choose a reason for hiding this comment

sasata299 Nov 19, 2024

Choose a reason for hiding this comment

sasata299 Nov 19, 2024

Choose a reason for hiding this comment

phoet Nov 19, 2024 • edited Loading

Choose a reason for hiding this comment

sasata299 Nov 19, 2024

Choose a reason for hiding this comment

phoet Nov 19, 2024

Choose a reason for hiding this comment

phoet commented Nov 19, 2024

micke commented Nov 19, 2024

micke commented Nov 19, 2024 • edited Loading

mscrivo commented Nov 20, 2024

phoet commented Nov 20, 2024

sasata299 commented Nov 11, 2024 •

edited by micke

Loading

sasata299 commented Nov 18, 2024 •

edited

Loading

phoet Nov 19, 2024 •

edited

Loading

phoet Nov 19, 2024 •

edited

Loading

micke commented Nov 19, 2024 •

edited

Loading