Strange Unicode characters pass validation, but fail actually sending an email #100

Open
wenley opened this issue Nov 14, 2019 · 8 comments

Comments

@wenley

wenley commented Nov 14, 2019

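# U+200B ZERO WIDTH SPACE. Passing the encoding avoids a RangeError when
# Encoding.default_internal is not set (a bare 8203.chr only works if it is).
bad_character = 8203.chr(Encoding::UTF_8)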
bad_email = "#{bad_character}[email protected]"

ValidateEmail.valid?(bad_email)
# => true

However, when attempting to send an email to that address via the mail gem, I receive this error: mikel/mail#1126

I'm not entirely sure which gem is the best place to fix this issue, but in my head, this input "should" have been caught by valid_email. Hence, I'm filing the issue here.

@wenley
Author

wenley commented Nov 14, 2019

I'll note that although the codepoint is described as a zero-width space, it does not match Ruby's /\s/ or /[[:space:]]/ character classes.

@amanda-mitchell

I did a little bit of digging into this…

My first thought was to check what category the zero-width space belongs to in order to see if the whole category could be blacklisted. It turns out that it's a member of Cf[1] (which explains why it doesn't get caught by the space regex above), and the other characters don't seem like they ought to be in email addresses. Somewhat hopeful, I went looking for confirmation of this.
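The category claim can be checked in plain Ruby. This is only a sketch: the constant and method names are made up for illustration, and `\p{Cf}` is Ruby's regexp property for the Unicode "Other, Format" category.

```ruby
# U+200B ZERO WIDTH SPACE, the character from the original report.
ZERO_WIDTH_SPACE = "\u200B"

# True when the string contains something Ruby treats as whitespace,
# using either of the two classes mentioned earlier in the thread.
def ruby_whitespace?(text)
  text.match?(/\s/) || text.match?(/[[:space:]]/)
end

# True when the string contains a character in Unicode category Cf.
def format_character?(text)
  text.match?(/\p{Cf}/)
end

puts ruby_whitespace?(ZERO_WIDTH_SPACE)   # not caught by the space classes
puts format_character?(ZERO_WIDTH_SPACE)  # caught by the Cf property
```

Rejecting all of `\p{Cf}` would be an application-level policy, not something the RFCs discussed here require.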

But!

RFC 6531, section 3.2 has this cryptic little note:

Although the characters in the <local-part> are permitted to contain non-ASCII characters, the actual parsing of the <local-part> and the delimiters used are unchanged from the base email specification [RFC5321].

So… 😕

What does RFC 5321 say?

The local-part is either a dot-string or a quoted-string.

A dot-string is an Atom followed by 0 or more "." Atom pairs.

An Atom is 1 or more instances of atext.

An atext is…not defined in this spec. Surprise!

It's actually in RFC 5322 section 3.2.3

atext is an (ASCII) alphabetic character, a digit (again, ASCII), or one of !#$%&'*+-/=?^_`{|}~

So far, not a lot of room for zero-width space…or any other Unicode character, really…
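As a rough sketch of the ASCII-only dot-string grammar described above (this is not how the valid_email gem works internally, and the constant and method names are hypothetical), the rules translate into a small regexp:

```ruby
# Printable specials allowed in atext by RFC 5322, section 3.2.3.
# %q avoids any string interpolation of the "#$" sequence.
ATEXT_SPECIALS = %q(!#$%&'*+-/=?^_`{|}~)

# One atext character: ASCII letter, ASCII digit, or one of the specials.
ATEXT = /[A-Za-z0-9#{Regexp.escape(ATEXT_SPECIALS)}]/

# Dot-string: an Atom (1 or more atext) followed by 0 or more "." Atom pairs.
DOT_STRING = /\A#{ATEXT}+(?:\.#{ATEXT}+)*\z/

# Checks only the local part (the text before the "@").
def ascii_dot_string?(local_part)
  local_part.match?(DOT_STRING)
end

puts ascii_dot_string?("first.last")     # well-formed dot-string
puts ascii_dot_string?("first..last")    # empty Atom between the dots
puts ascii_dot_string?("\u200Bfirst")    # zero-width space is not ASCII atext
```

Under this base grammar a zero-width space is rejected, which is the point being made at this step of the investigation.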

So let's back up to quoted-string

quoted-string is 0 or more QcontentSMTP entities surrounded by a DQUOTE on either side. (What's a DQUOTE? It's a normal double-quote character ", but to verify that, you need to consult RFC 5234, Appendix B.1)

A QcontentSMTP is either a qtextSMTP or a quoted-pairSMTP

Ha…nevermind!

atext gets redefined in RFC 6531 section 3.3 so that it also includes UTF8-non-ascii, which via RFC 6531, section 3.1 is any Unicode character whose UTF8 representation requires 2, 3, or 4 bytes (per RFC 3629, Section 4)[2]
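To make the extended definition concrete, here is a minimal sketch, assuming the "2, 3, or 4 bytes" reading above. The names are illustrative only, and this checks single characters rather than whole addresses.

```ruby
# ASCII atext from RFC 5322: letters, digits, and the listed specials.
ASCII_ATEXT_SPECIALS = %q(!#$%&'*+-/=?^_`{|}~)
ASCII_ATEXT_CHAR = /\A[A-Za-z0-9#{Regexp.escape(ASCII_ATEXT_SPECIALS)}]\z/

# UTF8-non-ascii: a character whose UTF-8 encoding takes 2, 3, or 4 bytes.
def utf8_non_ascii?(char)
  char.encode(Encoding::UTF_8).bytesize.between?(2, 4)
end

# atext as extended by RFC 6531: ASCII atext or any UTF8-non-ascii character.
def extended_atext?(char)
  char.match?(ASCII_ATEXT_CHAR) || utf8_non_ascii?(char)
end

puts extended_atext?("a")          # ASCII letter
puts extended_atext?("\u200B")     # zero-width space, 3 bytes in UTF-8
puts extended_atext?("\u{1F600}")  # grinning face emoji, 4 bytes in UTF-8
puts extended_atext?("@")          # ASCII, but not atext
```

Under this grammar nothing distinguishes a zero-width space or an emoji from an accented letter: all of them are simply non-ASCII.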

So!

It turns out that a valid email address CAN include a zero-width space, but only if the SMTP server supports the SMTPUTF8 extension defined in RFC 6531.

This leaves two major possibilities:

  1. The server we use doesn't support SMTPUTF8, in which case we should refuse to accept any non-ASCII character in an email address prior to performing address validation
  2. The server we use does support SMTPUTF8 and is non-compliant with the spec. (yay!)
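For the first possibility, an application could run a pre-check along these lines before calling the validator. This is a sketch only: the method names and the `user@example.com` address are placeholders, not part of valid_email or the mail gem.

```ruby
# True when every character in the address is ASCII, i.e. the address can be
# handed to an SMTP server that does not advertise SMTPUTF8.
def ascii_only_address?(address)
  address.ascii_only?
end

# Lists the offending characters as code points, which is useful for error
# messages because characters like U+200B are invisible when printed.
def non_ascii_code_points(address)
  address.each_char
         .reject(&:ascii_only?)
         .map { |char| format("U+%04X", char.ord) }
end

suspicious = "\u200Buser@example.com" # placeholder address with a leading ZWSP
puts ascii_only_address?(suspicious)
puts non_ascii_code_points(suspicious).inspect
```

Reporting code points rather than echoing the raw input makes it obvious to the user (or the logs) why a seemingly normal address was refused.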

In either case, I believe that the valid_email gem is probably correct in allowing this character in an email address.

[1] 'Other, Format' https://www.fileformat.info/info/unicode/category/Cf/list.htm
[2] Interestingly, the original UTF-8 definition (RFC 2279) allowed sequences of up to 6 bytes. RFC 3629 later restricted it to at most 4 bytes, which covers every code point up to U+10FFFF, so the 2-, 3-, and 4-byte definition of non-ASCII UTF-8 is complete.

@hallelujah
Owner

Thank you for the investigation @wenley and @amanda-mitchell

I will try to read the other RFC to see which non-ASCII UTF8 characters are allowed

@alexevanczuk

Hi! I wanted to revive this thread because I found another example where valid? returns true and I'm pretty sure it should not:

ValidateEmail.valid?('😀@domain.com')
# => true

@alexevanczuk

I can submit a fix for the emoji issue next week!

@hallelujah
Owner

Hi @alexevanczuk,

Thank you for the report. Re-reading this thread, I think one solution would be to add an option that only allows ASCII characters. That depends on the use case, of course. It could either be added to this gem or implemented as another validation in the application.

@alexevanczuk

@hallelujah I am under the impression that many non-ASCII characters are legitimately allowed in email addresses these days, but emojis are not. I was thinking the emoji change could be made to the existing API and folks can create an "ASCII-only" API later if they want. What do you think?

@alexevanczuk

Actually – looking into it more, it turns out emojis are just a type of Unicode character, hence the existence of this gem: https://github.com/ticky/ruby-emoji-regex.

Given this, it probably makes sense to handle this like other unicode characters. Although – if we can find evidence in the email RFC that some unicode characters (perhaps those used in other non-English alphabets) are allowed, but some (e.g. emojis) are not, that might make more sense to include in the existing API without a separate flag.
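If an application did want to treat emoji differently from other non-ASCII characters, one rough approach is Ruby's built-in `\p{Emoji_Presentation}` regexp property. This is only a sketch with a hypothetical method name. It is a policy choice rather than anything the email RFCs discussed above require, and it is less thorough than the gem linked above.

```ruby
# True when the text contains a character that Unicode marks as displaying
# as an emoji by default. The broader \p{Emoji} property is avoided here
# because it also matches ASCII digits and "#".
def contains_emoji?(text)
  text.match?(/\p{Emoji_Presentation}/)
end

puts contains_emoji?("\u{1F600}@domain.com")   # grinning face emoji
puts contains_emoji?("\u7528\u6237@domain.com") # CJK characters, not emoji
puts contains_emoji?("user1@domain.com")        # plain ASCII
```

Emoji that rely on a variation selector for emoji presentation would not be caught by this property alone.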
