Safe email validation
Are those email addresses valid?
Yes, they are. See for example here or with a bit more explanation here.
For a nice explanation on how emails may look, see the informational RFC3696. The more technical RFCs are linked there as well.
Attacks possible in the local part of an Email Address
Without quotes, local-parts may consist of any combination of
alphabetic characters, digits, or any of the special characters ! # $ % & ' * + - / = ? ^ _ ` . { | } ~
period (".") may also appear, but may not be used to start or end the local part, nor may two or more consecutive periods appear. Stated differently, any ASCII graphic (printing) character other than the at-sign ("@"), backslash, double quote, comma, or square brackets may appear without quoting. If any of that list of excluded characters are to appear, they must be quoted.
So the rule is more or less: most characters can be part of the local part, except for @ \ " , [ ] which must be enclosed in double quotes (except of course the double quote itself, which has to be escaped when in a quoted string).
There are also rules on where and when to quote and how to handle comments, but that's less relevant to your question.
The point here is that many attacks can be part of the local part of an email address, for example:
'/**/OR/**/1=1/**/--/**/@a.a
"<script>alert(1)</script>"@example.com
" onmouseover=alert(1) foo="@example.com
"../../../../../test%00"@example.com
- ...
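To make the unquoted rule concrete, here is a minimal sketch in Python. It assumes the character list quoted above is the complete unquoted set; the function name is made up for illustration, and quoted strings and comments are deliberately not handled.

```python
import string

# Characters allowed in an unquoted local part (besides the period),
# taken from the list quoted above.
ATEXT = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")


def is_unquoted_local_part(local: str) -> bool:
    """Return True if `local` is acceptable without any quoting."""
    # A period may not start or end the local part.
    if not local or local.startswith(".") or local.endswith("."):
        return False
    # Two or more consecutive periods are not allowed.
    if ".." in local:
        return False
    return all(ch in ATEXT or ch == "." for ch in local)


# The SQL-injection-style example needs no quoting at all.
print(is_unquoted_local_part("'/**/OR/**/1=1/**/--/**/"))  # True
# Angle brackets are not in the list above, so this payload needs quotes.
print(is_unquoted_local_part("<script>alert(1)</script>"))  # False
```

So even a checker that refuses quoted strings entirely still lets the first payload through.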
Attacks possible in the domain part of an Email Address
The exact structure of the domain part can be seen in RFC2822 or RFC5322:
addr-spec       = local-part "@" domain
local-part      = dot-atom / quoted-string / obs-local-part
domain          = dot-atom / domain-literal / obs-domain
domain-literal  = [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]
dcontent        = dtext / quoted-pair
dtext           = NO-WS-CTL /     ; Non white space controls
                  %d33-90 /       ; The rest of the US-ASCII
                  %d94-126        ;  characters not including "[",
                                  ;  "]", or "\"
Where:
dtext           = %d33-90 /       ; Printable US-ASCII
                  %d94-126 /      ;  characters not including
                  obs-dtext       ;  "[", "]", or "\"
You can see that, again, most characters are allowed (even non-printing control characters). Possible attacks would be:
[email protected]&a=////etc/passwd
foo@bar(<script>alert(1)</script>).com
foo@'/**/OR/**/1=1/**/--/**/
Conclusion
You can't validate email addresses safely.
Instead, you need to make sure to have proper defenses in place (HTML encoding for XSS, prepared statements for SQL injection, etc).
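As a rough illustration of those two defenses, here is a minimal Python sketch using only the standard library. The `users` table and the `<td>` wrapper are assumptions made up for the example; the address is one of the examples above.

```python
import html
import sqlite3

# A syntactically valid address carrying an XSS payload.
address = '"<script>alert(1)</script>"@example.com'

# Prepared statement: the value is passed as data, never spliced into SQL text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")  # hypothetical table
conn.execute("INSERT INTO users (email) VALUES (?)", (address,))
stored = conn.execute("SELECT email FROM users").fetchone()[0]

# HTML-encode at output time, right before the value is placed in a page.
rendered = "<td>" + html.escape(stored) + "</td>"
print(rendered)
```

The address is stored unchanged and only encoded when it is written into HTML, so nothing depends on the validator having caught the payload.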
As defense in depth, you could forbid quoted strings and comments to gain some amount of protection, as these two things allow the most unusual characters and strings. But some attacks are still possible, and you will exclude a small number of users.
If you do need additional input filtering that exceeds the limits of the email format, because you do not trust the rest of your application, you should carefully consider what you do allow and what you do not allow. For example, + is used by Gmail to let users filter incoming mail, so disallowing it may keep some users from signing up. Other characters may be used by other providers for similar functionality. A first approach might be to only allow alphanumerics plus ! # % * + - = ? ^ _ . | ~ which would disallow < > ' " ` / $ { } & (characters used in common attacks). Depending on your application, you may want to disallow further characters.
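A minimal Python sketch of such an allowlist follows. It is a character filter only, not a syntax check; the function name is hypothetical, and applying the same set to both sides of the "@" is an assumption made to keep the example short.

```python
import string

# Allowlist from the text: alphanumerics plus ! # % * + - = ? ^ _ . | ~
ALLOWED = set(string.ascii_letters + string.digits + "!#%*+-=?^_.|~")


def passes_allowlist(address: str) -> bool:
    """Accept exactly one "@" with only allowlisted characters around it."""
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        return False
    # A second "@" ends up in `domain` and is rejected here as well.
    return all(ch in ALLOWED for ch in local + domain)


print(passes_allowlist("first.last+news@example.com"))   # True
print(passes_allowlist("'/**/OR/**/1=1/**/--/**/@a.a"))  # False
```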
Since you mentioned RFC822: it is a bit outdated (it's from 1982), but even it allows quoted strings and comments, so accepting only RFC822-compliant addresses would be neither practical nor effective.
Also, are you checking your emails client-side? The JS code gives that impression. An attacker could just bypass client-side checks.
The simplest way to test this would be to try sending an email to that address, from a send-only address (i.e. from [email protected]). If it can't be delivered, it's not valid.
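A minimal sketch of that idea in Python, with the actual sending left out so it runs offline. The sender address, the confirmation URL, and the one-time token are all assumptions added for illustration; the token is a common extension that also confirms the recipient can read the mailbox.

```python
import secrets
from email.message import EmailMessage


def build_verification_email(address: str, sender: str = "noreply@example.com"):
    """Build a confirmation message carrying a one-time token."""
    token = secrets.token_urlsafe(32)
    msg = EmailMessage()
    msg["From"] = sender  # hypothetical send-only address
    msg["To"] = address
    msg["Subject"] = "Please confirm your email address"
    # Hypothetical confirmation URL; a real app would use its own route.
    msg.set_content(f"Confirm: https://example.com/confirm?token={token}")
    return token, msg


token, msg = build_verification_email("user@example.com")
print(msg["To"])

# Sending would look roughly like this, assuming a local SMTP relay:
#   import smtplib
#   with smtplib.SMTP("localhost") as smtp:
#       smtp.send_message(msg)
```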
Checking emails with a regex is probably best done on the client side, to let users know about possible typos in their email address before they register.
You say you want to have safe e-mail addresses. I presume this means these are put into your app and you expect some predictable output. The developers who write your app have in their collective head some idea of what to expect inside an e-mail field, and you had better not allow anything else there. What your programmers don't expect is not very safe (even if it's valid according to some horrifying RFCs).
So if your developers are not very much into email-related RFCs, I suggest using "a willful violation of RFC 5322" that happens to exist within a W3C standard for HTML5, and translates to quite a simple regular expression:
^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$
source http://www.w3.org/TR/html5/forms.html#valid-e-mail-address
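For reference, a minimal Python sketch applying that expression server-side. The helper name is made up; `fullmatch` is used because Python's `$` would otherwise also match just before a trailing newline.

```python
import re

# The HTML5 expression from above, copied verbatim.
HTML5_EMAIL = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)


def is_html5_email(address: str) -> bool:
    """Return True if the whole string matches the HTML5 expression."""
    return HTML5_EMAIL.fullmatch(address) is not None


print(is_html5_email("user+tag@example.com"))                     # True
print(is_html5_email('"<script>alert(1)</script>"@example.com'))  # False
print(is_html5_email("'/**/OR/**/1=1/**/--/**/@a.a"))             # True
```

Note that quoted strings are rejected, but the unquoted SQL-style payload still matches, so this check does not replace output encoding or prepared statements.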
In case this is too lax (if you think your developers don't expect those strange # $ % & | etc.), I suggest securing it a bit more:
^[a-zA-Z0-9.+/=?^_-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)+$
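A short Python sketch of what this tighter expression accepts and rejects. The helper name is hypothetical. Besides the smaller character set, the trailing `+` means the domain must contain at least one dot.

```python
import re

# The stricter expression from above, copied verbatim.
STRICT_EMAIL = re.compile(
    r"^[a-zA-Z0-9.+/=?^_-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)+$"
)


def is_strict_email(address: str) -> bool:
    """Return True if the whole string matches the stricter expression."""
    return STRICT_EMAIL.fullmatch(address) is not None


print(is_strict_email("first.last+tag@example.com"))    # True
print(is_strict_email("'/**/OR/**/1=1/**/--/**/@a.a"))  # False: ' and * are excluded
print(is_strict_email("user@localhost"))                # False: no dot in the domain
```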
I think 99.9% of real people's addresses match both of these expressions.