Best Regular Expression for Email Format Validation with ASP.NET 3.5 Validation

If you're wondering why this question is generating so little activity, it's because there are so many other issues that should be dealt with before you start thinking about performance. Foremost among those is whether you should be using regexes to validate email addresses at all--and the consensus is that you should not. It's much trickier than most people expect, and probably pointless anyway.

Another problem is that your two regexes vary hugely in the kinds of strings they can match. For example, the second one is anchored at both ends, but the first isn't; it would match a string like ">>>>user@example.com<<<<" because there's something that looks like an email address embedded in it. Maybe the framework forces the regex to match the whole string, but if that's the case, why is the second one anchored?
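To make the anchoring point concrete, here is a minimal sketch in Python (not .NET, purely for illustration). EMAIL_LIKE is a made-up, deliberately loose pattern, not either of the regexes from the question, and the helper names are mine too.

```python
import re

# Hypothetical, deliberately loose pattern; NOT one of the regexes under discussion.
EMAIL_LIKE = r"\w+@\w+\.\w+"

def found_anywhere(text):
    """Unanchored search: succeeds if an email-like run occurs anywhere in text."""
    return re.search(EMAIL_LIKE, text) is not None

def matches_whole(text):
    """Whole-string match: the pattern must consume the entire input."""
    return re.fullmatch(EMAIL_LIKE, text) is not None

noisy = ">>>>user@example.com<<<<"
print(found_anywhere(noisy))              # True: the address is embedded in the junk
print(matches_whole(noisy))               # False: the junk is not part of the match
print(matches_whole("user@example.com"))  # True
```

The same pattern gives opposite answers depending on whether the caller insists on a whole-string match, which is why it matters what the validator does with an unanchored regex.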

Another difference is that the first regex uses \w throughout, while the second uses [0-9a-zA-Z] in many places. In most regex flavors, \w matches the underscore in addition to letters and digits, but in some (including .NET) it also matches letters and digits from every writing system known to Unicode.
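A quick way to see the difference is to run both character classes over a few names. This is a Python sketch, not .NET code: Python's \w is Unicode-aware for ordinary strings much as .NET's is, and re.ASCII stands in for the kind of restriction that, as far as I know, RegexOptions.ECMAScript applies in .NET. The sample names and helper functions are made up.

```python
import re

def only_word_chars(text, ascii_only=False):
    """True if text consists solely of \\w characters (optionally ASCII-only)."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\w+", text, flags=flags) is not None

def only_ascii_alnum(text):
    """True if text consists solely of characters from [0-9a-zA-Z]."""
    return re.fullmatch(r"[0-9a-zA-Z]+", text) is not None

for sample in ["johndoe", "john_doe", "jörg", "Иван"]:
    print(
        sample,
        only_word_chars(sample),                   # Unicode-aware \w
        only_word_chars(sample, ascii_only=True),  # \w limited to [a-zA-Z0-9_]
        only_ascii_alnum(sample),                  # explicit ASCII class, no underscore
    )
```

Only "johndoe" passes all three checks; the underscore and the non-ASCII letters are where the two styles part ways.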

There are many other differences, but that's academic; neither of those regexes is very good. See here for a good discussion of the topic, and a much better regex.

Getting back to the original question, I don't see a performance problem with either of those regexes. Aside from the nested-quantifiers anti-pattern cited in that BCL blog entry, you should also watch out for situations where two or more adjacent parts of the regex can match the same set of characters--for example,

([A-Za-z]+|\w+)@

There's nothing like that in either of the regexes you posted. Parts that are controlled by quantifiers are always broken up by other parts that aren't quantified. Both regexes will experience some avoidable backtracking, but there are many better reasons than performance to reject them.

EDIT: So the second regex is subject to catastrophic backtracking; I should have tested it thoroughly before shooting my mouth off. Taking a closer look at that regex, I don't see why you need the outer asterisk in the first part:

[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*

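All that bit does is make sure the first and last characters are alphanumeric while allowing some additional characters in between. The version below does almost the same thing (it requires at least two characters, so write the tail as an optional group, (?:[-.\w]*[0-9a-zA-Z])?, if single-character names must pass), but it fails much more quickly when no match is possible: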

[0-9a-zA-Z][-.\w]*[0-9a-zA-Z]
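If you want to watch the difference rather than take my word for it, the sketch below times both forms against input that cannot match because it contains no "@". It's Python, not .NET, so the absolute numbers mean nothing for ASP.NET; both are backtracking engines, though, so the growth pattern should look similar. The helper name and the input lengths are arbitrary choices of mine.

```python
import re
import time

# The two local-part fragments, each followed by the "@" they must reach.
NESTED = re.compile(r"[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@")
FLAT = re.compile(r"[0-9a-zA-Z][-.\w]*[0-9a-zA-Z]@")

def timed_match(pattern, text):
    """Return (matched, seconds) for one attempt anchored at the start of text."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start

if __name__ == "__main__":
    for length in (12, 16, 20):
        text = "a" * length  # all alphanumeric, no "@", so both attempts must fail
        _, nested_s = timed_match(NESTED, text)
        _, flat_s = timed_match(FLAT, text)
        print(f"{length:>2} chars  nested {nested_s:.4f}s  flat {flat_s:.6f}s")
```

With the nested quantifier, every extra character should roughly double the time to fail, because the engine tries every way of splitting the run between the inner and outer repetitions. The flat form has only one quantifier to unwind.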

That would probably suffice to eliminate the backtracking problem, but you could also make the part after the "@" more efficient by using an atomic group:

(?>(?:[0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+)[a-zA-Z]{2,9}

In other words, if you've matched all you can of substrings that look like domain components with trailing dots, and the next part doesn't look like a TLD, don't bother backtracking. The first character you would have to give up is the final dot, and you know [a-zA-Z]{2,9} won't match that.
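Here is a small sketch of that domain fragment in Python rather than .NET. Python's re module only gained (?>...) in version 3.11, so to stay portable this emulates the atomic group with the usual lookahead-plus-backreference workaround, (?=(...))\1; in .NET you would write the atomic group directly. The sample domains and the helper name are my own.

```python
import re

# Plain version: the engine may backtrack into the repeated "label." group.
DOMAIN_PLAIN = re.compile(r"(?:[0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9}")

# Emulated atomic group: the lookahead captures as many "label." pieces as it can,
# \1 then consumes exactly that text, and the engine cannot re-enter the lookahead
# to hand characters back.
DOMAIN_ATOMIC = re.compile(
    r"(?=((?:[0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+))\1[a-zA-Z]{2,9}"
)

def accepts(pattern, domain):
    """True if the whole domain string matches the pattern."""
    return pattern.fullmatch(domain) is not None

for domain in ["example.com", "mail.example.com", "mail.example.c0m", "example.", "-bad.com"]:
    print(domain, accepts(DOMAIN_PLAIN, domain), accepts(DOMAIN_ATOMIC, domain))
```

Both patterns should agree on every input, since giving back a "label." piece can never help the TLD part match. The atomic form just refuses to try.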

We use this RegEx, which has been tested in-house against 1.5 million addresses. It correctly identifies better than 98% of ours, but I'm aware of some formats that it gets wrong.

^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$

We also make sure that there are no EOL characters in the data since an EOL can fake out this RegEx. Our Function:

Public Function IsValidEmail(ByVal strEmail As String) As Boolean
    ' Check An eMail Address To Ensure That It Is Valid
    Const cValidEmail = "^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$"   ' 98% Of All Valid eMail Addresses
    IsValidEmail = False
    ' Take Care Of Blanks, Nulls & EOLs
    strEmail = Replace(Replace(Trim$(strEmail & " "), vbCr, ""), vbLf, "")
    ' Blank eMail Is Invalid
    If strEmail = "" Then Exit Function
    ' RegEx Test The eMail Address
    Dim regEx As New System.Text.RegularExpressions.Regex(cValidEmail)
    IsValidEmail = regEx.IsMatch(strEmail)
End Function
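For anyone wondering why the EOL stripping matters, the sketch below runs the same pattern in Python (not VB.NET) to show the effect. In both flavors "$" also matches just before a single trailing newline; the strict end-of-string anchor is spelled \Z in Python and, as far as I know, \z in .NET. The sample addresses, including the valid-looking ones that get rejected, are my own picks and not from the in-house data.

```python
import re

PATTERN = r"^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$"

# Same pattern with the final "$" swapped for Python's strict end anchor.
STRICT = PATTERN[:-1] + r"\Z"

def is_match(text, pattern=PATTERN, flags=0):
    """True if the pattern is found in text."""
    return re.search(pattern, text, flags) is not None

print(is_match("user@example.com\n"))          # True: "$" tolerates the trailing newline
print(is_match("user@example.com\n", STRICT))  # False: \Z does not

# Addresses the pattern rejects: "+" in the local part, an upper-case TLD
# (the pattern is case-sensitive), and a TLD longer than six letters.
for address in ["first.last+tag@example.com", "USER@EXAMPLE.COM", "user@example.photography"]:
    print(address, is_match(address))          # False for each

print(is_match("USER@EXAMPLE.COM", flags=re.IGNORECASE))  # True once case is ignored
```

Stripping CR and LF before the test, as the function above does, closes the newline hole without touching the pattern.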

I am a newbie, but I tried the following, and it seems to limit the ".xxx" part after the '@' symbol to at most two occurrences.

^([a-zA-Z0-9]+[a-zA-Z0-9._%-]*@(?:[a-zA-Z0-9-])+(\.+[a-zA-Z]{2,4}){1,2})$

Note: I had to substitute single '\' with double '\\' as I am using this reg expr in R.
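A quick check of that behavior, sketched in Python rather than R (a Python raw string keeps the single backslashes as written). The sample addresses and the helper name are made up for illustration.

```python
import re

PATTERN = re.compile(
    r"^([a-zA-Z0-9]+[a-zA-Z0-9._%-]*@(?:[a-zA-Z0-9-])+(\.+[a-zA-Z]{2,4}){1,2})$"
)

def accepts(address):
    """True if the address matches the pattern."""
    return PATTERN.search(address) is not None

for address in [
    "user@example.com",    # one ".xxx" group
    "user@example.co.uk",  # two groups
    "user@a.bb.cc.dd",     # three groups
    "user@example..com",   # doubled dot
]:
    print(address, accepts(address))
```

The first two are accepted and the three-group address is rejected, which agrees with the "two or fewer" observation. The doubled dot is accepted as well, because \.+ allows any number of consecutive dots; a single \. would be stricter.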

