Unicode characters in Regex

Try incorporating \p{L}, which matches any Unicode "letter". Both a and á match \p{L}.


For reference, you don't need to escape the ' , and . characters inside a character class [], and you can avoid escaping the dash - by placing it at the beginning or end of the class.
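
As a rough sketch of both points, the pattern below puts \p{L} inside a character class next to unescaped punctuation, with the dash placed last. The exact set of allowed characters and the sample inputs are assumptions chosen for illustration, not the original requirements.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharacterClassDemo
{
    // \p{L} covers letters of any script; ' , . need no escaping inside [],
    // and the trailing - is read as a literal dash rather than a range.
    public static readonly Regex Allowed = new Regex(@"^[\p{L}0-9@#%&',.\s-]+$");

    public static void Main()
    {
        string[] samples = { "Seán O'Brien-Smith", "Zoë, Jr.", "semi;colon" };
        foreach (string sample in samples)
            Console.WriteLine("{0}: {1}", sample, Allowed.IsMatch(sample));
    }
}
```

The first two samples should be accepted; the third is rejected because ; is not in the class.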

You can use \p{L}, which matches any kind of letter from any language. See the example below:

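using System;
using System.Text.RegularExpressions;

string[] names = { "Brendán", "Jóhn", "Jason" };
// ^\p{L}+$ accepts a string made up entirely of Unicode letters.
Regex rgx = new Regex(@"^\p{L}+$");
foreach (string name in names)
    Console.WriteLine("{0} {1} a valid name.", name, rgx.IsMatch(name) ? "is" : "is not");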

// Brendán is a valid name.
// Jóhn is a valid name.
// Jason is a valid name.
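
One caveat that may be worth checking: \p{L} matches letters but not combining marks, so a name stored in decomposed form (a plain a followed by U+0301) would not pass ^\p{L}+$. The sketch below shows two possible workarounds, normalizing to Form C first or also allowing \p{M}. The LettersAndMarks pattern is an assumption for illustration, not part of the answer above.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CombiningMarkDemo
{
    public static readonly Regex LettersOnly = new Regex(@"^\p{L}+$");

    // Hypothetical variant: each letter may be followed by combining marks.
    public static readonly Regex LettersAndMarks = new Regex(@"^(\p{L}\p{M}*)+$");

    public static void Main()
    {
        // 'a' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form of á).
        string decomposed = "Brenda\u0301n";

        // U+0301 is a mark, not a letter, so the letters-only pattern rejects it.
        Console.WriteLine(LettersOnly.IsMatch(decomposed));
        Console.WriteLine(LettersAndMarks.IsMatch(decomposed));

        // Form C composes a + U+0301 into the single letter á before matching.
        string composed = decomposed.Normalize(NormalizationForm.FormC);
        Console.WriteLine(LettersOnly.IsMatch(composed));
    }
}
```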

Or simply add the characters you want to allow to your character class [].

@"^[a-zA-Z0-9áéíóú@#%&',.\s-]+$"
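
A quick check of this pattern, as a minimal sketch with made-up sample names: because only the lowercase accented vowels are listed, a name containing an uppercase accented vowel is rejected unless those characters are added too (or RegexOptions.IgnoreCase is used).

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExplicitClassDemo
{
    // Same pattern as above: only the lowercase accented vowels are listed.
    public static readonly Regex Explicit = new Regex(@"^[a-zA-Z0-9áéíóú@#%&',.\s-]+$");

    public static void Main()
    {
        Console.WriteLine(Explicit.IsMatch("Seán O'Brien")); // True
        Console.WriteLine(Explicit.IsMatch("Éamon"));        // False: É is not in the class
    }
}
```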

To extend your regular expression to include vowels with an acute accent (fada), you can use Unicode code points. The relevant Unicode blocks are:

  • C0 controls and Basic Latin
  • C1 controls and Latin-1 Supplement
  • and possibly Latin Extended-A

More Unicode code charts are available at http://www.unicode.org/charts/index.html#scripts, covering Latin Extended-B, -C and -D and Latin Extended Additional (which together ought to cover pretty much every European language in its entirety).

So, we see that the Irish fada vowels are:

  • Á is \u00C1; á is \u00E1
  • É is \u00C9; é is \u00E9
  • Í is \u00CD; í is \u00ED
  • Ó is \u00D3; ó is \u00F3
  • Ú is \u00DA; ú is \u00FA
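
If you would rather confirm these values than read them off the charts, a small sketch like the following prints the code point of each character. The class and method names are hypothetical.

```csharp
using System;

public static class FadaCodePoints
{
    public const string Fadas = "ÁÉÍÓÚáéíóú";

    // Formats a character as "<char> is \uXXXX" using its UTF-16 code unit value.
    public static string Describe(char c) => string.Format("{0} is \\u{1:X4}", c, (int)c);

    public static void Main()
    {
        foreach (char c in Fadas)
            Console.WriteLine(Describe(c));
    }
}
```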

And thus your regular expression needs to be extended:

Regex rx = new Regex( @"^[A-Za-z\u00C1\u00C9\u00CD\u00D3\u00DA\u00E1\u00E9\u00ED\u00F3\u00FA][A-Za-z\u00C1\u00C9\u00CD\u00D3\u00DA\u00E1\u00E9\u00ED\u00F3\u00FA0-9@#%&\'\-\s\.\,*]*$");
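
The same pattern can be written with less repetition. The sketch below is one possible restatement, assuming the intent is "a letter first, then any mix of letters, digits, whitespace and the listed punctuation"; the Letters constant and the sample inputs are introduced here for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FadaRegexDemo
{
    // ASCII letters plus the ten fada vowels, written as code points.
    const string Letters = @"A-Za-z\u00C1\u00C9\u00CD\u00D3\u00DA\u00E1\u00E9\u00ED\u00F3\u00FA";

    // First character must be a letter; the dash sits last so it needs no escape.
    public static readonly Regex Rx =
        new Regex("^[" + Letters + "][" + Letters + @"0-9@#%&'\s.,*-]*$");

    public static void Main()
    {
        Console.WriteLine(Rx.IsMatch("Órla Ní Bhriain")); // True
        Console.WriteLine(Rx.IsMatch("9Lives"));          // False: must start with a letter
    }
}
```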

Tags: C#, .NET, Regex