What's a good regex to include accented characters in a simple way?

Accented Characters: DIY Character Range Subtraction

If your regex engine allows it (and many will), this will work:

(?i)^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ])+$

Please see the demo (you can add characters to test).

Explanation

  • (?i) sets case-insensitive mode
  • The ^ anchor asserts that we are at the beginning of the string
  • (?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ]) matches one character...
  • The lookahead (?![×Þß÷þø]) asserts that the char is not one of those in the brackets
  • [-'0-9a-zÀ-ÿ] allows dash, apostrophe, digits, letters, and every character in the wide accented range À-ÿ; that range also contains a few unwanted characters, which the lookahead above subtracts
  • The + matches that one or more times
  • The $ anchor asserts that we are at the end of the string
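
As a rough illustration, here is how the pattern above might be tried in Python. This is only a sketch: the function name is_valid_with_exclusions and the sample strings are made up for the example, and it assumes the input uses precomposed characters (é as a single code point, not e plus a combining accent).

```python
import re

# Pattern copied from the answer above. Python's re treats str patterns as
# Unicode, so the range À-ÿ covers U+00C0 through U+00FF.
EXCLUSION_RE = re.compile(r"(?i)^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ])+$")


def is_valid_with_exclusions(text: str) -> bool:
    """Return True if the whole string is accepted by the pattern."""
    return EXCLUSION_RE.match(text) is not None


if __name__ == "__main__":
    # Hypothetical sample inputs, chosen only for illustration.
    for sample in ["Jos\u00e9", "O'Neil-M\u00fcller", "a\u00d7b", "Stra\u00dfe"]:
        print(repr(sample), is_valid_with_exclusions(sample))
```

The first two samples are accepted; the last two are rejected because × and ß are listed in the lookahead. One Python-specific caveat: $ also matches just before a trailing newline, so \Z may be preferable if that matters.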

Reference

Extended ASCII Table


A version without the exclusion rules:

^[-'a-zA-ZÀ-ÖØ-öø-ÿ]+$

Explanation

  • The ^ anchor asserts that we are at the beginning of the string
  • [...] allows dash, apostrophe, ASCII letters, and accented characters in the ranges À-Ö, Ø-ö, and ø-ÿ, which skip × and ÷ (unlike the first pattern, digits are not included)
  • The + matches that one or more times
  • The $ anchor asserts that we are at the end of the string
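
To see what the three ranges actually leave out, one option is to test every character from U+00C0 to U+00FF against the pattern. The sketch below does that in Python; the helper names is_valid_plain and rejected_in_block are invented for the example.

```python
import re

# Pattern copied from the answer above (no lookahead, no case-insensitive flag).
NO_EXCLUSION_RE = re.compile(r"^[-'a-zA-ZÀ-ÖØ-öø-ÿ]+$")


def is_valid_plain(text: str) -> bool:
    """Return True if the whole string is accepted by the pattern."""
    return NO_EXCLUSION_RE.match(text) is not None


def rejected_in_block() -> list:
    """List the characters in U+00C0..U+00FF that the pattern rejects."""
    return [chr(cp) for cp in range(0xC0, 0x100) if not is_valid_plain(chr(cp))]


if __name__ == "__main__":
    print(rejected_in_block())
```

Only × (U+00D7) and ÷ (U+00F7) fall in the gaps between the ranges, so characters such as ß, Þ, and ø, which the first pattern excludes, are accepted here.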

Reference

  • Extended ASCII Table

If your engine supports Unicode properties, you can just put this inside the character class of your expression:

\p{L}\p{M}

In a Unicode-aware engine, this will match:

  • any letter character (L) from any language
  • and marks (M) (i.e., characters that are combined with another character: accents, etc.)
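
Not every engine understands \p{...}. Python's built-in re module, for instance, does not (the third-party regex module does). As a rough stand-in, the same idea can be sketched with the standard unicodedata module by checking each character's general category. The function name is_letters_and_marks is made up for this example, and the check assumes the intent is "one or more letters or marks, and nothing else".

```python
import unicodedata


def is_letters_and_marks(text: str) -> bool:
    # Rough equivalent of a class built from \p{L} and \p{M}, repeated one or
    # more times: every character must have a general category starting with
    # "L" (letter) or "M" (mark), and the string must not be empty.
    return bool(text) and all(
        unicodedata.category(ch)[0] in "LM" for ch in text
    )


if __name__ == "__main__":
    # "Jose\u0301" is "José" written with a separate combining acute accent.
    for sample in ["Jos\u00e9", "Jose\u0301", "\u65e5\u672c\u8a9e", "abc1"]:
        print(repr(sample), is_letters_and_marks(sample))
```

Unlike the range-based patterns, this accepts letters from any script as well as decomposed accents, but it rejects digits, dashes, and apostrophes unless they are allowed separately.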
