What would be the regex for matching foreign characters?

\p{L} isn't cross-browser yet. Transpiling down from this will give you massively bloated code if you use it a lot.

Here is a short and sweet way to include non-ASCII letters in general, without adding a gazillion lines of JavaScript or plugins. Replace a-zA-Z0-9 or \w inside your character class ([...]) with this, and don't use the u flag:

\u00BF-\u1FFF\u2C00-\uD7FF\w

Inserted into all my JavaScript regexes in place of a-zA-Z0-9 or \w, this seems to do the job. My context was recognizing UTF-8 text in HTML and CSS, and it had to work cross-browser.

I can't believe it is this simple, so I am waiting to be proved wrong, after a day of searching for something that works in Firefox...

I've only tested this using Japanese hiragana with a French accent.
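
As a rough sketch of how the ranges above might be used, the snippet below wraps them in a character class and anchors the pattern so the whole string has to match. The names wordChars and isWord are made up for illustration and are not from the answer. Note that the ranges are broad, so some non-letters such as ¿ (U+00BF) also pass.

```javascript
// Hypothetical names; the character class is the one from the answer,
// wrapped in [...] and anchored so the whole string must match.
const wordChars = /^[\u00BF-\u1FFF\u2C00-\uD7FF\w]+$/; // note: no u flag

function isWord(text) {
  return wordChars.test(text);
}

console.log(isWord("café"));       // true: the accented letter is in U+00BF-U+1FFF
console.log(isWord("ひらがな"));    // true: hiragana sits in U+3040-U+309F
console.log(isWord("two words"));  // false: the space is not in the class
console.log(isWord("¿"));          // true: punctuation, yet inside the range
```

If stricter matching is needed, the ranges would have to be narrowed by hand, which is the trade-off for avoiding \p{L}.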


If you want to match the common Latin letters with an accent or diacritic mark (the Latin-1 range plus Ž and ž) in virtually any regular expression engine, try:

[A-Za-zŽžÀ-ÿ]

It matches any character in the following "Printable and Extended ASCII Character" sets:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
ŽžÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

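Matches {char} (extended ASCII / Windows-1252 character index, case sensitive):

char(s)  index(start)  index(end)
[A-Z]    65            90
[a-z]    97            122
Ž        142           ---
ž        158           ---
[À-ÿ]    192           255

The 142 and 158 values are Windows-1252 positions; in Unicode, Ž is U+017D (381) and ž is U+017E (382).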

Test it at https://regex101.com/r/Xbbtm1/1
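
For a quick check outside regex101, here is a minimal JavaScript sketch of the same class. The names latinLetters and isLatinWord are hypothetical, and the NFC normalization step is an addition of mine (so that a letter typed as base character plus combining accent is compared in its single-character form), not part of the answer.

```javascript
// The class from the answer, anchored so every character must belong to it.
const latinLetters = /^[A-Za-zŽžÀ-ÿ]+$/;

// Assumption: normalize to NFC first so "e" + combining accent becomes one "é".
function isLatinWord(text) {
  return latinLetters.test(text.normalize("NFC"));
}

console.log(isLatinWord("Žižek"));  // true
console.log(isLatinWord("Señor"));  // true
console.log(isLatinWord("Łódź"));   // false: Ł and ź lie outside À-ÿ
console.log(isLatinWord("×"));      // true: the multiplication sign is U+00D7
```

The last two lines show the limits of the range: letters beyond U+00FF are missed, while × and ÷ are accepted because they sit between À and ÿ.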


If all you want to match is letters (including "international" letters) you can use \p{L}.

You can find some information on regex and Unicode here.
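
As a small illustration, assuming a JavaScript engine that supports Unicode property escapes (they require the u flag), \p{L} could be used like this. The name lettersOnly is made up for the example.

```javascript
// \p{L} matches any code point in the Unicode "Letter" category.
// In JavaScript it only works when the regex has the u flag.
const lettersOnly = /^\p{L}+$/u;

console.log(lettersOnly.test("Straße"));  // true
console.log(lettersOnly.test("日本語"));   // true
console.log(lettersOnly.test("abc123"));  // false: digits are not letters
console.log(lettersOnly.test("×"));       // false: a math symbol, not a letter
```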
