What would be regex for matching foreign characters?
\p{L} isn't cross-browser yet. Transpiling down from this will give you massively bloated code if you use it a lot.
Here is a short and sweet answer to generally including non-ASCII letters that doesn't add a gazillion lines of JavaScript or plugins. Replace a-zA-Z0-9 or \w in your regex with this, and don't use the u flag:
\u00BF-\u1FFF\u2C00-\uD7FF\w
Inserted into all my JavaScript regexes in place of a-zA-Z0-9 or \w, this seems to do the job. My use case was recognizing UTF-8 text in HTML and CSS, and it had to work cross-browser.
I can't believe it is this simple, so I am waiting to be proved wrong, after a day of searching for something that works in Firefox...
I've only tested this using Japanese hiragana with a French accent.
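A minimal sketch of how that fragment might be dropped into a character class. The names WORD_CHARS, wordPattern and isWord are made up for illustration and are not part of the answer; the test strings are written as escapes so the example does not depend on file encoding.

```javascript
// Character-class body copied from the answer above.
// WORD_CHARS, wordPattern and isWord are hypothetical names.
const WORD_CHARS = "\\u00BF-\\u1FFF\\u2C00-\\uD7FF\\w";

// Anchored so the whole string must consist of "word" characters.
// Deliberately built without the u flag, as the answer advises.
const wordPattern = new RegExp("^[" + WORD_CHARS + "]+$");

function isWord(text) {
  return wordPattern.test(text);
}

console.log(isWord("caf\u00E9"));                // accented Latin letter (é)
console.log(isWord("\u3072\u3089\u304C\u306A")); // four hiragana characters
console.log(isWord("two words"));                // contains a space
```

Note that the ranges are broad: they also cover some non-letter symbols, for example × (U+00D7), so this is a pragmatic approximation of "letters" and not an exact one.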
If you want to match the most common Latin letters with an accent or diacritic mark (the Latin-1 range plus Ž and ž) in virtually any regular expression engine, try:
[A-Za-zŽžÀ-ÿ]
It matches any character in the following "Printable and Extended ASCII Character" sets:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
ŽžÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
Characters matched, with their decimal code in the Windows-1252 ("extended ASCII") table (case sensitive):
char(s) | code (start) | code (end) |
---|---|---|
[A-Z] | 65 | 90 |
[a-z] | 97 | 122 |
Ž | 142 | --- |
ž | 158 | --- |
[À-ÿ] | 192 | 255 |
Test it at https://regex101.com/r/Xbbtm1/1
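As a rough illustration of that class in JavaScript, here is a sketch that writes the non-ASCII members as \u escapes (Ž is U+017D, ž is U+017E, À-ÿ is U+00C0 to U+00FF), which should be equivalent to the literal form as long as the source file is read as UTF-8. The names latinName and isLatinName are hypothetical.

```javascript
// Same class as [A-Za-zŽžÀ-ÿ], spelled with escapes to avoid encoding issues.
// latinName and isLatinName are hypothetical names used only for this sketch.
const latinName = /^[A-Za-z\u017D\u017E\u00C0-\u00FF]+$/;

function isLatinName(text) {
  return latinName.test(text);
}

console.log(isLatinName("\u017Deljko"));          // starts with Ž
console.log(isLatinName("\u00C5ngstr\u00F6m"));   // Å and ö are in À-ÿ
console.log(isLatinName("\u0141\u00F3d\u017A"));  // Ł and ź are outside the class
```

Two things to keep in mind: the À-ÿ range also contains the symbols × (U+00D7) and ÷ (U+00F7), which are not letters, and Latin letters beyond this range, such as Ł or ő, are not matched.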
If all you want to match is letters (including "international" letters) you can use \p{L}.
You can find some information on regex and Unicode here.
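In JavaScript, \p{L} only works when the regex has the u flag, and it assumes an engine that supports Unicode property escapes (ES2018). A small sketch under that assumption, with the hypothetical names lettersOnly and isLetters:

```javascript
// Requires the u flag and an engine with Unicode property escapes (ES2018).
// lettersOnly and isLetters are hypothetical names used only for this sketch.
const lettersOnly = /^\p{L}+$/u;

function isLetters(text) {
  return lettersOnly.test(text);
}

console.log(isLetters("\u0141\u00F3d\u017A"));          // Polish letters
console.log(isLetters("\u3072\u3089\u304C\u306A"));     // hiragana
console.log(isLetters("abc123"));                       // digits are not letters
console.log(isLetters("\u00D7"));                       // × is a symbol, not a letter
```

Unlike the hand-written ranges, \p{L} excludes symbols such as × and covers letters from every script, at the cost of needing the newer syntax.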