Regex to detect locales?
To cater for basic variants:
^[A-Za-z]{2,4}([_-][A-Za-z]{4})?([_-]([A-Za-z]{2}|[0-9]{3}))?$
which consists of:
- Language code: ISO 639, 2 or 3 letters (4 letters are reserved for future use).
- Optional script code: ISO 15924, 4 letters.
- Optional country/region code: ISO 3166-1, 2 letters, or a 3-digit UN M.49 code.
- Separated by underscores or dashes.
Valid examples are:
- de
- en-US
- zh-Hant-TW
- En-au
- aZ_cYrl-aZ.
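As a quick check of the pattern against the examples above, here is a minimal Python sketch. The helper name looks_like_locale and the two invalid inputs are illustrative assumptions, not part of any library.

```python
import re

# Permissive pattern from above: language, optional script, optional region,
# separated by "_" or "-", in any letter case.
LOCALE_RE = re.compile(
    r"^[A-Za-z]{2,4}([_-][A-Za-z]{4})?([_-]([A-Za-z]{2}|[0-9]{3}))?$"
)

def looks_like_locale(tag: str) -> bool:
    """Return True if tag has the basic language[-script][-region] shape."""
    # fullmatch insists that the whole string is consumed by the pattern.
    return LOCALE_RE.fullmatch(tag) is not None

if __name__ == "__main__":
    for tag in ["de", "en-US", "zh-Hant-TW", "En-au", "aZ_cYrl-aZ",
                "english", "en-USA"]:
        print(tag, looks_like_locale(tag))
```

The five examples from the list pass, while "english" and "en-USA" are rejected because no part of the pattern allows a seven-letter language or a three-letter country.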
Note that some programming languages' functions may only accept particular forms, such as only underscores and an uppercase country code. PHP's intl functions accept either case and either separator. PayPal accepts only the language, or the la_CY form, where la is the language and CY is the country/region. The PHP locale_canonicalize function can be used to standardise to this format.
IETF RFC 5646, which governs internet usage of these tags, recommends a capitalisation and separation format like az-Cyrl-AZ, as used in the first three examples above. Tags are case-insensitive under the RFC, so processors should accept any mix of case, as in the last two examples. The underscore separator is not part of RFC 5646 itself, but it is accepted in CLDR (Unicode) locale identifiers.
However, it also warns against applying case conversions in some locales, as they can turn ASCII letters into non-ASCII characters (in Turkish, for example, uppercasing i gives a dotted İ). That means either using a neutral locale for the conversion (en_US), presenting an explicit list to choose from, or only accepting the recommended case as each character is typed.
The regex for the recommended basic format is:
^[a-z]{2,4}(-[A-Z][a-z]{3})?(-([A-Z]{2}|[0-9]{3}))?$
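One way to get from loosely formatted input to the recommended form is to capture the three parts, recase each one, and rejoin them with hyphens. The following Python sketch assumes that approach; to_recommended_form is a hypothetical helper name. The loose pattern only admits ASCII letters and digits, and Python's str.lower() and str.upper() do not depend on the process locale, so the Turkish-style case problem does not arise here.

```python
import re
from typing import Optional

# Strict pattern for the recommended form, as given above.
STRICT_RE = re.compile(r"^[a-z]{2,4}(-[A-Z][a-z]{3})?(-([A-Z]{2}|[0-9]{3}))?$")

# Loose pattern with capturing groups for language, script and region.
LOOSE_RE = re.compile(
    r"^([A-Za-z]{2,4})(?:[_-]([A-Za-z]{4}))?(?:[_-]([A-Za-z]{2}|[0-9]{3}))?$"
)

def to_recommended_form(tag: str) -> Optional[str]:
    """Return tag recased as language-Script-REGION, or None if malformed."""
    m = LOOSE_RE.fullmatch(tag)
    if m is None:
        return None
    language, script, region = m.groups()
    parts = [language.lower()]
    if script:
        # Title case: first letter upper, the remaining three lower.
        parts.append(script[0].upper() + script[1:].lower())
    if region:
        # Digits are unaffected by upper().
        parts.append(region.upper())
    return "-".join(parts)

if __name__ == "__main__":
    for tag in ["aZ_cYrl-aZ", "En-au", "es_419", "not a locale"]:
        print(repr(tag), "->", to_recommended_form(tag))
```

Anything the helper returns other than None also matches the strict pattern, so the two can be used together: normalise first, then store or compare only the recommended form.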
The regexp only covers the basic format. There are variants for extras, like local region. The CLDR includes the locales en_US_POSIX and ca_ES_VALENCIA. It all depends upon the granularity required. RFC 5646 allows for such variants, along with private extensions and backwards-compatibility forms.
If using a CLDR-based function set, like PHP's intl extension, you can check if a locale exists in the intl database using a function like:
<?php
function is_locale($locale=''){
// STANDARDISE INPUT
$locale=locale_canonicalize($locale);
// LOAD ARRAY WITH LOCALES
$locales=resourcebundle_locales('');
// RETURN WHETHER FOUND (STRICT COMPARISON)
return in_array($locale,$locales,true);
}
?>
It takes about half a millisecond to load and search the data, so it won't be too much of a performance hit.
Of course, it will only find those in the database of the CLDR version supplied with the PHP version used, but this will be updated with each subsequent PHP release.
Note that some locales are not for countries, but regions, and these are typically numeric, like 001 for 'World', 150 for 'Europe' and 419 for 'Latin America'. So there are now en-001, en-150, ar-001, and es-419, which can be used for generic language purposes. For example, en-001 was designed to decouple dependence upon en-US as an ersatz English, especially since its date formats and spellings are radically different from the 100 other regional en variants. The en-150 locale is the same as en-001 except for numbering separators and other Europe-specific formats.
However, a regexp is generally a good front-end sanity check to filter out illegal characters, and especially to reserve the format for possible future additions. It also helps to prevent malicious character combinations from being sent to the lookup facility, especially if text-based lookup mechanisms, like SQL or XPath, are used.
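The idea of using the regexp as a gatekeeper in front of a lookup can be sketched in Python as follows. KNOWN_LOCALES is a hypothetical stand-in for whatever store is really queried (a CLDR list, a database table, an XML file), and is_known_locale is an illustrative name. The lookup in this sketch is case-sensitive, so input would need recasing first if mixed case is to be accepted.

```python
import re

# Shape check: only ASCII letters, digits and the two separators get through.
SHAPE_RE = re.compile(
    r"[A-Za-z]{2,4}([_-][A-Za-z]{4})?([_-]([A-Za-z]{2}|[0-9]{3}))?"
)

# Hypothetical stand-in for the real locale store.
KNOWN_LOCALES = {"de", "en_US", "en_001", "es_419", "zh_Hant_TW"}

def is_known_locale(tag: str) -> bool:
    """Reject malformed input before it ever reaches the lookup."""
    if SHAPE_RE.fullmatch(tag) is None:
        return False
    # Only now is the value used as a lookup key.
    return tag.replace("-", "_") in KNOWN_LOCALES

if __name__ == "__main__":
    for tag in ["en-US", "fr", "en_US'; DROP TABLE locales;--"]:
        print(repr(tag), is_known_locale(tag))
```

Because the quote, semicolon and space characters can never satisfy the pattern, the third input is dropped without touching the store.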
That would be testing your input against:
\.[a-z]{2}-[A-Z]{2}$
This is really very literal: "match a dot (\., escaped because the dot is a special character in regexes), followed by exactly two of any characters from a to z ([a-z]{2} -- [...] is a character class), followed by a dash (-), followed by two of any characters from A to Z ([A-Z]{2}), followed by the end of input ($)".
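To see the pattern's literal behaviour on a few inputs, here is a small Python sketch. The helper name has_culture_suffix and the sample file names are illustrative assumptions.

```python
import re

# A dot, two lowercase letters, a dash, two uppercase letters, end of input.
CULTURE_SUFFIX_RE = re.compile(r"\.[a-z]{2}-[A-Z]{2}$")

def has_culture_suffix(name: str) -> bool:
    """Return True if name ends in a suffix such as .en-GB."""
    # search() scans the string; the $ pins the match to the end.
    return CULTURE_SUFFIX_RE.search(name) is not None

if __name__ == "__main__":
    for name in ["MyResource.en-GB", "MyResource.en-gb",
                 "MyResource.eng-GB", "MyResource"]:
        print(name, has_culture_suffix(name))
```

Only the first name matches: the pattern is case-sensitive and fixes both the language and the country at exactly two letters.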
http://www.dotnetperls.com/regex-match <-- how to apply this regex in C# against an input. It means the code would look like (UNTESTED):
// Requires: using System.Text.RegularExpressions;
// Match(...).Success is a bool, so it can be used directly as the condition
if (Regex.Match(input, @"\.[a-z]{2}-[A-Z]{2}$").Success) {
// there is a match
}
http://regex.info <-- buy that and read it, it is the BEST resource for regular expressions in the universe
http://regular-expressions.info <-- the second best resource
Rather than use Regex, I suggest you use the built-in support for cultures in .NET, i.e., the System.Globalization.CultureInfo class; the constructor recognizes valid culture strings, and gives you an object that can be used for culture-specific operations:
// Requires: using System.Globalization;
try
{
string fileName = "MyResource.en-GB";
string cultureName = System.IO.Path.GetExtension(fileName).TrimStart('.');
CultureInfo cultureInfo = new CultureInfo(cultureName);
}
catch (ArgumentException)
{
// Invalid culture.
}