Regex to detect locales?

To cater for basic variants:

^[A-Za-z]{2,4}([_-][A-Za-z]{4})?([_-]([A-Za-z]{2}|[0-9]{3}))?$

which consists of:

Language code: ISO 639 2 or 3, or 4 for future use, alpha.
Optional script code: ISO 15924 4 alpha.
Optional country code: ISO 3166-1 2 alpha or 3 digit.
Separated by underscores or dashes.

Valid examples are:

de
en-US
zh-Hant-TW
En-au
aZ_cYrl-aZ.

Note that some programming language's functions may only accept particular forms, like only underscores and uppercase country code. PHP's intl functions accept either case and separators. PayPal accepts only the language, or the la_CY form, where la is the language and CY is the country/region. The PHP locale_canonicalize function can be used to standardise to this format.

IETF RFC 5646, which governs internet usage of these tags, recommends a capitalisation and separation format like az-Cyrl-AZ, as used in the first three examples above, though it says processors should accept any mix of case and either separator, as per the last two examples.

However, it also indicates to be wary of applying case-conversions in some locales, as it may produce invalid results with ASCII characters. That means either use a neutral locale to format (en_US), present an explicit list, or only accept entry of the recommended case as each character is typed.

The regex for the recommended basic format is:

^[a-z]{2,4}(-[A-Z][a-z]{3})?(-([A-Z]{2}|[0-9]{3}))?$

The regexp only covers the basic format. There are variants for extras, like local region. The CLDR includes locales en_US_POSIX and ca_ES_VALENCIA. It all depends upon the granularity required. RFC 5646 allows for such variants, along with private extensions and backwards-compatibility forms.

If using a CLDR-based function set, like PHP's intl extension, you can check if a locale exists in the intl database using a function like:

<?php
 function is_locale($locale=''){
  // STANDARDISE INPUT
  $locale=locale_canonicalize($locale);
  
  // LOAD ARRAY WITH LOCALES
  $locales=resourcebundle_locales(NULL);
  
  // RETURN WHETHER FOUND
  return (array_search($locale,$locales)!==F);
 }
?>

It takes about half a millisecond to load and search the data, so it won't be too much of a performance hit.

Of course, it will only find those in the database of the CLDR version supplied with the PHP version used, but will be updated with each subsequent PHP release.

Note that some locales are not for countries, but regions, and these are typically numeric, like 001 for 'World', 150 for 'Europe' and 419 for 'Latin America'. So there are now en-001, en-150, ar-001, and es-419, which can be used for generic language purposes. For example, en-001 was designed to decouple dependence upon en-us as an ersatz English, especially since its date formats and spellings are radically different from the 100 other regional en variants. The en-150 locale is the same as en-001 except for numbering separators and other Europe-specific formats.

However, [generally] a regexp is a good front-end sanity check to filter out illegal characters, and especially to reserve the format for possible future additions. It also helps to prevent malicious character combinations being sent to the lookup facility, especially if text-based lookup command mechanisms, like SQL or Xpath, are used.

That would be testing your input against:

\.[a-z]{2}-[A-Z]{2}$

This is really very literal: "match a dot (\., the dot being a special character in regexes), followed by exactly two of any characters from a to z ([a-z]{2} -- [...] is a character class), followed by a dash (-), followed by two of any characters from A to Z ([A-Z]{2}), followed by the end of input ($).

http://www.dotnetperls.com/regex-match <-- how to apply this regex in C# against an input. It means the code would look like (UNTESTED):

// Post edit: this will really return a boolean
if (Regex.Match(input, @"\.[a-z]{2}-[A-Z]{2}$").Success) {
    // there is a match
}

http://regex.info <-- buy that and read it, it is the BEST resource for regular expressions in the universe

http://regular-expressions.info <-- the second best resource

Rather than use Regex, I suggest you use the built-in support for cultures in .Net, i.e., the System.Globalization.CultureInfo class; the constructor recognizes valid culture strings, and gives you an object that can be used for culture specific operations:

try
{
    string fileName = "MyResource.en-GB";
    string cultureName = System.IO.Path.GetExtension(fileName).TrimStart('.');
    CultureInfo cultureInfo = new CultureInfo(cultureName);
}
catch (ArgumentException)
{
    // Invalid culture.
}

Regex to detect locales?

Tags:

C#

Regex

Related

Recent Posts