Detect language from string in PHP

I've used the Text_LanguageDetect pear package with some reasonable results. It's dead simple to use, and it has a modest 52 language database. The downside is no detection of Eastern Asian languages.

require_once 'Text/LanguageDetect.php';
$l = new Text_LanguageDetect();
$result = $l->detect($text, 4);
if (PEAR::isError($result)) {
    echo $result->getMessage();
} else {
    print_r($result);
}

results in:

Array
(
    [german] => 0.407037037037
    [dutch] => 0.288065843621
    [english] => 0.283333333333
    [danish] => 0.234526748971
)

I know this is an old post, but here is what I developed after not finding any viable solution.

  • other suggestions are all too heavy and too cumbersome for my situation
  • I support a finite number of languages on my website (at the moment two: 'en' and 'de' - but solution is generalised for more).
  • I need a plausible guess about the language of a user-generated string, and I have a fallback (the language setting of the user).
  • So I want a solution with minimal false positives - but don't care so much about false negatives.

The solution uses the 20 most common words in a language, counts the occurrences of those in the haystack. Then it just compares the counts of the first and second most counted languages. If the runner-up number is less than 10% of the winner, the winner takes it all.

Code - Any suggestions for speed improvement are more than welcome!

    function getTextLanguage($text, $default) {
      $supported_languages = array(
          'en',
          'de',
      );
      // German word list
      // from http://wortschatz.uni-leipzig.de/Papers/top100de.txt
      $wordList['de'] = array ('der', 'die', 'und', 'in', 'den', 'von', 
          'zu', 'das', 'mit', 'sich', 'des', 'auf', 'für', 'ist', 'im', 
          'dem', 'nicht', 'ein', 'Die', 'eine');
      // English word list
      // from http://en.wikipedia.org/wiki/Most_common_words_in_English
      $wordList['en'] = array ('the', 'be', 'to', 'of', 'and', 'a', 'in', 
          'that', 'have', 'I', 'it', 'for', 'not', 'on', 'with', 'he', 
          'as', 'you', 'do', 'at');
      // French word list
      // from https://1000mostcommonwords.com/1000-most-common-french-words/
      $wordList['fr'] = array ('comme', 'que',  'tait',  'pour',  'sur',  'sont',  'avec',
                         'tre',  'un',  'ce',  'par',  'mais',  'que',  'est',
                         'il',  'eu',  'la', 'et', 'dans', 'mot');

      // Spanish word list
      // from https://spanishforyourjob.com/commonwords/
      $wordList['es'] = array ('que', 'no', 'a', 'la', 'el', 'es', 'y',
                         'en', 'lo', 'un', 'por', 'qu', 'si', 'una',
                         'los', 'con', 'para', 'est', 'eso', 'las');
      // clean out the input string - note we don't have any non-ASCII 
      // characters in the word lists... change this if it is not the 
      // case in your language wordlists!
      $text = preg_replace("/[^A-Za-z]/", ' ', $text);
      // count the occurrences of the most frequent words
      foreach ($supported_languages as $language) {
        $counter[$language]=0;
      }
      for ($i = 0; $i < 20; $i++) {
        foreach ($supported_languages as $language) {
          $counter[$language] = $counter[$language] + 
            // I believe this is way faster than fancy RegEx solutions
            substr_count($text, ' ' .$wordList[$language][$i] . ' ');;
        }
      }
      // get max counter value
      // from http://stackoverflow.com/a/1461363
      $max = max($counter);
      $maxs = array_keys($counter, $max);
      // if there are two winners - fall back to default!
      if (count($maxs) == 1) {
        $winner = $maxs[0];
        $second = 0;
        // get runner-up (second place)
        foreach ($supported_languages as $language) {
          if ($language <> $winner) {
            if ($counter[$language]>$second) {
              $second = $counter[$language];
            }
          }
        }
        // apply arbitrary threshold of 10%
        if (($second / $max) < 0.1) {
          return $winner;
        } 
      }
      return $default;
    }