Detect language from string in PHP
I've used the Text_LanguageDetect pear package with some reasonable results. It's dead simple to use, and it has a modest 52 language database. The downside is no detection of Eastern Asian languages.
require_once 'Text/LanguageDetect.php';
$l = new Text_LanguageDetect();
$result = $l->detect($text, 4);
if (PEAR::isError($result)) {
echo $result->getMessage();
} else {
print_r($result);
}
results in:
Array
(
[german] => 0.407037037037
[dutch] => 0.288065843621
[english] => 0.283333333333
[danish] => 0.234526748971
)
I know this is an old post, but here is what I developed after not finding any viable solution.
- other suggestions are all too heavy and too cumbersome for my situation
- I support a finite number of languages on my website (at the moment two: 'en' and 'de' - but solution is generalised for more).
- I need a plausible guess about the language of a user-generated string, and I have a fallback (the language setting of the user).
- So I want a solution with minimal false positives - but don't care so much about false negatives.
The solution uses the 20 most common words in a language, counts the occurrences of those in the haystack. Then it just compares the counts of the first and second most counted languages. If the runner-up number is less than 10% of the winner, the winner takes it all.
Code - Any suggestions for speed improvement are more than welcome!
function getTextLanguage($text, $default) {
$supported_languages = array(
'en',
'de',
);
// German word list
// from http://wortschatz.uni-leipzig.de/Papers/top100de.txt
$wordList['de'] = array ('der', 'die', 'und', 'in', 'den', 'von',
'zu', 'das', 'mit', 'sich', 'des', 'auf', 'für', 'ist', 'im',
'dem', 'nicht', 'ein', 'Die', 'eine');
// English word list
// from http://en.wikipedia.org/wiki/Most_common_words_in_English
$wordList['en'] = array ('the', 'be', 'to', 'of', 'and', 'a', 'in',
'that', 'have', 'I', 'it', 'for', 'not', 'on', 'with', 'he',
'as', 'you', 'do', 'at');
// French word list
// from https://1000mostcommonwords.com/1000-most-common-french-words/
$wordList['fr'] = array ('comme', 'que', 'tait', 'pour', 'sur', 'sont', 'avec',
'tre', 'un', 'ce', 'par', 'mais', 'que', 'est',
'il', 'eu', 'la', 'et', 'dans', 'mot');
// Spanish word list
// from https://spanishforyourjob.com/commonwords/
$wordList['es'] = array ('que', 'no', 'a', 'la', 'el', 'es', 'y',
'en', 'lo', 'un', 'por', 'qu', 'si', 'una',
'los', 'con', 'para', 'est', 'eso', 'las');
// clean out the input string - note we don't have any non-ASCII
// characters in the word lists... change this if it is not the
// case in your language wordlists!
$text = preg_replace("/[^A-Za-z]/", ' ', $text);
// count the occurrences of the most frequent words
foreach ($supported_languages as $language) {
$counter[$language]=0;
}
for ($i = 0; $i < 20; $i++) {
foreach ($supported_languages as $language) {
$counter[$language] = $counter[$language] +
// I believe this is way faster than fancy RegEx solutions
substr_count($text, ' ' .$wordList[$language][$i] . ' ');;
}
}
// get max counter value
// from http://stackoverflow.com/a/1461363
$max = max($counter);
$maxs = array_keys($counter, $max);
// if there are two winners - fall back to default!
if (count($maxs) == 1) {
$winner = $maxs[0];
$second = 0;
// get runner-up (second place)
foreach ($supported_languages as $language) {
if ($language <> $winner) {
if ($counter[$language]>$second) {
$second = $counter[$language];
}
}
}
// apply arbitrary threshold of 10%
if (($second / $max) < 0.1) {
return $winner;
}
}
return $default;
}