Best practice multi language website
Implementing i18n Without The Performance Hit Using a Pre-Processor as suggested by Thomas Bley
At work, we recently implemented i18n on a couple of our properties, and one of the things we kept struggling with was the performance hit of dealing with on-the-fly translation. Then I discovered this great blog post by Thomas Bley, which inspired the way we now use i18n to handle large traffic loads with minimal performance issues.
Instead of calling functions for every translation operation, which as we know is expensive in PHP, we define our base files with placeholders, then use a pre-processor to cache those files (we store the file modification time to make sure we're serving the latest content at all times).
The Translation Tags
Thomas uses `{tr}` and `{/tr}` tags to define where translations start and end. Because we're using Twig, we don't want to use `{`, to avoid confusion, so we use `[%tr%]` and `[%/tr%]` instead. Basically, it looks like this:
`return [%tr%]formatted_value[%/tr%];`
Note that Thomas suggests using the base English in the file. We don't do this because we don't want to have to modify all of the translation files if we change the value in English.
The INI Files
Then, we create an INI file for each language, in the format `placeholder = translated`:
; lang/fr.ini
formatted_value = number_format($value * Model_Exchange::getEurRate(), 2, ',', ' ') . '€'

; lang/en_gb.ini
formatted_value = '£' . number_format($value * Model_Exchange::getStgRate())

; lang/en_us.ini
formatted_value = '$' . number_format($value)
It would be trivial to let a user modify these inside the CMS: just get the key pairs with a `preg_split` on `\n` or `=`, and make the CMS able to write to the INI files.
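A rough sketch of what that could look like (the helper names and paths here are hypothetical, not our CMS code):

// Hypothetical: split raw INI text into key/value pairs for a CMS edit form.
function parseIniPairs($raw) {
    $pairs = [];
    foreach (preg_split('/\n/', $raw) as $line) {
        $parts = preg_split('/=/', $line, 2);
        if (count($parts) === 2) {
            $pairs[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $pairs;
}

// Hypothetical: write the edited pairs back out to the INI file.
function saveIniFile($path, array $pairs) {
    $out = '';
    foreach ($pairs as $key => $value) {
        $out .= $key.' = '.$value."\n";
    }
    file_put_contents($path, $out, LOCK_EX);
}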
The Pre-Processor Component
Essentially, Thomas suggests using a just-in-time 'compiler' (though, in truth, it's a preprocessor) like the function below to take your translation files and create static PHP files on disk. This way we essentially cache our translated files, instead of calling a translation function for every string in the file:
// This function was written by Thomas Bley, not by me
function translate($file) {
    // the cache key includes the language and the source file's mtime,
    // so editing the source automatically invalidates the cached copy
    $cache_file = 'cache/'.LANG.'_'.basename($file).'_'.filemtime($file).'.php';
    // (re)build translation?
    if (!file_exists($cache_file)) {
        $lang_file = 'lang/'.LANG.'.ini';
        $lang_file_php = 'cache/'.LANG.'_'.filemtime($lang_file).'.php';
        // convert .ini file into .php file
        if (!file_exists($lang_file_php)) {
            file_put_contents($lang_file_php, '<?php $strings='.
                var_export(parse_ini_file($lang_file), true).';', LOCK_EX);
        }
        // translate .php into localized .php file
        $tr = function($match) use (&$lang_file_php) {
            static $strings = null;
            if ($strings === null) require($lang_file_php);
            return isset($strings[$match[1]]) ? $strings[$match[1]] : $match[1];
        };
        // replace all [%tr%]abc[%/tr%] tags with their translated values
        file_put_contents($cache_file, preg_replace_callback(
            '/\[%tr%\](.*?)\[%\/tr%\]/', $tr, file_get_contents($file)), LOCK_EX);
    }
    return $cache_file;
}
Note: I didn't verify that the regex works (I didn't copy it from our company server), but you can see how the operation works.
How to Call It
Again, this example is from Thomas Bley, not from me:
// instead of
require("core/example.php");
echo (new example())->now();
// we write
define('LANG', 'en_us');
require(translate('core/example.php'));
echo (new example())->now();
We store the language in a cookie (or a session variable if we can't set a cookie) and then retrieve it on every request. You could combine this with an optional `$_GET` parameter to override the language, but I don't suggest subdomain-per-language or page-per-language, because it makes it harder to see which pages are popular and reduces the value of inbound links, as they'll be more sparsely spread.
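A minimal sketch of that detection, assuming a `detectLang()` bootstrap helper and a hard-coded list of supported locales (both are illustrative, not our production code):

// Hypothetical bootstrap helper: resolve the language for this request.
function detectLang() {
    $supported = ['en_us', 'en_gb', 'fr']; // assumption: your locale list
    // 1. explicit override via the query string
    if (isset($_GET['lang']) && in_array($_GET['lang'], $supported, true)) {
        setcookie('lang', $_GET['lang'], time() + 31536000, '/');
        return $_GET['lang'];
    }
    // 2. the language remembered in the cookie
    if (isset($_COOKIE['lang']) && in_array($_COOKIE['lang'], $supported, true)) {
        return $_COOKIE['lang'];
    }
    // 3. session fallback for clients that refuse cookies
    if (isset($_SESSION['lang']) && in_array($_SESSION['lang'], $supported, true)) {
        return $_SESSION['lang'];
    }
    return 'en_us'; // site default
}

define('LANG', detectLang()); // instead of the hard-coded define above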
Why use this method?
We like this method of preprocessing for three reasons:
- The huge performance gain from not calling a whole bunch of functions for content which rarely changes (with this system, 100k visitors in French will still only end up running translation replacement once).
- It doesn't add any load to our database, as it uses simple flat-files and is a pure-PHP solution.
- The ability to use PHP expressions within our translations.
Getting Translated Database Content
We just add a column called `language` to the content in our database, then we use an accessor method for the `LANG` constant which we defined earlier, so our SQL calls (using ZF1, sadly) look like this:
// inside a Zend_Db_Table subclass, hence $this->select()
$query = $this->select()->from($this->_name)
    ->where('language = ?', User::getLang())
    ->where('id = ?', $articleId)
    ->limit(1);
Our articles have a compound primary key over `id` and `language`, so article `54` can exist in all languages. Our `LANG` defaults to `en_US` if not specified.
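The accessor itself can be as simple as something like this (a sketch; the real method might validate more):

class User {
    // Hypothetical accessor around the LANG constant defined at bootstrap;
    // falls back to en_US when no language was set.
    public static function getLang() {
        return defined('LANG') ? LANG : 'en_US';
    }
}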
URL Slug Translation
I'd combine two things here: a function in your bootstrap which accepts a `$_GET` parameter for language and overrides the cookie variable, and routing which accepts multiple slugs. Then you can do something like this in your routing:
"/wilkommen" => "/welcome/lang/de"
... etc ...
These could be stored in a flat file which could easily be written from your admin panel. JSON or XML may provide a good structure for supporting them.
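For instance, with JSON (the `config/slugs.json` path and structure are just an assumption for illustration):

// config/slugs.json might contain:
// { "wilkommen": { "target": "welcome", "lang": "de" } }
$slugs = json_decode(file_get_contents('config/slugs.json'), true);
$requested = trim(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH), '/');
if (isset($slugs[$requested])) {
    // rewrite to the canonical route, carrying the language along
    $route = '/'.$slugs[$requested]['target'].'/lang/'.$slugs[$requested]['lang'];
}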
Notes Regarding A Few Other Options
PHP-based On-The-Fly Translation
I can't see that these offer any advantage over pre-processed translations.
Front-end Based Translations
I've long found these interesting, but there are a few caveats. For example, you have to make the entire list of phrases on your website that you plan to translate available to the user, which could be problematic if there are areas of the site you're keeping hidden or haven't given them access to.
You'd also have to assume that all of your users are willing and able to use JavaScript on your site, but from my statistics, around 2.5% of our users run without it (or use NoScript to block our sites from using it).
Database-Driven Translations
PHP's database connectivity speeds are nothing to write home about, and this adds to the already high overhead of calling a function for every phrase to translate. The performance and scalability issues seem overwhelming with this approach.
I suggest you don't reinvent the wheel: use gettext and the ISO language-abbreviation list. Have you seen how i18n/l10n is implemented in popular CMSes or frameworks?
With gettext you get a powerful tool in which many cases are already handled, such as plural forms of numbers. In English you have only two options: singular and plural. But in Russian, for example, there are three forms, and it's not as simple as in English.
Also, many translators already have experience working with gettext.
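A minimal sketch of PHP's gettext extension in use (the locale name and the 'messages' domain are assumptions, and the matching .mo file must already exist under ./locale):

// Assumes ./locale/ru_RU/LC_MESSAGES/messages.mo exists and that the
// ru_RU.utf8 locale is installed on the server.
putenv('LC_ALL=ru_RU.utf8');
setlocale(LC_ALL, 'ru_RU.utf8');
bindtextdomain('messages', './locale');
textdomain('messages');

echo _('Welcome'); // simple string lookup
echo sprintf(ngettext('%d item', '%d items', 3), 3); // plural-aware lookup;
// the Plural-Forms header in the .po file handles Russian's three forms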
Take a look at CakePHP or Drupal. Both are multilingual-enabled: CakePHP as an example of interface localization and Drupal as an example of content translation.
For l10n, hitting the database for every lookup isn't the way to go at all: it would mean tons of queries. The standard approach is to get all the l10n data into memory at an early stage (or during the first call to the l10n function, if you prefer lazy loading). That can mean reading everything from a .po file or from the DB in one go, and then just reading the requested strings from an array.
If you need to implement an online tool for translating the interface, you can keep all that data in the DB, but then still save everything to a file to work with. To reduce the amount of data in memory, you can split your translated messages/strings into groups and load only the groups you need, where possible.
So you are totally right in your #3, with one exception: it is usually one big file, not a per-controller file or the like, because opening a single file is best for performance. You probably know that some high-load web apps compile all their PHP code into one file to avoid file operations when include/require is called.
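A sketch of that grouped lazy-loading pattern (the file layout and the `t()` helper are assumptions for illustration):

// Hypothetical lazy loader: pull a whole translation group into memory once,
// then serve every later lookup for that group straight from the array.
function t($group, $key) {
    static $cache = [];
    if (!isset($cache[$group])) {
        // e.g. i18n/ru/menu.php returns ['home' => 'Главная', ...]
        $cache[$group] = require 'i18n/'.LANG.'/'.$group.'.php';
    }
    return isset($cache[$group][$key]) ? $cache[$group][$key] : $key;
}

echo t('menu', 'home');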
About URLs: Google indirectly suggests using translation to clearly indicate French content: http://example.ca/fr/vélo-de-montagne.html
I also think you need to redirect users to the default language prefix, e.g. http://example.com/about-us would redirect to http://example.com/en/about-us. But if your site uses only one language, you don't need prefixes at all.
Check out:
http://www.audiomicro.com/trailer-hit-impact-psychodrama-sound-effects-836925
http://nl.audiomicro.com/aanhangwagen-hit-effect-psychodrama-geluidseffecten-836925
http://de.audiomicro.com/anhanger-hit-auswirkungen-psychodrama-sound-effekte-836925
Translating content is a more difficult task. I think there will be some differences between types of content, e.g. articles, menu items, etc. But in #4 you're on the right track. Take a look at Drupal for more ideas: it has a clear enough DB schema and a good enough interface for translating. You create an article and select a language for it, and you can later translate it into other languages.
I don't think URL slugs are a problem. You can just create a separate table for slugs, and that would be the right decision. With the right indexes, it isn't a problem to query that table even with a huge amount of data. And it isn't full-text search but string matching, if you use the varchar data type for the slug; you can have an index on that field too.
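A sketch of what such a slug table and lookup might look like (the schema and variable names are purely illustrative, using PDO):

// Illustrative schema:
//   CREATE TABLE slugs (
//     slug       VARCHAR(255) NOT NULL,
//     language   CHAR(5)      NOT NULL,
//     article_id INT          NOT NULL,
//     PRIMARY KEY (slug, language)  -- exact string match, fully indexed
//   );
$stmt = $pdo->prepare(
    'SELECT article_id FROM slugs WHERE slug = ? AND language = ? LIMIT 1');
$stmt->execute([$requestedSlug, $lang]);
$articleId = $stmt->fetchColumn();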
Topic's premise
There are three distinct aspects in a multilingual site:
- interface translation
- content
- url routing
While they are all interconnected in different ways, from a CMS point of view they are managed using different UI elements and stored differently. You seem to be confident in your implementation and understanding of the first two. The question was about the latter aspect: "URL translation? Should we do this or not? And in what way?"
What the URL can be made of?
A very important thing: don't get fancy with IDN. Instead, favor transliteration (also: transcription and romanization). While at first glance IDN seems a viable option for international URLs, it actually does not work as advertised, for two reasons:
- some browsers will turn non-ASCII chars like `'ч'` or `'ž'` into `'%D1%87'` and `'%C5%BE'`
- if the user has custom themes, the theme's font is very likely to be missing the symbols for those letters
I actually tried the IDN approach a few years ago in a Yii-based project (horrible framework, IMHO) and encountered both of the above-mentioned problems before scrapping that solution. Also, I suspect that it might be an attack vector.
Available options ... as I see them.
Basically you have two choices, which could be abstracted as:
- `http://site.tld/[:query]`: where `[:query]` determines both language and content choice
- `http://site.tld/[:language]/[:query]`: where the `[:language]` part of the URL defines the choice of language and `[:query]` is used only to identify the content
Query is Α and Ω ..
Let's say you pick `http://site.tld/[:query]`.
In that case you have one primary source of language, the content of the `[:query]` segment, and two additional sources:
- the value of `$_COOKIE['lang']` for that particular browser
- the list of languages in the HTTP Accept-Language (1), (2) header
First, you need to match the query against one of the defined routing patterns (if your pick is Laravel, then read here). On a successful pattern match, you then need to find the language.
You would have to go through all the segments of the pattern, find the potential translations for each of those segments, and determine which language was used. The two additional sources (cookie and header) would be used to resolve routing conflicts when (not "if") they arise.
Take for example: `http://site.tld/blog/novinka`.
That's a transliteration of "блог, новинка", which in English means approximately "blog", "latest".
As you can already notice, in Russian "блог" will be transliterated as "blog". Which means that for the first part of `[:query]` you will (in the best-case scenario) end up with `['en', 'ru']` as the list of possible languages. Then you take the next segment, "novinka". That might have only one language on the list of possibilities: `['ru']`.
When the list has one item, you have successfully found the language.
But if you end up with two possibilities (example: Russian and Ukrainian) or more, or with zero possibilities, as the case may be, you will have to use the cookie and/or header to find the correct option.
And if all else fails, you pick the site's default language.
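A rough sketch of that resolution logic (the `languagesForSegment()` helper is assumed to exist and return every language in which a given segment has a translation):

function resolveLanguage(array $segments, array $supported, $default) {
    $candidates = $supported;
    foreach ($segments as $segment) {
        // keep only the languages that can translate every segment seen so far
        $candidates = array_intersect($candidates, languagesForSegment($segment));
    }
    if (count($candidates) === 1) {
        return reset($candidates); // unambiguous: exactly one language fits
    }
    // the cookie breaks ties (or fills in when no candidate was found)
    if (isset($_COOKIE['lang']) && in_array($_COOKIE['lang'], $supported, true)) {
        return $_COOKIE['lang'];
    }
    // Accept-Language header as the last tie-breaker (very simplified parsing)
    if (isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
        foreach (explode(',', $_SERVER['HTTP_ACCEPT_LANGUAGE']) as $part) {
            $lang = substr(trim($part), 0, 2);
            if (in_array($lang, $supported, true)) {
                return $lang;
            }
        }
    }
    return $default; // all else failed: the site's default language
}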
Language as parameter
The alternative is to use a URL that can be defined as `http://site.tld/[:language]/[:query]`. In this case, when translating the query, you do not need to guess the language, because at that point you already know which one to use.
There is also a secondary source of language: the cookie value. But here there is no point in messing with the Accept-Language header, because you are not dealing with an unknown number of possible languages in the case of a "cold start" (when a user opens the site with a custom query for the first time).
Instead you have three simple, prioritized options:
- if the `[:language]` segment is set, use it
- if `$_COOKIE['lang']` is set, use it
- use the default language
When you have the language, you simply attempt to translate the query, and if the translation fails, use the "default value" for that particular segment (based on the routing results).
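A minimal sketch of that priority chain (the names are illustrative):

function pickLanguage($urlLanguageSegment, $default = 'en') {
    if ($urlLanguageSegment !== null) {
        return $urlLanguageSegment; // 1. explicit [:language] segment in the URL
    }
    if (isset($_COOKIE['lang'])) {
        return $_COOKIE['lang']; // 2. the user's previously remembered choice
    }
    return $default; // 3. fall back to the site default
}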
Isn't there a third option?
Yes, technically you can combine both approaches, but that would complicate the process and would only accommodate people who want to manually change the URL from `http://site.tld/en/news` to `http://site.tld/de/news` and expect the news page to change to German.
But even this case could probably be mitigated using the cookie value (which would contain information about the previous choice of language), to implement it with less magic and hope.
Which approach to use?
As you might have already guessed, I would recommend `http://site.tld/[:language]/[:query]` as the more sensible option.
Also, in a real-world situation you would have a third major part in the URL: the "title", as in the name of a product in an online shop or the headline of an article on a news site.
Example: `http://site.tld/en/news/article/121415/EU-as-global-reserve-currency`
In this case `'/news/article/121415'` would be the query, and `'EU-as-global-reserve-currency'` is the title, purely for SEO purposes.
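Since the title exists purely for SEO, the router can simply ignore it. A pattern along these lines (illustrative, not Laravel-specific; `$path` is assumed to hold the request path) captures the query and discards the rest:

// Illustrative route pattern: language, then query, then an optional SEO
// title segment that is captured by the route but never used for dispatching.
if (preg_match('#^/(?<lang>[a-z]{2})/news/article/(?<id>\d+)(?:/[^/]+)?$#',
        $path, $m)) {
    $language = $m['lang'];     // 'en'
    $articleId = (int)$m['id']; // 121415 (the trailing title is simply ignored)
}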
Can it be done in Laravel?
Kinda, but not by default.
I am not too familiar with it, but from what I have seen, Laravel uses a simple pattern-based routing mechanism. To implement multilingual URLs, you will probably have to extend the core class(es), because multilingual routing needs access to different forms of storage (database, cache, and/or configuration files).
It's routed. What now?
As a result of all this, you would end up with two valuable pieces of information: the current language and the translated segments of the query. These values can then be used to dispatch to the class(es) which will produce the result.
Basically, the following URL: `http://site.tld/ru/blog/novinka` (or the version without `'/ru'`) gets turned into something like
$parameters = [
    'language'  => 'ru',
    'classname' => 'blog',
    'method'    => 'latest',
];
Which you just use for dispatching:
$class = $parameters['classname'];
$instance = new $class;
$instance->{'get'.$parameters['method']}($parameters);
.. or some variation of it, depending on the particular implementation.