Internationalization (i18n) in PHP/Quercus
This i18n tutorial goes through the process of internationalizing a PHP application.
For most PHP applications, internationalization simply means using gettext to prepare the application for translation. Gettext keeps translations in files separate from the source code: wherever the code contains a user-visible string, it calls a gettext function to look up the translation in those files.
The following tutorial goes through the steps of internationalizing a simple PHP program in Quercus, Caucho Technology's PHP implementation in Java.
Simple PHP Application
Below is a non-internationalized PHP script that outputs a standard English greeting and a date in US format:
    <?php

    echo "Good morning";

    printf("%d/%d/%d", 10, 31, 2006);

    ?>
Our goal is to internationalize this script so that the greeting and date appear in the user's language and format. The date could be formatted for the user's locale more easily by other methods, but it is used here because it highlights a common issue: the order of variables can change in another language.
Preparing translation files
First, the developer needs to extract the strings from the above script into a .po file. This can be done manually, or the downloadable GNU gettext utilities can do the extraction automatically.
To extract all the double-quoted strings from the script, run the command line xgettext utility:
    $ xgettext -a filename.php
This will give a human-readable messages.po file like the one below:
    # SOME DESCRIPTIVE TITLE.
    # Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
    # This file is distributed under the same license as the PACKAGE package.
    # FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
    #
    #, fuzzy
    msgid ""
    msgstr ""
    "Project-Id-Version: PACKAGE VERSION\n"
    "Report-Msgid-Bugs-To: \n"
    "POT-Creation-Date: 2006-09-26 02:10-0700\n"
    "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
    "Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
    "Language-Team: LANGUAGE <LL@li.org>\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=CHARSET\n"
    "Content-Transfer-Encoding: 8bit\n"

    #: i18n.php:3
    msgid "Good morning"
    msgstr ""

    #: i18n.php:5
    msgid "%d/%d/%d"
    msgstr ""
The string right after msgstr will be the string that gettext returns as the translation for that msgid string. Before giving the .po file to the translators, the developer needs to set the CHARSET, or character set, of the original/translated strings. The translators should type their translations in this character set. The most common character set is "UTF-8".
Because Quercus natively supports Unicode, Quercus will decode from this character set into Unicode automatically.
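To make the msgid/msgstr relationship concrete, here is a minimal sketch of reading the pairs out of .po text. The function name parse_po_pairs() is hypothetical and is not part of Quercus or of PHP's gettext extension. It assumes simple entries only and ignores plural forms, message contexts, and obsolete entries.

```php
<?php
// Hypothetical helper (not a Quercus or PHP gettext API):
// collects msgid => msgstr pairs from the text of a .po file.
function parse_po_pairs(string $text): array
{
    $pairs  = [];
    $msgid  = null;  // msgid of the entry being read; null before the first entry
    $msgstr = '';
    $field  = null;  // field that quoted continuation lines belong to

    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if (preg_match('/^(msgid|msgstr)\s+"(.*)"$/', $line, $m)) {
            if ($m[1] === 'msgid') {
                // A new msgid closes the previous entry.
                if ($msgid !== null) {
                    $pairs[$msgid] = $msgstr;
                }
                $msgid  = stripcslashes($m[2]);
                $msgstr = '';
            } else {
                $msgstr = stripcslashes($m[2]);
            }
            $field = $m[1];
        } elseif ($field !== null && preg_match('/^"(.*)"$/', $line, $m)) {
            // A bare quoted line continues the previous msgid or msgstr.
            if ($field === 'msgid') {
                $msgid .= stripcslashes($m[1]);
            } else {
                $msgstr .= stripcslashes($m[1]);
            }
        }
        // Comments ("#...") and blank lines are skipped.
    }
    if ($msgid !== null) {
        $pairs[$msgid] = $msgstr;
    }
    return $pairs;
}
```

The header entry is stored under the empty msgid, which is where the charset declaration lives. An untranslated entry has an empty msgstr, in which case gettext falls back to the original string.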
Usually, the next step would be to create a machine-oriented .mo file from the .po file. However, this step may be skipped because Quercus' gettext library can read both .po and .mo files. It is best to stick with the .po text files because they are human-readable.
Modifying the script
Translated .po file (simplified)
    # Canadian French (fr_CA)
    msgid ""
    msgstr ""
    "Content-Type: text/plain; charset=UTF-8\n"

    #: i18n.php:3
    msgid "Good morning"
    msgstr "Bonjour"

    #: i18n.php:5
    msgid "[_0]/[_1]/[_2]"
    msgstr "[_1]/[_0]/[_2]"

    <?php

    bindtextdomain("messages", "./locale");

    echo _("Good morning");

    echo _("[_0]/[_1]/[_2]", 10, 31, 2006);

    ?>
The most noticeable difference is that printf is no longer used and that "%d/%d/%d" has been changed to "[_0]/[_1]/[_2]". Because plain printf placeholders such as "%d" are not unique and their positions may change in another language, the script does not use printf. Instead, it uses Quercus gettext ordered placeholders: Quercus substitutes "[_0]" with the first parameter after the message string, "[_1]" with the second, and so on. As seen in the .po file, "[_0]/[_1]/[_2]" has been rearranged to "[_1]/[_0]/[_2]" to represent the day/month/year format as opposed to month/day/year.
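The substitution rule can be illustrated with a small sketch in plain PHP. The helper substitute_ordered() is hypothetical and only mimics the behaviour described above. It is not how Quercus implements the feature, and passing extra arguments to _() should be treated as Quercus-specific. The last line shows that standard PHP's printf also accepts numbered arguments such as "%2$d", which allow a similar reordering.

```php
<?php
// Hypothetical helper (not a Quercus API): replaces each "[_N]" in $format
// with the N-th extra argument, counting from zero.
function substitute_ordered(string $format, ...$args): string
{
    return preg_replace_callback(
        '/\[_(\d+)\]/',
        function (array $m) use ($args): string {
            $index = (int) $m[1];
            // Leave the placeholder untouched when no matching argument exists.
            return array_key_exists($index, $args) ? (string) $args[$index] : $m[0];
        },
        $format
    );
}

// Original (US) order: month/day/year.
echo substitute_ordered("[_0]/[_1]/[_2]", 10, 31, 2006), "\n"; // 10/31/2006

// Translated order from the fr_CA file: day/month/year.
echo substitute_ordered("[_1]/[_0]/[_2]", 10, 31, 2006), "\n"; // 31/10/2006

// Numbered printf arguments in standard PHP give the same reordering.
printf('%2$d/%1$d/%3$d', 10, 31, 2006); // 31/10/2006
```

Because every placeholder carries its own index, a translator can move placeholders around freely without any change to the calling code.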
We need to call bindtextdomain() before any gettext functions to specify the base directory of the translation files. The first argument to bindtextdomain() is the domain. The domain is simply the prefix of the translation filenames. The default domain is "messages", but it can be changed with the textdomain() function.
The underscore function is an alias of gettext(). It will always use the default gettext settings. Since we're using the underscore function, we should name all our translation files "messages.po".
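As a sketch of how domains interact with the underscore function, the example below binds a second domain and switches the default. The "errors" domain and the "File not found" string are hypothetical and would need matching errors.po files. The functions are those of PHP's standard gettext API, which Quercus is assumed to mirror here.

```php
<?php
// Both domains share the same base directory in this sketch.
bindtextdomain("messages", "./locale");
bindtextdomain("errors", "./locale"); // hypothetical second domain (errors.po)

// _() and gettext() look strings up in the default domain, "messages".
echo _("Good morning"), "\n";

// dgettext() looks a string up in a named domain without changing the default.
echo dgettext("errors", "File not found"), "\n";

// textdomain() changes the default domain for later _() and gettext() calls.
textdomain("errors");
echo _("File not found"), "\n";
```

If no translation file is found for a string, each call returns the original string unchanged, so the script still produces readable output before any translations exist.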
Organizing translation files
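Translation files should be placed in the following directory hierarchy, where the base directory is the one passed to bindtextdomain():
    base_directory/locale/category/domain.po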
The default category is "LC_MESSAGES". For our example, suppose we have three translations: German, Australian English, and Canadian French. They should be placed in the following directories:
    ./locale/de_DE/LC_MESSAGES/messages.po
    ./locale/en_AU/LC_MESSAGES/messages.po
    ./locale/fr_CA/LC_MESSAGES/messages.po
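The layout above can be expressed as a small path-building sketch. The function translation_path() is hypothetical and exists only to show how the base directory, locale, category, and domain combine. Gettext performs this lookup internally, so application code does not normally need such a helper.

```php
<?php
// Hypothetical helper: builds the path at which a .po translation file
// is expected for a given base directory, locale, domain, and category.
function translation_path(
    string $baseDir,
    string $locale,
    string $domain = "messages",
    string $category = "LC_MESSAGES"
): string {
    return rtrim($baseDir, "/") . "/" . $locale . "/" . $category . "/" . $domain . ".po";
}

// Prints the three paths used in this tutorial.
foreach (["de_DE", "en_AU", "fr_CA"] as $locale) {
    echo translation_path("./locale", $locale), "\n";
}
```

The defaults match the tutorial: the "messages" domain and the "LC_MESSAGES" category.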
The locale is made up of the ISO language code and the ISO country code. Gettext will pick the language depending on the locale of the host machine. If the translation file is not found for this locale, then the original string is returned as-is. The locale can be changed manually, for example, to Canadian French by: