Internationalization in PHP 5.3

July 6, 2009

PHP 5.3 was recently released, and one of the new features in core is the
internationalization extension. It makes supporting a
multitude of languages and local formats much easier than before,
without having to learn all the tiny details of local formats
and rules.

This extension also provides the same functionality through the
PECL module for PHP 5.2.

The extension is based on the ICU library (http://site.icu-project.org/)
provided by IBM.

The Problems

The most frequently encountered problem when bringing an
application to non-English users is not necessarily translating the
text displayed by the application, but doing things like:

  • Sorting textual data according to local rules.
  • Displaying numbers. Characters used to represent number
    properties (sign, decimal point, thousand separator) vary
    widely.
  • Date and time formatting, whether using a different calendar or
    using a local format to represent a base calendar
  • Displaying time in the local timezone, and dealing with users
    in multiple timezones.
  • Representing money values and currencies.
  • Displaying ordinal and cardinal numbers, i.e. numbers
    representing order (1st, 2nd) and quantity (1 error, 2
    errors).
  • Rendering parameterized strings — where the values are inserted
    in pre-existing templates, such as printf() — according to
    local grammar rules.
  • Breaking text into letters, words and sentences according to
    local rules.

There are, of course, many other things to consider, but the
problem areas listed above are those most frequently encountered by
Web developers.

Note that none of these issues are related to the binary
representation of text. They exist independently of the problem of
handling various local encodings, Unicode texts and their
representation. In this article, binary representation will almost
never be mentioned, but most of the functions in PHP 5 assume UTF-8
input and will produce UTF-8 output.

The Internationalization Extension

The intl extension API is both procedural and object oriented,
so you can write your code either way (not unlike the API of ext/mysqli, http://php.net/mysqli):

<?php 
// OO
$collator = new Collator("de_DE");
$collator->compare("string#1", "string#2");

or:

<?php 
// procedural
$coll   = collator_create('de_DE');
$result = collator_compare($coll, "string#1", "string#2");

Both notations are functionally identical. In the rest of the article
we will use only one variant, but both variants exist for almost all
functions; please see the docs for more details.

Since the scope of the extension includes a few functionally
independent parts, intl has a number of independent modules,
the idea being that new modules can be added to the extension at a
later date. Each module is represented by a class in OO notation,
and a group of functions in procedural notation.

So far the following modules have been implemented:

  • Locale — deals with breaking locale data
    into components, assembling a locale string from components and
    displaying the names of countries, languages etc in a specified
    locale.
  • Collator — a means of comparing and
    sorting strings according to local rules.
  • Number formatter — allows you
    to format numbers in a variety of ways, and to parse textual
    representations of numbers.
  • Date formatter — allows you to
    format dates and to parse textual representations of dates.
  • Message formatter — allows you
    to compose messages from parameterized strings while formatting the
    data inside according to local rules and allowing choices dependent
    on the actual parameter value.
  • Normalizer — a means of bringing a
    Unicode string to a standard, unambiguous representation.
  • Grapheme module – handles parsing a
    string into a set of graphemes.
  • IDN – handles the internationalized domain name
    format.

Now let’s look at the intl API in more detail.

Common Functions

The intl extension has
a set of common error functions that serve all its modules:

  • intl_get_error_code() — returns the ICU error code from
    the last operation
  • intl_get_error_message() — returns a textual description
    of the last error
  • intl_is_failure(int $err) — checks if a given error code
    is a failure code
  • intl_error_name(int $err) — returns the ICU internal
    name for a given error code

Note that ICU can return WARNING codes, which indicate
neither success nor failure. These represent a state where the
library was able to perform the function, but not entirely in the
way the user intended—for example, by using the fallback
locale:

<?php 
$coll = collator_create('en_RU');
$err_code = intl_get_error_code();
printf("ICU error code %d: %s.\n", $err_code, intl_error_name($err_code));
// ICU error code -128: U_USING_FALLBACK_WARNING

You could get the same information there by calling:

<?php 
$err_code = collator_get_error_code($coll);

or:

<?php 
$err_code = $coll->getErrorCode();

The difference is that the common intl functions always
return error information about the very last operation, regardless
of the object involved, whereas the object-based methods can only
return error information from the last operation concerning that
specific object. The common functions are particularly convenient
when checking failed constructors, since a failed constructor
leaves no object that could be queried:

<?php 
$mf = new MessageFormatter($locale, $msg);
if (!$mf) {
    print "Failed: ".intl_get_error_message()."\n";
}

Not all the intl classes have constructors — some supply
only static methods — and the common error functions can be useful
there, too.
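
As a minimal sketch of that pattern, the static factory MessageFormatter::create() can be combined with the common error functions. The broken pattern string below is made up for illustration, and the exact error name reported is likely to depend on the ICU version in use:

```php
<?php
// Hypothetical broken pattern: the closing brace is missing.
$pattern = "{0,number";

// The static factory returns null on failure instead of an object.
$mf = MessageFormatter::create('en_US', $pattern);

if ($mf === null) {
    // No object exists to query, so use the common error functions.
    $code = intl_get_error_code();
    printf("Failed (%s): %s\n", intl_error_name($code), intl_get_error_message());
}
```

Because the error code is captured straight after the failed call, it is not overwritten by later intl operations.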

Locale

The term ‘locale’ identifies a specific group of
people having a common set of requirements for the representation
of computerized data, whether that be a country, a region or simply
a community speaking a common language. In most of the functions in
the intl extension, locale is represented by a string such
as en_US or fr_CA. Since that string representation
exists, the Locale class does not provide a way to create a
locale object. The Locale class does, however, provide a
number of services that are useful when composing and dissecting
locales, or when displaying data about different locales in a
user-friendly format. All the methods in the Locale class (http://www.php.net/manual/en/class.locale.php)
are static, and there are of course equivalent procedural functions
throughout.

A locale string can be parsed into elements using
Locale::parseLocale():

<?php 
$arr = Locale::parseLocale('sl-Latn-IT-nedis');

This method will split the given locale into its
language, script, region and variant
component parts, so that the resulting array from that particular
string would be:

Array
(
    [language] => sl
    [script] => Latn
    [region] => IT
    [variant0] => NEDIS
)

To compose a locale string from its component parts, the
composeLocale() method can be used:

<?php 
$locstring = Locale::composeLocale($arr);
print_r($locstring);
// sl_Latn_IT_NEDIS

The Locale class defines constants that correspond to the
keys of the locale array: Locale::LANG_TAG,
Locale::EXTLANG_TAG, Locale::SCRIPT_TAG,
Locale::REGION_TAG and so on. See the Locale class
documentation in the PHP manual for the full details. You can also
extract just one element of the locale, by using the convenience
methods getPrimaryLanguage(), getRegion() or
getScript():

<?php 
$lang = Locale::getPrimaryLanguage('zh-Hant');
print_r($lang);
// zh

Extracting unnamed parts of the locale string is possible using
getAllVariants(). For example:

<?php 
$arr = Locale::getAllVariants('sl_IT_NEDIS_ROJAZ_1901');
print_r($arr);

produces:

Array
(
    [0] => NEDIS
    [1] => ROJAZ
    [2] => 1901
)

The family of getDisplay*() methods allows the display of
locale part names in the current, or an arbitrary other,
locale:

<?php 
print "Language is: ".Locale::getDisplayLanguage('sl-Latn-IT-nedis', 'en_US');

would produce:

Language is: Slovenian

The current locale is used when the locale parameter is null or
omitted.

Locales can also have keywords attached to them. These allow
control over various aspects of the locale. For example,
de_DE@currency=EUR;collation=PHONEBOOK represents a German
locale that specifies the Euro as the currency and ‘phonebook’ as
the sorting order (German has a different sorting order for
dictionaries and telephone directories). The Locale method
getKeywords() allows you to access such keywords:

<?php 
$kw = Locale::getKeywords('de_DE@currency=EUR;collation=PHONEBOOK');
print_r($kw);

would result in:

Array
(
    [collation] => PHONEBOOK
    [currency] => EUR
)

Commonly used keywords include calendar, collation
and currency. You can visit the ICU user guide page about the Locale class
(http://userguide.icu-project.org/locale) for more details of this concept.

Given a locale string, the lookup() method offers a way
to select the best match from a range of locales:

<?php 
$arr = array('de-DEVA','de-DE-1996','de','de-De');
$bestloc = Locale::lookup($arr, 'de-DE-1996-x-prv1-prv2', true, 'en_US');

The third argument controls whether the locales are canonicalized
before matching, and the fourth is the fallback locale that will be
used if none of the proposed matches works. This method may be helpful
when adjusting the locales supported by an application to match
those required by an external source, for example a browser.
Similarly, the method filterMatches() checks whether an
existing locale string matches the given locale:

<?php 
$loc = 'de-DE-1996-x-prv1-prv2';
if (Locale::filterMatches($loc, 'de_DE', true)) {
    print "It's German!\n";
}
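
The browser scenario could look roughly like the sketch below. It assumes Locale::lookup() takes the canonicalize flag as its third argument and the fallback as its fourth, as listed in the PHP manual, and it uses Locale::acceptFromHttp(), which is not otherwise covered in this article. The list of supported locales and the header value are invented for illustration:

```php
<?php
// Hypothetical application data: the locales we have translations for.
$supported = array('en', 'de', 'fr');

// Hypothetical Accept-Language header; in a real request this would
// come from $_SERVER['HTTP_ACCEPT_LANGUAGE'].
$header = 'de-DE,de;q=0.9,en;q=0.8';

// Ask ICU which locale the browser prefers most.
$requested = Locale::acceptFromHttp($header);

// Try progressively less specific forms of the requested locale until
// one of the supported locales matches; use 'en' if nothing matches.
$best = Locale::lookup($supported, $requested, true, 'en');

print $best . "\n";
```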

Classes in the intl extension commonly have a
getLocale() method, which returns the locale for which the
object was created. The Locale class defines two constants
that can be passed as arguments to this method:
Locale::ACTUAL_LOCALE and Locale::VALID_LOCALE.
VALID_LOCALE is the most specific locale that was called
(pt_BR rather than pt), and ACTUAL_LOCALE is
the locale whose rules this particular data adheres to. It can
happen that the more specific locale will override some of the
locale rules, but leave others in the domain of the less specific
pt locale. In such a case, the ACTUAL_LOCALE and
VALID_LOCALE values would be the same for some functions,
but different in others.

Finally, the Locale class allows a default locale to be
set. The default can be used by any of the extension classes that
can take Locale::DEFAULT_LOCALE as their locale
parameter. It allows you to set an application-wide default, rather
than repeating a given locale string over and over:

<?php 
Locale::setDefault('en_US');
$coll = new Collator(Locale::DEFAULT_LOCALE);
$default = Locale::getDefault();
print_r($default);
// en_US

Collator

A lot of Web applications display sorted data, thereby enabling
users to more easily navigate through big arrays of information and
find the items they need. In different locales, people may expect
different text orders for sorting. For example, in English,
y goes after x and before z, but in
Lithuanian, y goes between i and k. Also,
accented letters may be treated by some languages as their
unaccented counterparts, and by others as different letters with
their own place in the alphabet. Some accented letters may even be
sorted as two separate letters; the German ä (a with umlaut)
is sorted as the two letters, ae. Casing can also be
different; some languages (like English) put uppercase before
lowercase letters, while others (like Latvian) take the opposite
approach. Even within the same language, text may be sorted
differently in different contexts—as in German, where the sorting
order used in dictionaries and that used in phone books is
different.

The Collator class provides access to the ICU Collator
functionality (see the ICU User Guide page on Collation,
http://userguide.icu-project.org/collation, for details). In order
to create a Collator object, you need to provide a locale string:

<?php 
$coll = new Collator('en_CA');

You can use this object to compare two strings:

<?php 
$res = $coll->compare($s1, $s2);

As is customary in comparison functions, compare()
returns 0 when the strings are equal, -1 if
$s1 is less than $s2 and 1 if $s1 is
greater than $s2. You could use this function alongside
built-in array sorting functions such as usort() (http://php.net/usort), but the Collator class
provides its own sorting function for better performance. The basic
sort function:

<?php 
$coll->sort($array);

will sort an array according to the collation rules associated
with that locale, in much the same way as the regular PHP sort() function (http://php.net/sort) does. It will also accept
the familiar sort flags: Collator::SORT_REGULAR is the
default, and Collator::SORT_NUMERIC and
Collator::SORT_STRING enforce a numeric or string comparison
of elements. When sorting large arrays, sortWithSortKeys() can
provide an advantage:

<?php 
$coll->sortWithSortKeys($large_array);

This method will create a set of ‘sort keys’ (collator-dependent
representations of the sorted data) to allow quick comparison
between elements. Creating sort keys costs time but makes the
ensuing sort much faster, so consider using sortWithSortKeys()
where the data set is large.
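
To make the difference from byte-wise sorting concrete, here is a small sketch. It assumes the script file is saved as UTF-8 and uses the method name sortWithSortKeys() as given in the PHP manual; the word list is arbitrary:

```php
<?php
// This file is assumed to be saved as UTF-8.
$words = array('Zebra', 'Äpfel', 'Apfel');

// Plain sort() compares bytes, so the umlaut lands after "Z".
$bytewise = $words;
sort($bytewise, SORT_STRING);

// The collator applies German rules: "Ä" sorts next to "A".
$coll = new Collator('de_DE');
$collated = $words;
$coll->sort($collated);

// Same ordering, computed through precomputed sort keys.
$keyed = $words;
$coll->sortWithSortKeys($keyed);

print implode(', ', $bytewise) . "\n"; // Apfel, Zebra, Äpfel
print implode(', ', $collated) . "\n"; // Apfel, Äpfel, Zebra
print implode(', ', $keyed) . "\n";    // Apfel, Äpfel, Zebra
```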

The Collator class (http://www.php.net/manual/en/class.collator.php)
also has a set of ICU attributes. The first of these defines
collation STRENGTH. Collation strength determines the
specific characters (including punctuation) and character
properties (case, accents) that will be taken into account during
string comparison. The second defines the
NORMALIZATION_MODE, or the way in which character sequences
are brought to a common form; a third, CASE_FIRST,
determines case ordering. There is a pair of dedicated methods for
collation strength, named getStrength() and
setStrength(). Everything else (see the list of Collator attributes at
http://docs.php.net/manual/en/class.collator.php#intl.collator-constants), including
HIRAGANA_QUATERNARY_MODE (required for JIS sort order), uses
getAttribute() and setAttribute().
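
The effect of collation strength is easiest to see with a comparison that differs only in case and accents. This is an illustrative sketch with arbitrary sample strings, again assuming a UTF-8 source file:

```php
<?php
$coll = new Collator('en_US');

// Default (tertiary) strength: accents and case both matter.
$default = $coll->compare('resume', 'Résumé');

// Primary strength: only the base letters are compared.
$coll->setStrength(Collator::PRIMARY);
$primary = $coll->compare('resume', 'Résumé');

var_dump($default !== 0); // bool(true): the strings differ
var_dump($primary === 0); // bool(true): the strings are treated as equal
```

A primary-strength collator can be handy for search features, where users rarely type accents exactly as stored.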

In order to check whether an operation was successful, the
Collator class supplies getErrorCode() and
getErrorMessage(). It’s worth keeping in mind that
compare() will return FALSE if the
internal UTF-8 to UTF-16 conversion of either input string fails in
PHP 5, for example when umlauts are used in a PHP script encoded in
Latin-1 or similar:

<?php 
$coll = new Collator('lt');
if ($coll->compare('ä', 'ü') === false) {
    print $coll->getErrorMessage();
}
// Error converting first argument to UTF-16: U_INVALID_CHAR_FOUND

Other intl extension classes, such as
NumberFormatter and MessageFormatter, have the same
class methods for displaying errors.

Number Formatter

The NumberFormatter class allows numbers to be formatted
in a variety of ways, and is capable of parsing numbers represented
in a variety of ways, dependent on the locale. To create the
formatter, you need the locale and type of the target format:

<?php 
$nf = new NumberFormatter('en_US', NumberFormatter::CURRENCY);

The following format types are supported; all examples here use
an en_US locale:

  • PATTERN_DECIMAL — formatting is defined by a
    user-supplied pattern describing the rules for placing significant
    digits, separators and additional signs. See the ICU DecimalFormat docs
    (http://icu-project.org/apiref/icu4c/classDecimalFormat.html#_details)
    for the details of the pattern format.
  • DECIMAL — formatted as a regular decimal number, e.g.
    123.45.
  • CURRENCY — formatted as currency according to locale
    rules. e.g. 123.45 becomes $123.45.
  • PERCENT — formatted as a percentage value, e.g.
    0.45 becomes 45%.
  • SCIENTIFIC — formatted in normalized scientific
    notation, e.g. 123.45 becomes 1.2345E2.
  • SPELLOUT — numbers are spelled out according to locale
    rules, e.g. 123.45 becomes one hundred and twenty-three
    point four five.
  • ORDINAL — formatted as an ordinal value according to
    locale rules, e.g. 123 becomes 123rd.
  • DURATION — formatted as time duration according to
    locale rules e.g. 123.45 (seconds) becomes 2:03
    (minutes).
  • PATTERN_RULEBASED — custom formatting as described by a
    user-supplied pattern in the ICU rules format (see the ICU
    RuleBasedNumberFormat docs,
    http://icu-project.org/apiref/icu4c/classRuleBasedNumberFormat.html#_details,
    for details). You probably will never need this, unless
    you want to do something like spell your numbers in Klingon…
    SPELLOUT, ORDINAL and DURATION use predefined
    PATTERN_RULEBASED formats for the majority of locales.
  • DEFAULT_STYLE — alias for DECIMAL.
  • IGNORE — alias for PATTERN_DECIMAL.

If a pattern is needed for a decimal type, it can be passed as
an optional third constructor argument alongside
NumberFormatter::PATTERN_DECIMAL. If a pattern change is
needed for another type — for instance, PERCENT — it can be
set using the setPattern() method. For example, if you
wanted to control the number of significant digits or the default
formatting of your numeric output, you could do so in this way:

<?php 
$nf = new NumberFormatter('en_US', NumberFormatter::PERCENT);
print $nf->format(.456789); // 46%
$nf->setPattern('@@@');
print $nf->format(.456789); // 0.457
$nf->setPattern('@@');
print $nf->format(.456789); // 0.46
$nf->setPattern('@');
print $nf->format(.456789); // 0.5

The corresponding getPattern() method allows you to
inspect the current pattern. This works for all the pattern-based
types, such as DECIMAL, SCIENTIFIC, CURRENCY
and PERCENT.

By default, format() uses the variable type as-is, so
will, for example, format integers as integers and doubles as
doubles. However, you can explicitly specify the type you want in
an optional second parameter:

<?php 
print $nf->format(123.45, NumberFormatter::TYPE_INT32);
// 123

The supported types are TYPE_INT32, TYPE_INT64,
TYPE_DOUBLE and of course TYPE_DEFAULT, which gives
the same result as passing no type specification.

In order to format currency values to suit a different locale
than that currently being used by the application, a more
specialized method exists:

<?php 
$nf = new NumberFormatter('en_GB', NumberFormatter::CURRENCY);
print $nf->formatCurrency(123.45, "USD");
// US$123.45

The second argument here is a 3-letter ISO 4217 currency code
(http://www.iso.org/iso/support/currency_codes_list-1.htm). Note that what is displayed is controlled by two
factors: the currency format in the current locale, and the
specific currency code passed as an argument. This method is
therefore useful only with formatters created as
NumberFormatter::CURRENCY; otherwise it will just work in
the same way as the regular format() method.

Of course, the currency formatting function knows nothing about
exchange rates and so forth, so the numeric value will be
displayed exactly as supplied. Only the currency code can be
changed.

Numeric values can be parsed using the parse()
method:

<?php 
$nf = new NumberFormatter('en_US', NumberFormatter::DECIMAL);
print $nf->parse("123.45", NumberFormatter::TYPE_INT32);
// 123

Here, the second optional argument specifies the expected type.
This parameter supports TYPE_INT32, TYPE_INT64 and
TYPE_DOUBLE, with the default this time set to
TYPE_DOUBLE. A third optional parameter allows you to
specify the position from which to start parsing, and will be set
to the position at which parsing ended at function return:

<?php 
$nf  = new NumberFormatter('en_US', NumberFormatter::DECIMAL);
$pos = 13;
print $nf->parse("When parsing 123.45 this will only print 123", NumberFormatter::TYPE_INT32, $pos)."\n";
print $pos;
// 123
// 19
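
Formatting and parsing mirror each other within a locale. The sketch below formats an arbitrary sample value for two locales and then parses the German text back into a number:

```php
<?php
$value = 1234567.891;

$en = new NumberFormatter('en_US', NumberFormatter::DECIMAL);
$de = new NumberFormatter('de_DE', NumberFormatter::DECIMAL);

$en_text = $en->format($value);
$de_text = $de->format($value);

print $en_text . "\n"; // 1,234,567.891
print $de_text . "\n"; // 1.234.567,891

// The German formatter knows that "." groups digits and "," is the
// decimal separator, so it recovers the original value.
$parsed = $de->parse($de_text);
var_dump($parsed);
```

Parsing user input with the same locale that was used for display avoids misreading "1.234" as a little more than one.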

When it comes to parsing currency values, parseCurrency()
should be used (again, with the formatter being of type
NumberFormatter::CURRENCY). This method returns a double and
sets the second parameter to the assumed currency code. That
assumption is locale-specific:

<?php 
$nf = new NumberFormatter('en_AU', NumberFormatter::CURRENCY);
print $nf->parseCurrency("$123.45", $currency)."\n";
print $currency."\n";
// 123.45
// AUD

By default, a NumberFormatter object takes all the
necessary settings—such as decimal separators, negative/positive
signs, currency and exponent symbols or the number of digits to
display — from the locale. However, these settings can be
controlled individually in each formatter object.

As with the Collator class there are
getAttribute() and setAttribute() methods, this time
to control display attributes such as FORMAT_WIDTH,
PADDING_POSITION, GROUPING_USED (the separator) and
GROUPING_SIZE. There are also getTextAttribute() and
setTextAttribute() methods, which control all the textual
settings: CURRENCY_CODE, POSITIVE_PREFIX and
NEGATIVE_SUFFIX, rule sets and the PADDING_CHARACTER
to be used. We do not want to try your patience by describing all
the attributes in detail, so please refer to the relevant pages of the PHP manual
(http://docs.php.net/manual/en/class.numberformatter.php) for
the full list.
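
As a brief illustration of both kinds of attribute, the following sketch fixes the number of fraction digits, switches off grouping and replaces the minus sign with accounting-style parentheses. The chosen values are arbitrary:

```php
<?php
$nf = new NumberFormatter('en_US', NumberFormatter::DECIMAL);

// Numeric attributes: always show exactly two fraction digits and
// do not insert grouping separators.
$nf->setAttribute(NumberFormatter::MIN_FRACTION_DIGITS, 2);
$nf->setAttribute(NumberFormatter::MAX_FRACTION_DIGITS, 2);
$nf->setAttribute(NumberFormatter::GROUPING_USED, 0);

$plain = $nf->format(1234.5);
print $plain . "\n"; // 1234.50

// Text attributes: wrap negative values in parentheses.
$nf->setTextAttribute(NumberFormatter::NEGATIVE_PREFIX, '(');
$nf->setTextAttribute(NumberFormatter::NEGATIVE_SUFFIX, ')');

$negative = $nf->format(-1234.5);
print $negative . "\n"; // (1234.50)
```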

Date Formatter

The IntlDateFormatter class (http://docs.php.net/manual/en/class.intldateformatter.php)
enables you to easily format dates and times according to the
locale formatting rules. There are two ways to format a date:
pattern-based and locale-based.

The locale-based API allows you to choose from pre-set
locale-dependent date and time formats, with each locale defining
short, long and medium formats for displaying dates and times.
Example:

<?php 
$datefmt = new IntlDateFormatter("de-DE",
                             IntlDateFormatter::LONG,
                             IntlDateFormatter::SHORT,
                             date_default_timezone_get());

which allows the formatting of a given date with long date and
short time formats:

<?php 
print $datefmt->format(time());

The valid types are:

  • SHORT is completely numeric, such as 12/13/52 or
    3:30pm.
  • MEDIUM is longer, such as Jan 12, 1952.
  • LONG is yet longer, such as January 12, 1952 or
    3:30:32pm.
  • FULL is pretty completely specified, such as Tuesday,
    April 12, 1952 AD or 3:30:42pm PST.

The format() function accepts either a timestamp or a
localtime()-style array. Unfortunately, as of now a
DateTime object is not supported directly (you’d have to
extract the timestamp from it), but volunteers are welcome to contribute
support for it.

The other way is to specify the pattern directly:

<?php 
$fmt = new IntlDateFormatter( "de-DE",     
                IntlDateFormatter::FULL,
                IntlDateFormatter::FULL,
                date_default_timezone_get(),
                IntlDateFormatter::GREGORIAN,
                "MM/dd/yyyy");

This will create the formatter with the specified pattern. The whole
list of pattern rules can be found in the ICU Date formatter documentation
(http://icu-project.org/apiref/icu4c/classSimpleDateFormat.html#_details).

The formatting attributes can be examined with the
getDateType(), getTimeType(), getCalendar(),
getPattern() and getTimeZoneID() functions, and changed
with the respective set functions: setCalendar(),
setPattern() and setTimeZoneID().
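
A pattern-based formatter, including the DateTime workaround of extracting a timestamp first, might be used as in this sketch. The fixed UTC time zone and the patterns are choices made purely for the example:

```php
<?php
$fmt = new IntlDateFormatter('en_US',
                IntlDateFormatter::FULL,
                IntlDateFormatter::FULL,
                'UTC',
                IntlDateFormatter::GREGORIAN,
                'yyyy-MM-dd HH:mm');

// A DateTime object is reduced to a timestamp before formatting.
$dt = new DateTime('1970-01-01 00:00:00', new DateTimeZone('UTC'));
$ts = $dt->getTimestamp();

$iso = $fmt->format($ts);
print $iso . "\n"; // 1970-01-01 00:00

// The pattern can be swapped on the existing object.
$fmt->setPattern('dd.MM.yyyy');
$dotted = $fmt->format($ts);
print $dotted . "\n"; // 01.01.1970
```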

The IntlDateFormatter class also allows you to parse date
strings, in much the same way as the parsing capabilities offered
by the NumberFormatter and MessageFormatter
classes:

<?php 
$fmt = new IntlDateFormatter( "de-DE" ,IntlDateFormatter::FULL,
            IntlDateFormatter::FULL,
            date_default_timezone_get(),
            IntlDateFormatter::GREGORIAN  );
echo "Parsed timestamp is ".$fmt->parse("Mittwoch, 31. Dezember 1969 16:00 Uhr GMT-08:00");
// 31 December 1969, 16:00 at GMT-08:00 is the Unix epoch, so a successful parse gives 0

Just as the NumberFormatter parser does, this parser allows
you to pass the parsing position in a string as the second argument, and will
return the resulting position after parsing in the same argument.

This function returns the result as a timestamp. Parsing to a
localtime()-style array is
provided by the localtime() method, which has the same
syntax but returns an array.

By default, if the input does not exactly conform to what the
formatter would output but can still be parsed as a date, it will
be parsed. This is called “lenient” parsing. The setLenient() function
controls the leniency of the parsing, so
setting it to false makes the parser adhere to stricter
rules. isLenient() returns the setting currently in use.

National calendars can be supported by setting the
calendar parameter in the locale string.

Message Formatter

While the formatter classes above are very useful when it comes
to displaying individual data pieces in a localized format, it is
often necessary to format whole phrases that include numeric and
other data. The composition of such a phrase may well be different
for different languages. Further, different quantities may require
different forms of display, for example “no files”, “1 file”, “2
files”. This is usually resolved with something like %d file(s),
which is not the most natural thing. And this won’t
help where the language has more than just the singular and plural
forms to consider: in Russian, for example, quantities of 1, 2 and
5 would require three different words for file.

The MessageFormatter (http://www.php.net/manual/en/class.messageformatter.php)
class allows you to deal with such problems by creating
locale-dependent format strings and inserting localized format
values at runtime. Note that the class does not have the ability to
choose the correct message for the locale! However, given a message
and a target locale, it would format external data into the message
following the localized rules.

A MessageFormatter object is, therefore, created from a
locale and a message:

<?php 
// The patterns are concatenated so that no line break ends up in the output.
$en_fmt = new MessageFormatter("en_US", 
    "{0,number,integer} monkeys on {1,number,integer} "
    . "trees make {2,number} monkeys per tree\n");
$de_fmt = new MessageFormatter("de", 
    "{0,number,integer} Affen über {1,number,integer} "
    . "Bäume um {2,number} Affen pro Baum\n");

As said before, the formatter cannot ensure that the message is
correct for the locale; you still need to do that part yourself. A
functional module dealing with resources, which is planned for
future releases, may be helpful in this task.

Inserting data into the message is achieved using the
format() method. For example:

<?php 
print $en_fmt->format(array(4560, 123, 4560/123));
print $de_fmt->format(array(4560, 123, 4560/123));

would produce:

4,560 monkeys on 123 trees make 37.073 monkeys per tree
4.560 Affen über 123 Bäume um 37,073 Affen pro Baum

Notice that the numeric values are formatted in the way
appropriate to the target locale. The format() method
receives one argument, consisting of an array of the parameters
needed to fill the gaps in the format string. The argument is
specified there as {index,type,extra data}, with everything
beyond the index being optional.

If the type is not specified, it is derived from the
argument. Besides numbers, the following types of data are
supported:

  • time—displays time value, argument should be a timestamp
    integer
  • date—displays date value, argument should be a timestamp
    integer
  • choice—allows a range of formats depending on the values
    of the arguments:
"{0} resulted in {1,choice,0#no errors|1#single error|1<{1, number} errors}"

The choice format also supports more complex conditions;
please refer to the ICU ChoiceFormat documentation
(http://icu-project.org/apiref/icu4c/classChoiceFormat.html#_details) for
the full format description.
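
A choice pattern along the lines of the one above can be exercised as follows. This is only a sketch: the message text is invented, and the same argument is deliberately declared as a plain number in both places so that its type is unambiguous:

```php
<?php
// 0 selects the first branch, 1 the second, anything above 1 the third.
$pattern = "{0,choice,0#no errors|1#one error|1<{0,number} errors}";
$mf = new MessageFormatter('en_US', $pattern);

$lines = array();
foreach (array(0, 1, 5) as $count) {
    $lines[] = $mf->format(array($count));
}

print implode("\n", $lines) . "\n";
// no errors
// one error
// 5 errors
```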

If the format string is to be used only once, there is a quick
formatting method that can be used to avoid the need to create an
object:

<?php 
$num = 22;
print MessageFormatter::formatMessage('en_US', "number: {0, number}", array($num));
// number: 22

This static method is functionally identical to creating a
MessageFormatter and then calling format(), but it
saves a bit of typing and some engine work to create and destroy
the object.

MessageFormatter can also be used for extracting the data
from formatted strings:

<?php 
$mf = new MessageFormatter("en_US", "{0} monkeys on {1} trees");
print_r($mf->parse("12 monkeys on 7 trees"));
/* output:
Array
(
    [0] => 12
    [1] => 7
)
*/

The parsing function returns an array of values parsed from the
string. Numbers are parsed according to the number formatting rules
described in the earlier section about the NumberFormatter
class. Again, there is a shorter form available; the static method
parseMessage() exists for immediate parsing without creating
the object:

<?php 
print_r(MessageFormatter::parseMessage("en_US",
                        "{0} monkeys on {1} trees",
                        "12 monkeys on 7 trees"));

Given an instantiated MessageFormatter object, you can
view and replace the message by using getPattern() and
setPattern():

<?php 
$mf = new MessageFormatter("en_US", "{0} monkeys on {1} trees\n");
print $mf->getPattern();
// {0} monkeys on {1} trees
$mf->setPattern("{0, number} trees hosting {1, number} monkeys\n");
print $mf->format(array(7, 12));
// 7 trees hosting 12 monkeys

This may come in handy when using a variety of different formats in
the same locale, since it saves on the creation and destruction of
objects.

Normalizer

In Unicode, the same complex character can be represented in a
number of ways. For example the letter Å (A with a
ring above) can be represented as the Unicode character
U+00C5, or as a sequence of the letter A and the
Unicode character U+030A (COMBINING RING ABOVE). More
complex characters can have even more variations. While the
displayed result will always be the same regardless of the Unicode
representation of the character, some unique formally defined form
would be optimal when it comes to search, comparison or the use of
keys. Normalization is a process that involves transforming
characters and sequences of characters into a formally defined
underlying representation.

Unicode defines four normalization forms: C, D,
KC and KD. You can find a full description of these
forms in the official Unicode documentation. However,
normalization form C (http://unicode.org/reports/tr15/),
also known as ‘NFC’, is the most commonly used, and also happens
to be the one recommended by W3C.

The Normalizer (http://www.php.net/manual/en/class.normalizer.php)
class in the intl extension comprises just two static
methods, one to normalize a string:

<?php 
$s = Normalizer::normalize("Å", Normalizer::FORM_C);

and one to test whether a string is normalized according to the
given form:

<?php 
if (Normalizer::isNormalized($s, Normalizer::FORM_C)) {
    // ...
}

Note that, as per the Unicode standard, it is safe to repeatedly
normalize a string. Normalizing a string that was already
normalized does not change its data — but is a waste of time, of
course.
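
The two representations of Å described earlier make a convenient test case. The sketch below writes the bytes out explicitly so that it does not depend on how the script file itself is encoded:

```php
<?php
$decomposed = "A\xCC\x8A"; // "A" followed by U+030A COMBINING RING ABOVE
$composed   = "\xC3\x85";  // U+00C5 as a single code point

// Both strings display as the same letter but differ byte for byte.
var_dump($decomposed === $composed); // bool(false)

// Form C composes the pair into the single code point.
$nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($nfc === $composed); // bool(true)

// Form D goes the other way and decomposes it again.
$nfd = Normalizer::normalize($composed, Normalizer::FORM_D);
var_dump($nfd === $decomposed); // bool(true)

var_dump(Normalizer::isNormalized($decomposed, Normalizer::FORM_C)); // bool(false)
```

Normalizing both sides before comparing strings, or before using them as array keys, avoids treating visually identical text as different.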

Graphemes

As we have seen above, what we perceive as a “character” in the
text can be represented by a number of actual Unicode code points.
Most of the string functions in PHP, however, will only allow you to
operate on byte boundaries, so that the string “Å” could be
perceived as two characters or one character depending on
normalization form and other details.

Thus, we have created a number of grapheme versions of
functions, which allow the developer to access strings as sets of
graphemes – i.e. entities that are perceived as characters
in the text (“Å” is a grapheme) regardless of its internal
representation.

These functions mirror regular string functions and are:

  • grapheme_strlen() — Get string length in grapheme units.
  • grapheme_substr() — Return part of a string.
  • grapheme_strstr() — Returns part of haystack string from the
    first occurrence of needle to the end of haystack.
  • grapheme_stristr() — Returns part of haystack string from the
    first occurrence of case-insensitive needle to the end of
    haystack.
  • grapheme_strpos() — Find position (in grapheme units) of first
    occurrence of a string.
  • grapheme_stripos() — Find position (in grapheme units) of first
    occurrence of a case-insensitive string.
  • grapheme_strrpos() — Find position (in grapheme units) of last
    occurrence of a string.
  • grapheme_strripos() — Find position (in grapheme units) of last
    occurrence of a case-insensitive string.

The API of these functions is the same as that of their regular
string counterparts.
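
The difference from the byte-oriented functions can be sketched as follows. The sample string and variable names are only illustrative:

```php
<?php
// "åö" in normalization form D: each letter is a base character plus
// a combining mark, i.e. 3 bytes and 2 code points per grapheme.
$text = "a\xCC\x8A" . "o\xCC\x88";

$bytes     = strlen($text);          // 6, counted in bytes
$graphemes = grapheme_strlen($text); // 2, counted in graphemes

// Offsets and lengths are counted in graphemes, not bytes.
$second   = grapheme_substr($text, 1, 1);        // "o\xCC\x88"
$position = grapheme_strpos($text, "o\xCC\x88"); // 1

var_dump($bytes, $graphemes, urlencode($second), $position);
```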

There is also a function, grapheme_extract(), that extracts a
piece of a given size from a text buffer. This function is very
useful when you need to cut text to a certain length but do not
want to cut in the middle of a character sequence. The size can be
given in graphemes (the default), UTF-8 characters or bytes.
Whatever the unit, the result always contains only whole graphemes.

Example: taking 1 grapheme from the string starting from byte
2.

<?php 
$char_a_ring_nfd = "a\xCC\x8A";  
// 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5) normalization form "D"

$char_o_diaeresis_nfd = "o\xCC\x88"; 
// 'LATIN SMALL LETTER O WITH DIAERESIS' (U+00F6) normalization form "D"

print urlencode(grapheme_extract( $char_a_ring_nfd . $char_o_diaeresis_nfd, 1, GRAPHEME_EXTR_COUNT, 2));

// prints o%CC%88

Note that the size parameter is a maximum (i.e. the function may
return less if there is less data in the string, but will never
return more) and that the starting position is given in bytes, but
will be advanced if it is not on a character boundary.
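
As a further sketch, the following truncates a string to a byte budget without splitting a grapheme, and then walks through the string one grapheme at a time using the optional fifth (by-reference) argument of grapheme_extract(). The sample string and variable names are illustrative:

```php
<?php
// 3 graphemes: "å" (3 bytes), "ö" (3 bytes) and "x" (1 byte), all in form D.
$text = "a\xCC\x8A" . "o\xCC\x88" . "x";

// At most 4 bytes: only the first grapheme fits, so 3 bytes are returned.
$cut = grapheme_extract($text, 4, GRAPHEME_EXTR_MAXBYTES);
echo urlencode($cut), "\n"; // a%CC%8A

// Iterate grapheme by grapheme; $next receives the byte offset
// that follows the extracted piece.
$offset = 0;
$parts  = array();
while ($offset < strlen($text)) {
    $parts[] = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, $offset, $next);
    $offset  = $next;
}
echo count($parts), "\n"; // 3
```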

IDN

IDN functions implement handling of Internationalized Domain
Names (http://en.wikipedia.org/wiki/Internationalized_domain_name).
You can convert a UTF-8 string to an encoded domain
name with idn_to_ascii():

<?php 
echo idn_to_ascii('täst.de'); 
// prints xn--tst-qla.de

The reverse conversion is performed by idn_to_utf8():

<?php 
echo idn_to_utf8('xn--tst-qla.de'); 
// prints täst.de
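
A round trip with basic error handling might look like the sketch below. It assumes a later PHP version than the one this article covers: the third argument and the INTL_IDNA_VARIANT_UTS46 constant are not part of the API described here, and on PHP 5.3 you would simply omit the second and third arguments.

```php
<?php
$unicode = "t\xC3\xA4st.de"; // "täst.de" in UTF-8

// idn_to_ascii() returns false if the name cannot be converted.
$ascii = idn_to_ascii($unicode, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
if ($ascii === false) {
    die("Not a valid domain name\n");
}
echo $ascii, "\n"; // xn--tst-qla.de

// Converting back should give the original UTF-8 string.
$back = idn_to_utf8($ascii, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
var_dump($back === $unicode); // bool(true)
```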

Both functions accept an options bitmask as an optional second
argument.

Future Directions

The modules described above represent the first version of the
implementation of those ICU APIs we felt were the most important.
However, work on the ICU extension is by no means complete, and we
expect to add more functionality that will bring other ICU
capabilities into the hands of the PHP programmer.

We see at least the following directions that may be addressed, and
of course we welcome volunteers to contribute:

ResourceBundle

The ResourceBundle class is the one we referred to
during the discussion about message formatting. Having this module
in place will give PHP programmers access to the ICU resource
bundles. These resource bundles contain texts for multiple locales,
as a single entity. Since this format is already a standard for
existing ICU applications and there are tools for working with it,
giving PHP programmers the option to use ICU resource bundles will
be especially beneficial for heterogeneous environments in which
multiple applications built in different languages co-exist.

The API will allow you to create a locale-dependent resource
bundle:

<?php 
// In the ResourceBundle class as shipped, the locale comes first and the bundle name second.
$res = new ResourceBundle("en_US", "filename");

You could then access strings within that bundle for use in
other areas:

<?php 
$var = new MessageFormatter("en_US", $res->get("myMessage"));

More documentation on ICU resources can be found in the ICU user
guide (http://userguide.icu-project.org/locale/resources).

Other functions

The next parts to implement for internationalization would be:

  • Transliteration – i.e. the ability to represent one language in a
    set of characters from another language (such as spelling out
    a Russian word in English letters).
  • TextIterator – this class, which is implemented in PHP 6, gives
    the ability to iterate over text by character, grapheme (see above),
    word or sentence.
  • StringSearch – this functionality would allow you to find characters
    or substrings in strings, as strstr() etc. do, with the following
    improvements: allowing accented letters to be treated as non-accented
    ones (i.e. ‘Å’ vs. ‘A’) or differently, depending on the language
    context; understanding ligatures (like ‘æ’ or ‘ß’ in German or ‘ch’ in
    Spanish); doing case-insensitive matches that account for all language
    peculiarities; ignoring punctuation if the user asks for it, etc.
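
Until something like StringSearch is available, part of this behaviour can be approximated for whole-string comparison (not substring search) with the existing Collator class. The following is only a sketch under that assumption; the locales and sample strings are illustrative:

```php
<?php
$aRing = "\xC3\x85"; // "Å" (U+00C5) in UTF-8

// At primary strength, accent and case differences are ignored.
$english = new Collator('en_US');
$english->setStrength(Collator::PRIMARY);
var_dump($english->compare($aRing, "A")); // int(0): treated as equal
var_dump($english->compare("a", "A"));    // int(0): case is ignored too

// The same comparison depends on the language context: Swedish
// treats "Å" as a separate letter, so the result is expected to be
// non-zero there even at primary strength.
$swedish = new Collator('sv_SE');
$swedish->setStrength(Collator::PRIMARY);
var_dump($swedish->compare($aRing, "A"));
```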

Conclusion

PHP is one of the most used programming languages on the
Web. As more and more people around the world come to rely on the
Internet for a variety of services and needs, PHP developers cannot
afford to write English-only, non-localized applications any more.
Modern environments require Web applications that are able to
interact with the user in their own language, to adapt to the local
culture and to display data in the way that local expectations
dictate. The purpose of the ICU support described in this article
is to help PHP programmers accomplish these tasks by
implementing the most basic and frequently used
internationalization functions.

From the perspective of the development team, we would like you
to try working with this extension, and to provide your feedback
and any proposals for improvement on the PHP Internationalization
mailing list (php-i18n). You can subscribe to this, and other PHP
mailing lists, through the subscription interface at http://www.php.net/mailing-lists.php.

The project to bring ICU capabilities to PHP was initiated by
developers from Zend Technologies (Stanislav Malyshev), Yahoo! (Ed
Batutis, Addison Phillips, Tex Texin, Kirti Velankar, Andrei
Zmievski) and LiveNation (Dennis Harvey, Vadim Savchuk). The IDN
code was contributed by Pierre Joye.

6 Responses to “Internationalization in PHP 5.3”

  1. neosag Says:

    how this extension will help in doing so(As mentioned in my subject) ..

    As i m having mixed content, how can i apply encoding function without data loss.

    similar problem is discussed in here also,
    http://stackoverflow.com/questions/2669444/how-to-convert-non-latin-based-encoded-text-into-utf-8-or-make-them-coexist-on-s

    please help me.

    Regards,
    sag

  2. _____anonymous_____ Says:

    Great article! Thank you.
    Internationalization support is a great step forward
    for PHP fans.

    I sometimes build websites in Japanese.
    I am always confused though about file encoding.
    UTF-8, Shift_JIS or EUC-JP.
    Assuming that cell phone access is not a priority,
    it would seem now that UTF-8 is a no brainer in
    order to access the benefits of the 5.3 Internationalization
    API.

    But as a multibyte language (like Chinese/Korean/Vietnamese)are
    there any complications in realizing these benefits?

  3. LydiaGreenway Says:

    i read your article and get information arround the zebd developer .
    i think whenever any body read this hopefully those will be happy.
    thanks for giving the information.

  4. ddt Says:

    Good point – in PHP 5.x intl extension supposes all strings (in and out) are UTF-8.

  5. bweirdan Says:

    > The common functions are particularly convenient when checking failed constructors, since a failed constructor leaves no object that could be queried

    Why doesn’t it throw an exception then? From what I know, it’s the only way to really abort the construction process in userland php code – why internal class has to behave so different?

  6. _____anonymous_____ Says:

    It is unclear what encoding should strings passed to the intl functions be