PHP 5.3 has recently been released, and one of the new features in core is the internationalization extension. It allows you to support a multitude of languages and local formats much more easily than before, without having to learn all the tiny details of local formats and rules.
This extension also provides the same functionality through the PECL module for PHP 5.2.
The extension is based on the ICU library provided by IBM.
The Problems
The most frequently encountered problem when bringing an application to non-English users is not necessarily translating the text displayed by the application, but doing things like:
- Sorting textual data according to local rules.
- Displaying numbers. Characters used to represent number properties (sign, decimal point, thousand separator) vary widely.
- Date and time formatting, whether using a different calendar or using a local format to represent a base calendar
- Displaying time in the local timezone, and dealing with users in multiple timezones.
- Representing money values and currencies.
- Displaying ordinal and cardinal numbers, i.e. numbers representing order (1st, 2nd) and quantity (1 error, 2 errors).
- Rendering parameterized strings — where the values are inserted in pre-existing templates, such as printf() — according to local grammar rules.
- Breaking text into letters, words and sentences according to local rules.
There are, of course, many other things to consider, but the problem areas listed above are those most frequently encountered by Web developers.
Note that none of these issues are related to the binary representation of text. They exist independently of the problem of handling various local encodings, Unicode texts and their representation. In this article, binary representation will almost never be mentioned, but most of the functions in PHP 5 assume UTF-8 input and will produce UTF-8 output.
The Internationalization Extension
The intl extension API is both procedural and object oriented, so you can write your code either way (not unlike the API of ext/mysqli):
<?php
// OO
$collator = new Collator("de_DE");
$collator->compare("string#1", "string#2");
or:
<?php
// procedural
$coll = collator_create('de_DE');
$result = collator_compare($coll, "string#1", "string#2");
Both notations are functionally identical. Further in the article we will be using only one variant, but for almost all functions both variants exist - please see the docs for more details.
Since the scope of the extension includes a few functionally independent parts, intl has a number of independent modules, the idea being that new modules can be added to the extension at a later date. Each module is represented by a class in OO notation, and a group of functions in procedural notation.
So far the following modules have been implemented:
- Locale — deals with breaking locale data into components, assembling a locale string from components and displaying the names of countries, languages etc in a specified locale.
- Collator — a means of comparing and sorting strings according to local rules.
- Number formatter — allows you to format numbers in a variety of ways, and to parse textual representations of numbers.
- Date formatter — allows you to format dates and to parse textual representations of dates.
- Message formatter — allows you to compose messages from parameterized strings while formatting the data inside according to local rules and allowing choices dependent on the actual parameter value.
- Normalizer — a means of bringing a Unicode string to a standard, unambiguous representation.
- Grapheme module - handles parsing a string into a set of graphemes.
- IDN - handles the internationalized domain name format.
Now let's look at the intl API in more detail.
Common Functions
The intl extension has a set of common error functions that serve all its modules:
- intl_get_error_code() — returns the ICU error code from the last operation
- intl_get_error_message() — returns a textual description of the last error
- intl_is_failure(int $err) — checks if a given error code is a failure code
- intl_error_name(int $err) — returns the ICU internal name for a given error code
Note that ICU can return WARNING codes, which indicate neither success nor failure. These represent a state where the library was able to perform the function, but not entirely in the way the user intended—for example, by using the fallback locale:
<?php
$coll = collator_create('en_RU');
$err_code = intl_get_error_code();
printf("ICU error code %d: %s.\n", $err_code, intl_error_name($err_code));
// ICU error code -128: U_USING_FALLBACK_WARNING
You could get the same information there by calling:
<?php $err_code = collator_get_error_code($coll);
or:
<?php $err_code = $coll->getErrorCode();
The difference is that the common intl functions always return error information about the very last operation, regardless of the object involved, whereas the object-based methods can only return error information from the last operation concerning that specific object. The common functions are particularly convenient when checking failed constructors, since a failed constructor leaves no object that could be queried:
<?php
$mf = new MessageFormatter($locale, $msg);
if (!$mf) {
    print "Failed: ".intl_get_error_message()."\n";
}
Not all the intl classes have constructors — some supply only static methods — and the common error functions can be useful there, too.
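The three possible outcomes (success, warning, failure) can be told apart using the common functions alone. The following is a minimal sketch: describe_icu_status() is a hypothetical helper name, not part of the extension, and it relies on the global U_* error constants that intl defines.

```php
<?php
// Hypothetical helper (not part of intl): classify an ICU status code.
function describe_icu_status($code)
{
    if (intl_is_failure($code)) {
        $kind = 'failure';
    } elseif ($code != U_ZERO_ERROR) {
        // Non-zero codes that are not failures are warnings.
        $kind = 'warning';
    } else {
        $kind = 'success';
    }
    return intl_error_name($code) . ' (' . $kind . ')';
}

print describe_icu_status(U_ZERO_ERROR) . "\n";             // U_ZERO_ERROR (success)
print describe_icu_status(U_USING_FALLBACK_WARNING) . "\n"; // U_USING_FALLBACK_WARNING (warning)
print describe_icu_status(U_ILLEGAL_ARGUMENT_ERROR) . "\n"; // U_ILLEGAL_ARGUMENT_ERROR (failure)
```

Treating warnings as a separate case lets an application log a fallback without aborting the operation.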
Locale
The term 'locale' identifies a specific group of people having a common set of requirements for the representation of computerized data, whether that be a country, a region or simply a community speaking a common language. In most of the functions in the intl extension, locale is represented by a string such as en_US or fr_CA. Since that string representation exists, the Locale class does not provide a way to create a locale object. The Locale class does, however, provide a number of services that are useful when composing and dissecting locales, or when displaying data about different locales in a user-friendly format. All the methods in the Locale class are static, and there are of course equivalent procedural functions throughout.
A locale string can be parsed into elements using Locale::parseLocale():
<?php $arr = Locale::parseLocale('sl-Latn-IT-nedis');
This method will split the given locale into its language, script, region and variant component parts, so that the resulting array from that particular string would be:
Array
(
[language] => sl
[script] => Latn
[region] => IT
[variant0] => NEDIS
)
To compose a locale string from its component parts, the composeLocale() method can be used:
<?php $locstring = Locale::composeLocale($arr); print_r($locstring); // sl_Latn_IT_NEDIS
The Locale class defines constants that correspond to the keys of the locale array; Locale::LANG_TAG, Locale::EXTLANG_TAG, Locale::SCRIPT_TAG, Locale::REGION_TAG and so on. See the Locale class documentation in the PHP manual for the full details. You can also extract just one element of the locale, by using the convenience methods getPrimaryLanguage(), getRegion() or getScript():
<?php $lang = Locale::getPrimaryLanguage('zh-Hant'); print_r($lang); // zh
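As a small illustration, the convenience methods can be applied to the locale string used earlier. This is only a sketch; the variable names are ours, and the exact empty value returned for a missing component (empty string or null) should be checked against your PHP version.

```php
<?php
$locale = 'sl-Latn-IT-nedis';

$language = Locale::getPrimaryLanguage($locale); // sl
$script   = Locale::getScript($locale);          // Latn
$region   = Locale::getRegion($locale);          // IT

// A component that is absent from the locale comes back as an empty value.
$missing = Locale::getRegion('zh-Hant');

print "$language / $script / $region\n";
print empty($missing) ? "zh-Hant has no region\n" : "region: $missing\n";
```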
Extracting unnamed parts of the locale string is possible using getAllVariants(). For example:
<?php $arr = Locale::getAllVariants('sl_IT_NEDIS_ROJAZ_1901'); print_r($arr);
produces:
Array
(
[0] => NEDIS
[1] => ROJAZ
[2] => 1901
)
The family of getDisplay*() methods allows the display of locale part names in the current, or an arbitrary other, locale:
<?php print "Language is: ".Locale::getDisplayLanguage('sl-Latn-IT-nedis', 'en_US');
would produce:
Language is: Slovenian
The current default locale is used when the locale parameter is null or omitted.
Locales can also have keywords attached to them. These allow control over various aspects of the locale. For example, de_DE@currency=EUR;collation=PHONEBOOK represents a German locale that specifies the Euro as the currency and 'phonebook' as the sorting order (German has a different sorting order for dictionaries and telephone directories). The Locale method getKeywords() allows you to access such keywords:
<?php $kw = Locale::getKeywords('de_DE@currency=EUR;collation=PHONEBOOK'); print_r($kw);
would result in:
Array
(
[collation] => PHONEBOOK
[currency] => EUR
)
Commonly used keywords include calendar, collation and currency. You can visit the ICU user guide page about the Locale class for more details of this concept.
Given a locale string, the lookup() method offers a way to select the best match from a range of locales:
<?php
$arr = array('de-DEVA', 'de-DE-1996', 'de', 'de-De');
$bestloc = Locale::lookup($arr, 'de-DE-1996-x-prv1-prv2', true, 'en_US');
The third argument controls whether the result is canonicalized, and the fourth is the fallback locale that will be used if none of the proposed matches works. This method may be helpful when adjusting the locales supported by an application to match those required by an external source, for example a browser. Similarly, the method filterMatches() checks whether an existing locale string matches the given locale:
<?php
$loc = 'de-DE-1996-x-prv1-prv2';
if (Locale::filterMatches($loc, 'de_DE', true)) {
    print "It's German!\n";
}
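To show how lookup() might be used when choosing among an application's supported locales, here is a minimal sketch. pick_locale() is a hypothetical helper, and the call assumes the signature given in the PHP manual, where the third argument is the canonicalize flag and the fourth is the fallback.

```php
<?php
// Hypothetical helper: pick the best supported locale for a requested tag.
function pick_locale(array $supported, $requested, $fallback)
{
    // Arguments: candidate list, requested tag, canonicalize flag, fallback.
    return Locale::lookup($supported, $requested, false, $fallback);
}

$supported = array('de', 'fr', 'en');

$german  = pick_locale($supported, 'de-DE', 'en'); // de
$unknown = pick_locale($supported, 'ja-JP', 'en'); // en (the fallback)

print "$german\n$unknown\n";
```

The requested tag is shortened step by step until one of the candidates matches, which is why de-DE resolves to de.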
Classes in the intl extension commonly have a getLocale() method, which returns the locale for which the object was created. The Locale class defines two constants that can be passed as arguments to this method: Locale::ACTUAL_LOCALE and Locale::VALID_LOCALE. VALID_LOCALE is the most specific locale that ICU supports for the locale that was requested (pt_BR rather than pt), and ACTUAL_LOCALE is the locale whose rules this particular data adheres to. It can happen that the more specific locale will override some of the locale rules, but leave others in the domain of the less specific pt locale. In such a case, the ACTUAL_LOCALE and VALID_LOCALE values would be the same for some functions, but different in others.
Finally, the Locale class allows a default locale to be set. The default can be used by any of the extension classes that can take Locale::DEFAULT_LOCALE as their locale parameter. It allows you to set an application-wide default, rather than repeating a given locale string over and over:
<?php
Locale::setDefault('en_US');
$coll = new Collator(Locale::DEFAULT_LOCALE);
$default = Locale::getDefault();
print_r($default); // en_US
Collator
A lot of Web applications display sorted data, thereby enabling users to more easily navigate through big arrays of information and find the items they need. In different locales, people may expect different text orders for sorting. For example, in English, y goes after x and before z, but in Lithuanian, y goes between i and k. Also, accented letters may be treated by some languages as their unaccented counterparts, and by others as different letters with their own place in the alphabet. Some accented letters may even be sorted as two separate letters; the German ä (a with umlaut) is sorted as the two letters, ae. Casing can also be different; some languages (like English) put uppercase before lowercase letters, while others (like Latvian) take the opposite approach. Even within the same language, text may be sorted differently in different contexts—as in German, where the sorting order used in dictionaries and that used in phone books is different.
The Collator class provides access to the ICU Collator functionality (see the ICU User Guide page on Collation for details). In order to create a Collator object, you need to provide a locale string:
<?php $coll = new Collator('en_CA');
You can use this object to compare two strings:
<?php $res = $coll->compare($s1, $s2);
As is customary in comparison functions, compare() returns 0 when the strings are equal, -1 where $s1 is less than $s2 and 1 if $s1 is greater than $s2. You could use this function alongside built-in array sorting functions such as usort(), but the Collator class provides its own sorting function for better performance. The basic sort function:
<?php $coll->sort($array);
will sort an array according to the collation rules associated with that locale, in much the same way as the regular PHP sort() function does. It will also accept the familiar sort flags: Collator::SORT_REGULAR is the default, and Collator::SORT_NUMERIC and Collator::SORT_STRING enforce a numeric or string comparison of elements. When sorting large arrays, sortWithSortKeys() can provide an advantage:
<?php $coll->sortWithSortKeys($large_array);
This method will create a set of 'sort keys' (collator-dependent representations of the sorted data) to allow quick comparison between elements. Creating sort keys costs time but makes the ensuing sort much faster, so consider using sortWithSortKeys() where the data set is large.
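The difference between byte-wise and locale-aware ordering is easy to see with a tiny array. The sketch below compares PHP's plain sort() with three collator-based approaches; the sample words are ours, and sortWithSortKeys() is the method name as given in the PHP manual.

```php
<?php
$words = array('banana', 'apple', 'Cherry');

// Byte-wise sort: uppercase ASCII letters sort before lowercase ones.
$bytewise = $words;
sort($bytewise, SORT_STRING);               // Cherry, apple, banana

$coll = new Collator('en_US');

// usort() with compare() as the callback...
$viaUsort = $words;
usort($viaUsort, array($coll, 'compare'));  // apple, banana, Cherry

// ...the collator's own sort()...
$viaCollator = $words;
$coll->sort($viaCollator);

// ...and the sort-key variant all give the same order.
$viaKeys = $words;
$coll->sortWithSortKeys($viaKeys);

print implode(', ', $bytewise) . "\n";
print implode(', ', $viaCollator) . "\n";
```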
The Collator class also has a set of ICU attributes. The first of these defines collation STRENGTH. Collation strength determines the specific characters (including punctuation) and character properties (case, accents) that will be taken into account during string comparison. The second defines the NORMALIZATION_MODE, or the way in which character sequences are brought to a common form; a third, CASE_FIRST, determines case ordering. There is a pair of dedicated methods for collation strength, named getStrength() and setStrength(). Everything else, including HIRAGANA_QUATERNARY_MODE (required for JIS sort order), uses getAttribute() and setAttribute().
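The effect of collation strength can be sketched as follows. The example assumes an en_US collator and writes the accented word with explicit UTF-8 byte escapes so that it does not depend on the encoding of the script file.

```php
<?php
$coll = new Collator('en_US');

$plain    = 'resume';
$cased    = 'Resume';
$accented = "r\xC3\xA9sum\xC3\xA9"; // "résumé" in UTF-8

// Default (tertiary) strength: case and accents make the strings differ.
$tertiaryCase   = $coll->compare($plain, $cased);    // non-zero
$tertiaryAccent = $coll->compare($plain, $accented); // non-zero

// Primary strength: only the base letters are compared.
$coll->setStrength(Collator::PRIMARY);
$primaryCase   = $coll->compare($plain, $cased);     // 0
$primaryAccent = $coll->compare($plain, $accented);  // 0

printf("tertiary: %d %d, primary: %d %d\n",
    $tertiaryCase, $tertiaryAccent, $primaryCase, $primaryAccent);
```

Primary strength is a common choice for search-style comparisons where users do not type accents or capitals consistently.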
In order to check whether an operation was successful, the Collator class supplies getErrorCode() and getErrorMessage(). It's worth keeping in mind that in PHP 5 compare() will return FALSE if the internal UTF-8 to UTF-16 conversion of either input string fails, for example when umlauts are used in a PHP script encoded in Latin-1 or similar:
<?php
$coll = new Collator('lt');
if ($coll->compare('ä', 'ü') === false) {
    print $coll->getErrorMessage();
}
// Error converting first argument to UTF-16: U_INVALID_CHAR_FOUND
Other intl extension classes, such as NumberFormatter and MessageFormatter, have the same class methods for displaying errors.
Number Formatter
The NumberFormatter class allows numbers to be formatted in a variety of ways, and is capable of parsing numbers represented in a variety of ways, dependent on the locale. To create the formatter, you need the locale and type of the target format:
<?php $nf = new NumberFormatter('en_US', NumberFormatter::CURRENCY);
The following format types are supported; all examples here use an en_US locale:
- PATTERN_DECIMAL — formatting is defined by a user-supplied pattern describing the rules for placing significant digits, separators and additional signs. See ICU DecimalFormat docs for the details of the pattern format.
- DECIMAL — formatted as a regular decimal number, e.g. 123.45.
- CURRENCY — formatted as currency according to locale rules. e.g. 123.45 becomes $123.45.
- PERCENT — formatted as a percentage value, e.g. 0.45 becomes 45%.
- SCIENTIFIC — formatted in normalized scientific notation, e.g. 123.45 becomes 1.2345E2.
- SPELLOUT — numbers are spelled out according to locale rules, e.g. 123.45 becomes one hundred and twenty-three point four five.
- ORDINAL — formatted as an ordinal value according to locale rules, e.g. 123 becomes 123rd.
- DURATION — formatted as time duration according to locale rules e.g. 123.45 (seconds) becomes 2:03 (minutes).
- PATTERN_RULEBASED — custom formatting as described by a user-supplied pattern in the ICU rules format (see ICU RuleBasedNumberFormat docs for details). You probably will never need this, unless you want to do something like spell your numbers in Klingon... SPELLOUT, ORDINAL and DURATION use predefined PATTERN_RULEBASED formats for the majority of locales.
- DEFAULT_STYLE — alias for DECIMAL.
- IGNORE — alias for PATTERN_DECIMAL.
If a pattern is needed for a decimal type, it can be passed as an optional third constructor argument alongside NumberFormatter::PATTERN_DECIMAL. If a pattern change is needed for another type — for instance, PERCENT — it can be set using the setPattern() method. For example, if you wanted to control the number of significant digits or the default formatting of your numeric output, you could do so in this way:
<?php
$nf = new NumberFormatter('en_US', NumberFormatter::PERCENT);
print $nf->format(.456789); // 46%
$nf->setPattern('@@@');
print $nf->format(.456789); // 0.457
$nf->setPattern('@@');
print $nf->format(.456789); // 0.46
$nf->setPattern('@');
print $nf->format(.456789); // 0.5
The corresponding getPattern() method allows you to inspect the current pattern. This works for all the pattern-based types, such as DECIMAL, SCIENTIFIC, CURRENCY and PERCENT.
By default, format() uses the variable type as-is, so will, for example, format integers as integers and doubles as doubles. However, you can explicitly specify the type you want in an optional second parameter:
<?php print $nf->format(123.45, NumberFormatter::TYPE_INT32); // 123
The supported types are TYPE_INT32, TYPE_INT64, TYPE_DOUBLE and of course TYPE_DEFAULT, which gives the same result as passing no type specification.
In order to format currency values to suit a different locale than that currently being used by the application, a more specialized method exists:
<?php $nf = new NumberFormatter('en_GB', NumberFormatter::CURRENCY); print $nf->formatCurrency(123.45, "USD"); // US$123.45
The second argument here is a 3-letter ISO 4217 currency code. Note that what is displayed is controlled by two factors: the currency format in the current locale, and the specific currency code passed as an argument. This method is therefore useful only with formatters created as NumberFormatter::CURRENCY; with any other formatter it will just work in the same way as the regular format() method.
Of course, the currency formatting function knows nothing about exchange rates and so forth, so the numeric value will be displayed exactly as supplied. Only the currency code can be changed.
Numeric values can be parsed using the parse() method:
<?php $nf = new NumberFormatter('en_US', NumberFormatter::DECIMAL); print $nf->parse("123.45", NumberFormatter::TYPE_INT32); // 123
Here, the second optional argument specifies the expected type. This parameter supports TYPE_INT32, TYPE_INT64 and TYPE_DOUBLE, with the default this time set to TYPE_DOUBLE. A third optional parameter allows you to specify the position from which to start parsing, and will be set to the position at which parsing ended at function return:
<?php
$nf = new NumberFormatter('en_US', NumberFormatter::DECIMAL);
$pos = 13;
print $nf->parse("When parsing 123.45 this will only print 123", NumberFormatter::TYPE_INT32, $pos)."\n";
print $pos;
// 123
// 19
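Parsing follows the locale's conventions as well, and it can fail. The sketch below assumes a de_DE formatter and shows both a successful parse and one way to detect a failed one; the input strings are ours.

```php
<?php
$nf = new NumberFormatter('de_DE', NumberFormatter::DECIMAL);

// German uses "." for grouping and "," as the decimal separator.
$value = $nf->parse('1.234,56'); // float 1234.56

// Input that is not a number makes parse() return false.
$bad = $nf->parse('keine Zahl');
if ($bad === false) {
    print 'Parse failed: ' . $nf->getErrorMessage() . "\n";
}

print $value . "\n";
```

Comparing with === false matters here, because a successfully parsed 0 is also a falsy value.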
When it comes to parsing currency values, parseCurrency() should be used (again, with the formatter being of type NumberFormatter::CURRENCY). This method returns a double and sets the second parameter to the assumed currency code. That assumption is locale-specific:
<?php
$nf = new NumberFormatter('en_AU', NumberFormatter::CURRENCY);
print $nf->parseCurrency("$123.45", $currency)."\n";
print $currency."\n";
// 123.45
// AUD
By default, a NumberFormatter object takes all the necessary settings (such as decimal separators, negative/positive signs, currency and exponent symbols or the number of digits to display) from the locale. However, these settings can be controlled individually in each formatter object.
As with the Collator class there are getAttribute() and setAttribute() methods, this time to control display attributes such as FORMAT_WIDTH, PADDING_POSITION, GROUPING_USED (the separator) and GROUPING_SIZE. There are also getTextAttribute() and setTextAttribute() methods, which control all the textual settings: CURRENCY_CODE, POSITIVE_PREFIX and NEGATIVE_SUFFIX, rule sets and the PADDING_CHARACTER to be used. We do not want to try your patience by describing all the attributes in detail, so please refer to the relevant pages of the PHP manual for the full list.
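As a brief illustration of these setters, the following sketch adjusts a plain en_US decimal formatter. The attribute constants are those listed in the PHP manual; the sample values and the "(minus) " prefix are our own.

```php
<?php
$nf = new NumberFormatter('en_US', NumberFormatter::DECIMAL);
$default = $nf->format(1234.5);      // 1,234.5

// Always show exactly two fraction digits.
$nf->setAttribute(NumberFormatter::FRACTION_DIGITS, 2);
$twoDigits = $nf->format(1234.5);    // 1,234.50

// Switch the grouping separator off.
$nf->setAttribute(NumberFormatter::GROUPING_USED, 0);
$noGrouping = $nf->format(1234.5);   // 1234.50

// Text attributes take strings, for example a custom negative prefix.
$nf->setTextAttribute(NumberFormatter::NEGATIVE_PREFIX, '(minus) ');
$negative = $nf->format(-5);

print "$default\n$twoDigits\n$noGrouping\n$negative\n";
```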
Date Formatter
The IntlDateFormatter class enables you to easily format dates and times according to the locale formatting rules. There are two ways to format a date: pattern-based and locale-based.
The locale-based API allows you to choose from pre-set locale-dependent date and time formats, with each locale defining short, long and medium formats for displaying dates and times. Example:
<?php $datefmt = new IntlDateFormatter("de-DE", IntlDateFormatter::LONG, IntlDateFormatter::SHORT, date_default_timezone_get());
which allows the formatting of a given date with long date and short time formats:
<?php print $datefmt->format(time());
The valid types are:
- SHORT is completely numeric, such as 12/13/52 or 3:30pm.
- MEDIUM is longer, such as Jan 12, 1952.
- LONG is yet longer, such as January 12, 1952 or 3:30:32pm.
- FULL is pretty completely specified, such as Tuesday, April 12, 1952 AD or 3:30:42pm PST.
The format() function accepts either a timestamp or a localtime()-style array. Unfortunately, DateTime objects are not yet supported directly (you have to extract the timestamp from the object), but volunteers are welcome to contribute support for them.
The other way is to specify the pattern directly:
<?php $fmt = new IntlDateFormatter( "de-DE", IntlDateFormatter::FULL, IntlDateFormatter::FULL, date_default_timezone_get(), IntlDateFormatter::GREGORIAN, "MM/dd/yyyy");
This will create a formatter with the specified pattern. The full list of pattern rules can be found in the ICU date formatter documentation.
The formatting attributes can be examined with the getDateType(), getTimeType(), getCalendar(), getPattern() and getTimeZoneID() functions, and changed with the respective set functions: setCalendar(), setPattern() and setTimeZoneID().
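To make pattern-based formatting concrete, here is a small sketch that uses a fixed timestamp and an explicit UTC time zone so that the result does not depend on the server settings. The two patterns are our own examples.

```php
<?php
// Fixed timestamp and explicit time zone so the result is reproducible.
$timestamp = 0; // 1970-01-01 00:00:00 UTC

$fmt = new IntlDateFormatter(
    'en_US',
    IntlDateFormatter::FULL,
    IntlDateFormatter::FULL,
    'UTC',
    IntlDateFormatter::GREGORIAN,
    'yyyy-MM-dd HH:mm'
);
$iso = $fmt->format($timestamp);   // 1970-01-01 00:00

// The same formatter can be switched to another pattern.
$fmt->setPattern('EEEE, d MMMM yyyy');
$long = $fmt->format($timestamp);  // Thursday, 1 January 1970

print "$iso\n$long\n";
```

Note that the weekday and month names come from the locale, while their order and punctuation come from the pattern.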
The IntlDateFormatter class also allows you to parse date strings, in much the same way as the parsing capabilities offered by the NumberFormatter and MessageFormatter classes:
<?php
$fmt = new IntlDateFormatter(
    "de-DE",
    IntlDateFormatter::FULL,
    IntlDateFormatter::FULL,
    date_default_timezone_get(),
    IntlDateFormatter::GREGORIAN
);
echo "Parsed timestamp is ".$fmt->parse("Mittwoch, 31. Dezember 1969 16:00 Uhr GMT-08:00");
// will print: Parsed timestamp is 0
Just as the NumberFormatter parser does, this parser allows you to pass a starting position in the string as the second argument, and on return that argument holds the position at which parsing ended.
This function returns the result as a timestamp. Parsing to a localtime()-style array is provided by the localtime() method, which has the same syntax but returns an array.
By default, if the input does not exactly conform to what the formatter would output but can still be parsed as a date, it will be parsed. This is called "lenient" parsing. The setLenient() function controls the leniency of the parsing, so setting it to false makes the parser adhere to stricter rules. isLenient() returns the current setting.
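One visible effect of leniency is the handling of out-of-range fields. The sketch below assumes a simple yyyy-MM-dd pattern and a UTC formatter, and feeds it a date that does not exist; the exact behaviour may vary slightly between ICU versions.

```php
<?php
$fmt = new IntlDateFormatter(
    'en_US',
    IntlDateFormatter::FULL,
    IntlDateFormatter::FULL,
    'UTC',
    IntlDateFormatter::GREGORIAN,
    'yyyy-MM-dd'
);

// Lenient (the default): February 30 rolls over into March.
$lenient = $fmt->parse('2009-02-30');
$rolled  = gmdate('Y-m-d', $lenient);

// Strict: the same input is rejected.
$fmt->setLenient(false);
$strict = $fmt->parse('2009-02-30'); // false

print $rolled . "\n";
print ($strict === false) ? "strict parse failed\n" : "strict parse ok\n";
```

Strict parsing is usually the safer choice for validating user input, since silently rolled-over dates are rarely what the user meant.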
National calendars are supported by setting the calendar keyword in the locale string.
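For instance, a Buddhist-calendar formatter might be requested as in the following sketch. It assumes the calendar=buddhist keyword and the IntlDateFormatter::TRADITIONAL calendar constant, which tells the formatter to use the locale's own calendar rather than the Gregorian one.

```php
<?php
$fmt = new IntlDateFormatter(
    'en_US@calendar=buddhist',
    IntlDateFormatter::FULL,
    IntlDateFormatter::FULL,
    'UTC',
    IntlDateFormatter::TRADITIONAL, // use the calendar named in the locale
    'yyyy'
);

// The Buddhist era is 543 years ahead of the Gregorian year.
$year = $fmt->format(0); // 2513 for 1 January 1970

print $year . "\n";
```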
Message Formatter
While the formatter classes above are very useful when it comes to displaying individual data pieces in a localized format, it is often necessary to format whole phrases that include numeric and other data. The composition of such a phrase may well be different for different languages. Further, different quantities may require different forms of display, for example "no files", "1 file", "2 files". This is usually resolved with something like %d file(s), which is not the most natural thing. And this won't help where the language has more than just the singular and plural forms to consider: in Russian, for example, quantities of 1, 2 and 5 would require three different words for file.
The MessageFormatter class allows you to deal with such problems by creating locale-dependent format strings and inserting localized format values at runtime. Note that the class does not have the ability to choose the correct message for the locale! However, given a message and a target locale, it would format external data into the message following the localized rules.
A MessageFormatter object is, therefore, created from a locale and a message:
<?php
$en_fmt = new MessageFormatter("en_US",
    "{0,number,integer} monkeys on {1,number,integer} trees make {2,number} monkeys per tree\n");
$de_fmt = new MessageFormatter("de",
    "{0,number,integer} Affen über {1,number,integer} Bäume um {2,number} Affen pro Baum\n");
As said before, the formatter cannot ensure that the message is correct for the locale — you still need to do that part yourself. A functional module dealing with resources, which is planned for future releases, may be helpful in this task.
Inserting data into the message is achieved using the format() method. For example:
<?php
print $en_fmt->format(array(4560, 123, 4560/123));
print $de_fmt->format(array(4560, 123, 4560/123));
would produce:
4,560 monkeys on 123 trees make 37.073 monkeys per tree
4.560 Affen über 123 Bäume um 37,073 Affen pro Baum
Notice that the numeric values are formatted in the way appropriate to the target locale. The format() method receives one argument, consisting of an array of the parameters needed to fill the gaps in the format string. Each gap is specified in the format string as {index,type,extra data}, with everything beyond the index being optional.
If the type is not specified, it is derived from the argument. Besides numbers, the following types of data are supported:
- time—displays time value, argument should be a timestamp integer
- date—displays date value, argument should be a timestamp integer
- choice—allows a range of formats depending on the values of the arguments:
"{0} resulted in {1,choice,0#no errors|1#single error|1<{1, number} errors}"
The choice format also supports more complex conditions; please refer to the ICU ChoiceFormat documentation for the full format description.
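A choice pattern of this kind can be tried out as in the sketch below. The message text is our own; the same argument index is used both to select the branch and to print the number.

```php
<?php
$pattern = '{0,choice,0#no errors|1#one error|1<{0,number,integer} errors}';
$mf = new MessageFormatter('en_US', $pattern);

$none = $mf->format(array(0)); // no errors
$one  = $mf->format(array(1)); // one error
$many = $mf->format(array(5)); // 5 errors

print "$none\n$one\n$many\n";
```

Each branch starts with a limit: "0#" and "1#" apply from that value upwards, while "1<" applies to values strictly greater than 1.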
If the format string is to be used only once, there is a quick formatting method that can be used to avoid the need to create an object:
<?php $num = 22; print MessageFormatter::formatMessage('en_US', "number: {0, number}", array($num)); // number: 22
This static method is functionally identical to creating a MessageFormatter and then calling format(), but it saves a bit of typing and some engine work to create and destroy the object.
MessageFormatter can also be used for extracting the data from formatted strings:
<?php
$mf = new MessageFormatter("en_US", "{0} monkeys on {1} trees");
print_r($mf->parse("12 monkeys on 7 trees"));
/* output:
Array
(
    [0] => 12
    [1] => 7
)
*/
The parsing function returns an array of values parsed from the string. Numbers are parsed according to the number formatting rules described in the earlier section about the NumberFormatter class. Again, there is a shorter form available; the static method parseMessage() exists for immediate parsing without creating the object:
<?php print_r(MessageFormatter::parseMessage("en_US", "{0} monkeys on {1} trees", "12 monkeys on 7 trees"));
Given an instantiated MessageFormatter object, you can view and replace the message by using getPattern() and setPattern():
<?php
$mf = new MessageFormatter("en_US", "{0} monkeys on {1} trees\n");
print $mf->getPattern(); // {0} monkeys on {1} trees
$mf->setPattern("{0, number} trees hosting {1, number} monkeys\n");
print $mf->format(array(7, 12)); // 7 trees hosting 12 monkeys
This may come in handy when using a variety of different formats in the same locale, since it saves on the creation and destruction of objects.
Normalizer
In Unicode, the same complex character can be represented in a number of ways. For example the letter Å (A with a ring above) can be represented as the Unicode character U+00C5, or as a sequence of the letter A and the Unicode character U+030A (COMBINING RING ABOVE). More complex characters can have even more variations. While the displayed result will always be the same regardless of the Unicode representation of the character, some unique formally defined form would be optimal when it comes to search, comparison or the use of keys. Normalization is a process that involves transforming characters and sequences of characters into a formally defined underlying representation.
Unicode defines four normalization forms: C, D, KC and KD. You can find a full description of these forms in the official Unicode documentation. However, normalization form C (also known as 'NFC') is the most commonly used, and also happens to be the one recommended by W3C.
The Normalizer class in the intl extension comprises just two static methods, one to normalize a string:
<?php $s = Normalizer::normalize("Å", Normalizer::FORM_C);
and one to test whether a string is normalized according to the given form:
<?php
if (Normalizer::isNormalized($s, Normalizer::FORM_C)) {
    // ...
}
Note that, as per the Unicode standard, it is safe to repeatedly normalize a string. Normalizing a string that was already normalized does not change its data — but is a waste of time, of course.
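Why this matters for comparison can be shown with the two representations of Å mentioned above. The sketch below spells both out as UTF-8 byte escapes so that the result does not depend on how the script file itself is encoded.

```php
<?php
$decomposed = "A\xCC\x8A"; // "A" + U+030A COMBINING RING ABOVE, 3 bytes
$composed   = "\xC3\x85";  // U+00C5, 2 bytes

// The byte sequences differ, so a plain comparison fails...
$sameBytes = ($decomposed === $composed);                  // false

// ...but the strings are equal once both are in form C.
$normalized   = Normalizer::normalize($decomposed, Normalizer::FORM_C);
$sameAfterNfc = ($normalized === $composed);               // true

$wasNfc = Normalizer::isNormalized($decomposed, Normalizer::FORM_C); // false
$isNfc  = Normalizer::isNormalized($normalized, Normalizer::FORM_C); // true

var_dump($sameBytes, $sameAfterNfc, $wasNfc, $isNfc);
```

Normalizing once at the point where text enters the application keeps later comparisons and lookups simple.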
Graphemes
As we have seen above, what we perceive as a "character" in the text can be represented by a number of actual Unicode code points. Most string functions in PHP, however, operate only on byte boundaries, so that the string "Å" could be perceived as one character or two, depending on the normalization form and other details.
Thus, we have created grapheme versions of a number of functions, which allow the developer to access strings as sets of graphemes, i.e. entities that are perceived as characters in the text ("Å" is a grapheme) regardless of their internal representation.
These functions mirror regular string functions and are:
- grapheme_strlen() — Get string length in grapheme units.
- grapheme_substr() — Return part of a string.
- grapheme_strstr() — Returns part of haystack string from the first occurrence of needle to the end of haystack.
- grapheme_stristr() — Returns part of haystack string from the first occurrence of case-insensitive needle to the end of haystack.
- grapheme_strpos() — Find position (in grapheme units) of first occurrence of a string.
- grapheme_stripos() — Find position (in grapheme units) of first occurrence of a case-insensitive string.
- grapheme_strrpos() — Find position (in grapheme units) of last occurrence of a string.
- grapheme_strripos() — Find position (in grapheme units) of last occurrence of a case-insensitive string.
The API of these functions is the same as that of their regular string counterparts.
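The difference from the byte-oriented functions can be sketched with a short decomposed string. The sample text is ours and is written with byte escapes to keep the example independent of the file encoding.

```php
<?php
// "å" and "ö" in normalization form D: base letter + combining mark.
$text = "a\xCC\x8Ao\xCC\x88";

$bytes     = strlen($text);          // 6
$graphemes = grapheme_strlen($text); // 2

// Byte-based substr() cuts through the first character...
$brokenCut = substr($text, 0, 1);          // "a" without its ring
// ...while grapheme_substr() keeps it whole.
$wholeCut  = grapheme_substr($text, 0, 1); // "a\xCC\x8A"

print "$bytes bytes, $graphemes graphemes\n";
print urlencode($wholeCut) . "\n";         // a%CC%8A
```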
There is also a function that extracts a part of a given size from a text buffer: grapheme_extract(). This function is very useful when you need to cut text to a certain length but do not want to cut in the middle of a character sequence. The size can be given in graphemes (default), UTF-8 characters or bytes. Whatever the size is, the result will always contain whole graphemes.
Example: taking 1 grapheme from the string starting from byte 2.
<?php
// 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5), normalization form "D"
$char_a_ring_nfd = "a\xCC\x8A";
// 'LATIN SMALL LETTER O WITH DIAERESIS' (U+00F6), normalization form "D"
$char_o_diaeresis_nfd = "o\xCC\x88";
print urlencode(grapheme_extract($char_a_ring_nfd . $char_o_diaeresis_nfd, 1, GRAPHEME_EXTR_COUNT, 2));
// prints o%CC%88
Note that the size parameter is a maximum (i.e. the function may return less if there is less data in the string, but will never return more), and that the starting position is given in bytes but will be advanced if it does not fall on a character boundary.
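A typical use is trimming text to fit a byte-limited field. The following sketch assumes the GRAPHEME_EXTR_MAXBYTES mode and the optional fifth parameter, which receives the byte offset at which extraction stopped; the 4-byte limit is an arbitrary example.

```php
<?php
$text = "a\xCC\x8Ao\xCC\x88"; // 6 bytes, 2 graphemes

// At most 4 bytes: only the first grapheme (3 bytes) fits whole.
$cut = grapheme_extract($text, 4, GRAPHEME_EXTR_MAXBYTES);

// $next receives the byte offset where extraction stopped.
$piece = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, 0, $next);

print urlencode($cut) . "\n"; // a%CC%8A
print $next . "\n";           // 3
```

Passing $next back as the start position of the following call is one way to walk through a string one grapheme at a time.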
IDN
IDN functions implement handling of Internationalized Domain Names. You can convert a UTF-8 string to an encoded domain name with idn_to_ascii():
<?php echo idn_to_ascii('täst.de'); // prints xn--tst-qla.de
The reverse conversion is given by idn_to_utf8():
<?php echo idn_to_utf8('xn--tst-qla.de'); // prints täst.de
The functions accept options bitmask as second optional argument, valid options are:
- IDNA_ALLOW_UNASSIGNED - allows unassigned codepoints in input.
- IDNA_USE_STD3_RULES - check if the input conforms to STD3 ASCII rules, i.e. does not include characters that can not appear in standard domain name.
- IDNA_DEFAULT - by default unassigned codepoints are not allowed and STD3 rules are not checked.
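Putting the two functions and the options argument together, a round trip might look like the sketch below. It assumes the intl extension is loaded and that the script file itself is saved as UTF-8. The failure handling relies on the functions returning false on error, and the variable names are illustrative.

```php
<?php
$unicode = 'täst.de';

// Convert to the ASCII-compatible form; false signals a conversion failure.
$ascii = idn_to_ascii($unicode, IDNA_DEFAULT);
if ($ascii === false) {
    die("Could not encode the domain name\n");
}

// Converting back should give the original name again.
$roundtrip = idn_to_utf8($ascii, IDNA_DEFAULT);

// The same conversion with STD3 checking enabled. Options form a
// bitmask, so several of them can be combined with the | operator.
$strict = idn_to_ascii($unicode, IDNA_USE_STD3_RULES);

echo $ascii, "\n";
echo $roundtrip, "\n";
var_dump($strict);
```

Checking the return value before using it matters most when the domain name comes from user input, where malformed labels are common.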
Future Directions
The modules described above represent the first version of the implementation of those ICU APIs we felt were the most important. However, work on the ICU extension is by no means complete, and we expect to add more functionality that will bring other ICU capabilities into the hands of the PHP programmer.
We see at least the following directions that could be addressed, and of course we welcome volunteers to contribute:
ResourceBundle
The ResourceBundle class is the one we referred to during the discussion about message formatting. Having this module in place will give PHP programmers access to the ICU resource bundles. These resource bundles contain texts for multiple locales, as a single entity. Since this format is already a standard for existing ICU applications and there are tools for working with it, giving PHP programmers the option to use ICU resource bundles will be especially beneficial for heterogeneous environments in which multiple applications built in different languages co-exist.
The API will allow you to create a locale-dependent resource bundle:
<?php $res = new ResourceBundle("filename", "en_US");
You could then access strings within that bundle for use in other areas:
<?php $var = new MessageFormatter("en_US", $res->get("myMessage"));
More documentation on ICU resources can be found in the ICU user guide.
Other functions
The next parts to implement for internationalization would be:
- Transliteration - i.e. the ability to represent one language in the character set of another language (such as spelling out a Russian word in English letters).
- TextIterator - this class, which is implemented in PHP 6, provides the ability to iterate over text by character, grapheme (see above), word or sentence.
- StringSearch - this functionality would allow you to find characters or substrings in strings, as strstr() and similar functions do, with the following improvements: treating accented letters as non-accented ones (i.e. 'Å' vs. 'A') or as different ones, depending on the language context; understanding ligatures (like 'æ', 'ß' in German or 'ch' in Spanish); doing case-insensitive matches that account for all language peculiarities; ignoring punctuation if the user asks for it; and so on.
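The language-dependent treatment of accented letters mentioned for StringSearch can already be observed with the Collator class that the extension provides. The sketch below is not a StringSearch API, only an approximation of the matching rules such a search would apply. It assumes the intl extension is loaded, the script is saved as UTF-8, and the ICU data includes the Swedish locale.

```php
<?php
// Primary strength compares base letters only, ignoring accents and case.
$english = new Collator('en_US');
$english->setStrength(Collator::PRIMARY);

$swedish = new Collator('sv_SE');
$swedish->setStrength(Collator::PRIMARY);

// In English, "Å" is an accented "A", so the two compare as equal (0).
$en_result = $english->compare('Å', 'A');

// In Swedish, "Å" is a separate letter of the alphabet, so the
// comparison reports a difference (a non-zero result).
$sv_result = $swedish->compare('Å', 'A');

var_dump($en_result);
var_dump($sv_result);
```

The same pair of characters therefore matches or does not match depending on the locale, which is the behaviour a locale-aware search needs.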
Conclusion
PHP is one of the most used programming languages on the Web. As more and more people around the world come to rely on the Internet for a variety of services and needs, PHP developers cannot afford to write English-only, non-localized applications any more. Modern environments require Web applications that are able to interact with users in their own language, to adapt to the local culture and to display data in the way that local expectations dictate. The purpose of the ICU support described in this article is to help PHP programmers accomplish these tasks by implementing the most basic and frequently used internationalization functions.
From the perspective of the development team, we would like you to try working with this extension, and to provide your feedback and any proposals for improvement on the PHP Internationalization mailing list (php-i18n). You can subscribe to this, and other PHP mailing lists, through the subscription interface at http://www.php.net/mailing-lists.php.
The project to bring ICU capabilities to PHP was initiated by developers from Zend Technologies (Stanislav Malyshev), Yahoo! (Ed Batutis, Addison Phillips, Tex Texin, Kirti Velankar, Andrei Zmievski) and LiveNation (Dennis Harvey, Vadim Savchuk). The IDN code was contributed by Pierre Joye.


Comments
Why doesn't it throw an exception then? From what I know, it's the only way to really abort the construction process in userland php code - why internal class has to behave so different?
Internationalization support is a great step forward for PHP fans. I sometimes build websites in Japanese. I am always confused though about file encoding: UTF-8, Shift_JIS or EUC-JP. Assuming that cell phone access is not a priority, it would seem now that UTF-8 is a no-brainer in order to access the benefits of the 5.3 Internationalization API. But for a multibyte language (like Chinese/Korean/Vietnamese), are there any complications in realizing these benefits?