PHP DOM XML extension encoding processing

September 1, 2009

Tutorials, Zend Framework

I recently used PHP’s DOM XML extension while working on the
Zend_Search_Lucene HTML highlighting capabilities of Zend Framework
(http://framework.zend.com/), and uncovered some undocumented features and
issues with the extension regarding character encoding. The information in
this article should also apply to other libxml-based DOM implementations,
since PHP’s DOM extension simply wraps that library.

1. Internal document encoding is always UTF-8

The DOMDocument::$encoding property doesn’t affect anything
except the dump operations – e.g., DOMDocument::saveXML().

You can set it in the constructor:


$dom = new DOMDocument('1.0', 'Windows-1251');

or set it later:


$dom = new DOMDocument('1.0');
$dom->encoding = 'Windows-1251';

but the internal document representation encoding is always UTF-8.
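
A minimal sketch of this behavior (the Cyrillic sample string is just an
illustration):

$dom = new DOMDocument('1.0', 'Windows-1251');
$dom->appendChild($dom->createElement('root', 'Привет')); // input is UTF-8, as always

// The declared encoding shows up only in the dump: the XML declaration says
// Windows-1251 and the Cyrillic text is converted to Windows-1251 bytes ...
echo $dom->saveXML();

// ... while the in-memory representation stays UTF-8.
var_dump($dom->documentElement->nodeValue === 'Привет'); // bool(true)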

2. Input data is always treated as UTF-8

The following methods for adding textual content to a document:

  • DOMDocument::createTextNode()
  • DOMDocument::createComment()
  • DOMDocument::createCDATASection()
  • DOMDocumentFragment::appendXML()

always treat input data as UTF-8 strings.

This can be verified by passing a malformed UTF-8 string to
DOMDocumentFragment::appendXML(), which issues the warning:

DOMDocumentFragment::appendXML(): Entity: line 1: parser error : Input is
not proper UTF-8, indicate encoding !

regardless of the encoding property value. (Note that we can’t “indicate
encoding” for appendXML() input, since it is an XML fragment rather than a
full XML document.)

The other methods mentioned above process input data “as is” without any
transformation and don’t throw warnings. Problems will come later, however,
when you try to dump the document to a string:

"Warning: DOMDocument::saveXML(): output conversion failed due to conv
error, ..."
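
Since all of these methods expect UTF-8, non-UTF-8 data has to be converted
before it is added to the document. A minimal sketch (the $cp1251String
variable is built on the fly just for illustration):

$cp1251String = iconv('UTF-8', 'Windows-1251', 'Привет'); // some Windows-1251 input

$dom  = new DOMDocument('1.0', 'Windows-1251');
$root = $dom->appendChild($dom->createElement('root'));

// Wrong: the bytes are stored as-is, and saveXML() later fails with the
// "output conversion failed" warning.
// $root->appendChild($dom->createTextNode($cp1251String));

// Right: convert to UTF-8 first; the declared document encoding is only
// applied when the document is dumped.
$root->appendChild($dom->createTextNode(iconv('Windows-1251', 'UTF-8', $cp1251String)));

echo $dom->saveXML();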

3. Text nodes and CDATA are stored as UTF-8 without transformations

The DOMNode::$nodeValue, DOMText::$wholeText,
DOMCharacterData::$data, and DOMComment::$data
properties also store data without any transformations, i.e. in UTF-8.

The only exception is when we add non-UTF-8 data manually. This creates an
incorrect DOM tree and, as mentioned above, leads to problems when
serializing the document (the “output conversion failed” error).

The DOMXPath class also always works in UTF-8 mode. This applies to XPath
expressions as well as to the retrieved node sets.

Both DOMNode::$nodeValue and (as discussed below) the result of the
DOMDocument::saveXML() method used in XML subtree dump mode are in UTF-8.
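
A short sketch (the Windows-1251 document is built on the fly just for
illustration):

$xml = iconv('UTF-8', 'Windows-1251',
    '<?xml version="1.0" encoding="Windows-1251"?><root>Привет</root>');

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);
$node  = $xpath->query('/root')->item(0);

// Values retrieved through DOMXPath/$nodeValue are UTF-8, regardless of the
// encoding the document was parsed from.
var_dump($node->nodeValue === 'Привет'); // bool(true)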

4. Document encoding does not affect loading behavior

The encoding property (DOMDocument::$encoding) value also
doesn’t affect DOMDocument::loadXML() or
DOMDocument::loadHTML().

The only way to specify the encoding of the document being parsed is to
declare it explicitly in the document header.

For XML, this should look familiar:

<?xml version="1.0" encoding="Windows-1251"?>

For HTML, you need to declare a charset in a Content-Type meta http-equiv
tag:

<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=Windows-1251">

Note: while parsing, the DOMDocument encoding attribute is overwritten with
the corresponding value from the document header.

If encoding is not declared in the XML/HTML header, the input string is
parsed as

  • a UTF-8 string by loadXML();
  • an ISO-8859-1 string by loadHTML() (!!!!), in accordance with the HTTP 1.1 standard (RFC 2068, section 3.7.1);
  • and DOMDocument::$encoding is set to null.
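
A sketch illustrating these defaults (the exact behavior may differ slightly
between libxml versions):

$html = '<p>café</p>'; // UTF-8 bytes, no charset declaration anywhere

$dom = new DOMDocument();
$dom->loadHTML($html);

// No encoding was declared, so the property stays null ...
var_dump($dom->encoding); // NULL

// ... and the UTF-8 input was treated as ISO-8859-1, so the stored node value
// no longer matches the original string (it now contains "cafÃ©").
var_dump($dom->getElementsByTagName('p')->item(0)->nodeValue === 'café'); // bool(false)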

Clearly, correct encoding handling is a harder problem for HTML parsing than
for XML: XML has a more formal specification and stricter rules, its encoding
is declared in the XML declaration at the very start of the document, and its
default encoding is UTF-8, which covers the whole Unicode range. So the rest
of the article will mostly deal with HTML parsing problems.

Encoding issues with HTML parsing

Unfortunately, loadHTML() doesn’t always correctly recognize the charset
declared in the Content-type HTTP-EQUIV meta tag.

The following things act as blockers:

  • Any non-ASCII symbol occurring before the Content-type HTTP-EQUIV meta tag;
  • Any symbol that is invalid from an encoding point of view occurring in the document. E.g. the Content-type meta tag declares charset=UTF-8, but the actual HTML markup contains invalid UTF-8 sequences.

The second, extremely frustrating, issue is that it’s impossible to tell
whether encoding recognition and conversion failed during HTML parsing. In
both cases mentioned above, the input will be processed as an ISO-8859-1
string and converted to UTF-8 (using an ISO-8859-1 => UTF-8 conversion) while
building the DOM tree, but the DOMDocument::$encoding property will still be
“correctly” set to the Content-type tag value, e.g. Windows-1251.

As a result, there is no way to detect that the encoding of the parsed HTML
has been lost.

The easiest solution would appear to be to use loadXML() to parse the HTML
document, since it handles encoding more strictly. However, it will often
fail to parse the document at all where loadHTML() is lenient: unpaired tags,
HTML entities, and so on. This, unfortunately, means that a large class of
perfectly valid HTML documents can’t be loaded as DOMDocument objects this
way, and you have to rely on loadHTML().

That said, it’s possible to use the following workaround:

  1. Insert an additional <head> section with the appropriate Content-type HTTP-EQUIV meta tag immediately following the opening <html> tag.
  2. Optionally convert data to the specified encoding (UTF-8 is the best candidate) using iconv() with the ‘//IGNORE’ postfix added to the target encoding name.
  3. Remove the supplementary <head> section following document
    parsing:

$dom = new DOMDocument();
$dom->loadHTML($htmlString);
$xpath     = new DOMXPath($dom);
$dummyHead = $xpath->query('/html/head')->item(0);
$dummyHead->parentNode->removeChild($dummyHead);
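
For completeness, steps 1 and 2 might look roughly like this (a sketch; the
sample page, the $sourceEncoding variable and the regular expression are
assumptions, not code from Zend_Search_Lucene):

// A hypothetical raw page and its (known or detected) encoding.
$sourceEncoding = 'Windows-1251';
$htmlString     = iconv('UTF-8', $sourceEncoding, '<html><body><p>Привет</p></body></html>');

// Step 1: insert a dummy <head> with a Content-type meta tag immediately
// after the opening <html> tag (charset=UTF-8, since step 2 converts to UTF-8).
$dummyHead  = '<head><meta http-equiv="Content-type" content="text/html; charset=UTF-8"></head>';
$htmlString = preg_replace('/<html[^>]*>/i', '$0' . $dummyHead, $htmlString, 1);

// Step 2 (optional): convert the data to UTF-8, silently dropping byte
// sequences that are invalid in the source encoding.
$htmlString = iconv($sourceEncoding, 'UTF-8//IGNORE', $htmlString);

// Step 3: parse, then remove the dummy <head> as shown above.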

The above approach has the following limitations:

  • The HTML markup must not contain binary data.
  • The page must not contain non-ASCII symbols before the opening <html> tag (HTML comments, for example, may appear there).
  • It requires manual recognition of the input type: a complete HTML page or an HTML fragment (an HTML fragment has to be wrapped in <html><head>…</head><body>…</body></html> tags to indicate the encoding).

How can we recognize page encoding? In many cases it’s already known.

One option is to parse the document twice. The first pass gives the
recognized encoding in the DOMDocument::$encoding property. This approach is
suitable for small documents, where the additional pass doesn’t add too much
overhead.
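
A sketch of the two-pass approach, assuming $htmlString holds the raw page
(the error suppression and the fallback value are my own choices):

// Pass 1: parse only to read the encoding that libxml recognized.
$probe = new DOMDocument();
@$probe->loadHTML($htmlString); // suppress markup warnings
$encoding = ($probe->encoding !== null) ? $probe->encoding : 'ISO-8859-1';

// Pass 2: convert the input to UTF-8 and re-parse it using the dummy-<head>
// workaround described above.
$htmlString = iconv($encoding, 'UTF-8//IGNORE', $htmlString);
// ... insert the <head> with charset=UTF-8 and call loadHTML() again ...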

The universal and most correct way is to parse the beginning of the document
manually using XMLReader (http://php.net/manual/en/book.xmlreader.php) or the
XML Parser PHP extension (the choice is up to your personal preferences).
This approach also provides the ability to check a) whether an input string
is a complete HTML document or just an HTML fragment; and b) whether the
document contains comments before the opening <html> tag (and allows you to
remove them and later re-insert them into the loaded document).

I also want to draw your attention to the following characteristics of
DOMDocument::loadHTML()’s behavior:

  • If the input is an HTML fragment, it’s automatically wrapped in <html><body>…</body></html> (if we’ve already wrapped it ourselves, this feature doesn’t matter).
  • All tags are automatically converted to lower case.
  • The tag auto-closing algorithm differs from the algorithm used by DOMDocument::loadXML() method in recovery mode (the HTML auto-closing algorithm is less greedy).
  • DOMDocument::loadHTML() checks whether a tag is allowed within the current context. E.g. a <head> tag is not allowed within a <body> section.
  • DOMDocument::loadHTML() “knows” about standard HTML entities.
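
A quick illustration of some of these points (the exact DOCTYPE and entity
handling in the output may vary between libxml versions):

$dom = new DOMDocument();
$dom->loadHTML('<P>caf&eacute;'); // a fragment: upper-case tag, unclosed, standard HTML entity

// The fragment has been wrapped in <html><body>, the tag lower-cased, and the
// entity decoded into a UTF-8 character inside the DOM tree.
var_dump($dom->getElementsByTagName('p')->item(0)->nodeValue); // string(5) "café"

echo $dom->saveHTML(); // typically ...<html><body><p>caf&eacute;</p></body></html>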

5. Save/dumping operations and encoding

The DOMDocument::saveXML() and
DOMDocument::saveHTML() operations use the following rules to
define output encoding:

  • The encoding of an entire XML dump (using the DOMDocument::saveXML()
    method) is defined by the DOMDocument::$encoding property. This is
    actually the only case where this property is used. Characters which are
    not included in the specified character set are dumped as character
    references (&#XXXX;). If the encoding was not recognized during parsing
    or not specified at document creation (the DOM extension doesn’t allow
    setting the encoding attribute to null or any non-valid value later),
    then the output is created in ASCII and all non-ASCII characters are
    dumped as character references.
  • Node or XML subtree dumping using the
    DOMDocument::saveXML($node) method is always performed in
    UTF-8 (both dump modes are illustrated in the sketch after this list).
  • HTML document dump encoding is defined by the first /html/head/meta
    http-equiv Content-type tag (and is not affected by
    DOMDocument::$encoding). Important note: the Content-type tag is searched
    for in case-sensitive mode, so the ‘html’, ‘head’ and ‘meta’ tag names as
    well as the ‘content’ attribute name must be in lower case. Characters
    which are not included in the specified character set (all non-ASCII
    symbols, if the Content-type tag is missing or contains symbols that are
    invalid from the libxml point of view) are dumped as character references
    (&#XXXX;). So we can change the dumped document encoding if necessary
    (e.g. set it to UTF-8):
    
    $dom = new DOMDocument();
    $dom->loadHTML($htmlString);
    ...
    $xpath    = new DOMXPath($dom);
    $metaTags = $xpath->query('/html/head/meta');
    
    // Unfortunately DOMXPath supports only XPath 1.0 and we have to iterate
    // through meta tags instead of selecting the node using
    // '/html/head/meta[lower-case(@http-equiv)="content-type"]'
    // (fn:lower-case() function came with XPath 2.0)
    for ($count = 0; $count < $metaTags->length; $count++) {
        $httpEquivAttribute = $metaTags->item($count)->attributes->getNamedItem('http-equiv');
        if ($httpEquivAttribute !== null  
            && strtolower($httpEquivAttribute->value) == 'content-type'
        ) {
            $fragment = $dom->createDocumentFragment();
            $fragment->appendXML('<meta http-equiv="Content-type" content="text/html; charset=UTF-8"/>');
            $metaTags->item($count)->parentNode->replaceChild($fragment, $metaTags->item($count));
            break;
        }
    }
    // Do nothing if meta tag is not found
    

    Changing the output encoding doesn’t affect the actual document content -
    the browser performs the reverse conversion while loading the document,
    according to the value specified in the Content-Type tag.
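
As a sketch, the difference between the two saveXML() dump modes mentioned
above looks like this:

$dom  = new DOMDocument('1.0', 'Windows-1251');
$root = $dom->appendChild($dom->createElement('root', 'Привет'));

// Full-document dump: follows DOMDocument::$encoding (Windows-1251 bytes,
// plus the matching XML declaration).
$full = $dom->saveXML();
var_dump(strpos($full, 'Привет'));              // bool(false) - not UTF-8 bytes

// Subtree dump: always UTF-8, regardless of DOMDocument::$encoding.
$subtree = $dom->saveXML($root);
var_dump(strpos($subtree, 'Привет') !== false); // bool(true)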


3 Responses to “PHP DOM XML extension encoding processing”

  1. expablo Says:

    DOMNode::$nodeValue will automatically convert HTML entities into UTF-8 chars (when the node type is #text). And not just basic HTML entities like &lt; and &gt;, but complex ones too. For example:

    &#21488;&#21271;

    will be converted into (you should see 2 Chinese chars):

    台北

    In this same example, DOMDocument::saveHTML() will return HTML entities as they are. Try this to see it for yourself:

    <?php

    $html = '<p>&#21488;&#21271;</p>';
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Dump HTML
    echo $doc->saveHTML();

    // Dump text
    $root = $doc->documentElement;
    foreach ($root->childNodes as $child) {
        echo $child->nodeValue;
    }

    ?>

    The output will look something like this:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body><p>&#21488;&#21271;</p></body></html>
    台北

    Tested with PHP 5.3.0

    I believe you should explicitly take care of all input and output, before and after parsing with this DOM extension. Just to make sure, because such behaviors are not documented in the, otherwise excellent, official PHP documentation. I’ll go download the PHP source and see if I can figure out all the details. My C is a bit rusty, but I hope I’ll get what I need.

  2. tedmasterweb Says:

    Thank you, thank you, THANK YOU!!!!

    This is an awesome article, extremely well written and addressed the exact issue I was having. I simply cannot thank you enough!

    Sincerely,

    Ted Stresen-Reuter
    http://tedmasterweb.com