I recently worked with PHP's DOM XML extension while developing the HTML highlighting capabilities of Zend Framework's Zend_Search_Lucene, and uncovered some undocumented features and issues in the extension with regard to character encoding. The information in this article should also apply to other libxml-based DOM implementations, as PHP's DOM extension simply wraps that library.

1. Internal document encoding is always UTF-8

The DOMDocument::$encoding property doesn't affect anything except the dump operations - e.g., DOMDocument::saveXML().

You can set it in the constructor:


$dom = new DOMDocument('1.0', 'Windows-1251');

or set it later:


$dom = new DOMDocument('1.0');
$dom->encoding = 'Windows-1251';

but the internal document representation encoding is always UTF-8.
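A minimal sketch of this behavior follows. It assumes ISO-8859-1 in place of Windows-1251, because libxml can convert to it without external converters, and the variable names are only illustrative:

```php
<?php
// Assumption: ISO-8859-1 stands in for Windows-1251 to keep the sketch portable.
$dom  = new DOMDocument('1.0', 'ISO-8859-1');
$root = $dom->createElement('root');
$dom->appendChild($root);

// "café" passed in as UTF-8 bytes (é is 0xC3 0xA9).
$root->appendChild($dom->createTextNode("caf\xC3\xA9"));

// The tree keeps the UTF-8 bytes, whatever the encoding property says.
$stored = $root->nodeValue;

// Only the dump is converted to the declared encoding (é becomes 0xE9).
$dump = $dom->saveXML();

echo bin2hex($stored), "\n";
echo bin2hex($dump), "\n";
```

Comparing the two hex dumps shows where the conversion happens: in the serializer, not in the tree.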

2. Input data is always treated as UTF-8

The following methods for adding textual content to a document:

  • DOMDocument::createTextNode()
  • DOMDocument::createComment()
  • DOMDocument::createCDATASection()
  • DOMDocumentFragment::appendXML()

always treat input data as UTF-8 strings.

This can be verified by passing an invalid UTF-8 string to DOMDocumentFragment::appendXML(). It issues the warning:

DOMDocumentFragment::appendXML(): Entity: line 1: parser error : Input is
not proper UTF-8, indicate encoding !

without regard for the encoding property value. (It's necessary to note that we can't "indicate encoding" for appendXML() input since it is an XML fragment, not a full XML document).

The other methods mentioned above process input data "as is" without any transformation and don't throw warnings. Problems will come later, however, when you try to dump the document to a string:

"Warning: DOMDocument::saveXML(): output conversion failed due to conv
error, ..."
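The appendXML() case can be reproduced with a short sketch like the one below. It assumes libxml's internal error collection is used instead of PHP warnings, so the messages can be inspected; the variable names are hypothetical:

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// The encoding property is set, but appendXML() still expects UTF-8.
$dom      = new DOMDocument('1.0', 'ISO-8859-1');
$fragment = $dom->createDocumentFragment();

// 0xE9 is "é" in ISO-8859-1, but it is not a valid UTF-8 sequence here.
$accepted = $fragment->appendXML("<p>caf\xE9</p>");

// Gather whatever libxml reported for later inspection.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message);
}
libxml_clear_errors();

var_dump($accepted);
print_r($messages);
```

The return value is the reliable signal here; the exact message text may differ between libxml versions.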

3. Text nodes and CDATA are stored as UTF-8 without transformations

The DOMNode::$nodeValue, DOMText::$wholeText, DOMCharacterData::$data, and DOMComment::$data properties also store data without any transformations, i.e. in UTF-8.

The only exception is the case when we add non-UTF-8 data manually. It creates an incorrect DOM tree and, as mentioned above, we will have problems with serializing the document ("output conversion failed" error).

The DOMXPath class also always works in UTF-8 mode. This applies to XPath expressions as well as to the retrieved node sets.

Both DOMNode::$nodeValue and (as discussed below) the result of the DOMDocument::saveXML() method used in XML subtree dump mode are in UTF-8.
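As an illustration, the sketch below loads a document declared as ISO-8859-1 (chosen here only because libxml supports it natively) and queries it with a UTF-8 expression. The sample document and variable names are made up for the example:

```php
<?php
// Source document is ISO-8859-1: "é" is the single byte 0xE9.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
     . "<list><item>caf\xE9</item><item>tea</item></list>";

$dom = new DOMDocument();
$dom->loadXML($xml);

// The XPath expression must use UTF-8 ("é" is 0xC3 0xA9),
// regardless of the encoding the document was loaded from.
$xpath   = new DOMXPath($dom);
$matches = $xpath->query("//item[. = 'caf\xC3\xA9']");

// The retrieved node value is UTF-8 as well.
$value = $matches->item(0)->nodeValue;

echo $matches->length, ' ', bin2hex($value), "\n";
```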

4. Document encoding does not affect loading behavior

The encoding property (DOMDocument::$encoding) value also doesn't affect DOMDocument::loadXML() or DOMDocument::loadHTML().

The only way to declare the document encoding we want to parse is to declare it explicitly in the document header.

For XML, this should look familiar:

<?xml version="1.0" encoding="Windows-1251"?>

For HTML, you need to declare a charset in a Content-Type meta http-equiv tag:

<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=Windows-1251">

Note: during parsing, the DOMDocument encoding property is overwritten with the corresponding value from the document header.

If encoding is not declared in the XML/HTML header, the input string is parsed as

  • a UTF-8 string by loadXML();
  • an ISO-8859-1 string by loadHTML() (!!!!), in accordance with the HTTP 1.1 standard (RFC2068, section 3.7.1);
  • and DOMDocument::$encoding is set to null.
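The loading rules above can be checked with a sketch along these lines. ISO-8859-1 is used as the declared encoding only for portability, and the variable names are hypothetical:

```php
<?php
libxml_use_internal_errors(true);

// The encoding passed to the constructor is ignored by loadXML():
// the XML declaration in the input decides how the bytes are read.
$xmlDoc = new DOMDocument('1.0', 'UTF-8');
$xmlDoc->loadXML("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root>caf\xE9</root>");
$xmlEncoding = $xmlDoc->encoding;                  // taken from the declaration
$xmlText     = $xmlDoc->documentElement->nodeValue; // stored as UTF-8

// No charset declared: loadHTML() reads the single byte 0xE9 as ISO-8859-1.
$htmlDoc = new DOMDocument();
$htmlDoc->loadHTML("<html><body><p>caf\xE9</p></body></html>");
$htmlText = $htmlDoc->getElementsByTagName('p')->item(0)->nodeValue;

libxml_clear_errors();

echo $xmlEncoding, ' ', bin2hex($xmlText), ' ', bin2hex($htmlText), "\n";
```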

Clearly, correct encoding transformation is harder to achieve for HTML parsing than for XML: XML has a more formal specification and stricter rules, its encoding is declared in the XML declaration at the very start of the document, and its default encoding is UTF-8, which covers the whole Unicode range. So the rest of the article will mostly touch on HTML parsing problems.

Encoding issues with HTML parsing

Unfortunately, loadHTML() doesn't always correctly recognize the defined Content-type HTTP-EQUIV meta tag.

The following things act as blockers:

  • Any non-ASCII symbol occurring before the Content-type HTTP-EQUIV meta tag;
  • Any invalid (from an encoding point of view) symbol occurring in the document. E.g. the Content-type meta tag declares 'charset=UTF-8', but the actual HTML markup contains invalid UTF-8 sequences.

The second extremely upsetting thing is that it's impossible to detect whether encoding recognition and transformation failed during HTML parsing. In both cases mentioned above, the input will be processed as an ISO-8859-1 string and converted to UTF-8 (using ISO-8859-1 => UTF-8 conversion) while creating the DOM tree, but the DOMDocument::$encoding property will be "correctly" set to the charset declared in the Content-type tag, e.g. Windows-1251.

As a result, we can't recognize if the encoding of the parsed HTML has been lost.

The easiest solution would appear to be to use loadXML() to parse the HTML document, as it handles encoding more strictly. However, loadXML() will often fail to parse the document at all in places where loadHTML() is lenient: unpaired tags, HTML entities, etc. This, unfortunately, means that a large class of correct HTML documents can't be loaded with loadXML(), so you must rely on loadHTML().

That said, it's possible to use the following workaround:

  1. Insert an additional <head> section with the appropriate Content-type HTTP-EQUIV meta tag immediately following the opening <html> tag.
  2. Optionally convert data to the specified encoding (UTF-8 is the best candidate) using iconv() with the '//IGNORE' postfix added to the target encoding name.
  3. Remove the supplementary <head> section following document parsing:

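// loadHTML() is an instance method; calling it statically fails on current PHP versions
$dom = new DOMDocument();
$dom->loadHTML($htmlString);

// The supplementary <head> is the first /html/head element
$xpath     = new DOMXPath($dom);
$dummyHead = $xpath->query('/html/head')->item(0);
$dummyHead->parentNode->removeChild($dummyHead);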
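Put together, the three steps might look like the sketch below. The helper name loadHtmlWithCharset() and its parameters are hypothetical, it assumes the iconv extension is available, and it assumes the input is a complete page with an opening <html> tag:

```php
<?php
// Hypothetical helper illustrating the workaround; not part of the DOM extension.
function loadHtmlWithCharset(string $html, string $sourceCharset): DOMDocument
{
    // Step 2: convert to UTF-8, dropping bytes that cannot be converted.
    $utf8 = @iconv($sourceCharset, 'UTF-8//IGNORE', $html);
    if ($utf8 === false) {
        throw new RuntimeException("Cannot convert input from $sourceCharset");
    }

    // Step 1: inject a supplementary <head> right after the opening <html> tag.
    $dummyHead = '<head><meta http-equiv="Content-Type" '
               . 'content="text/html; charset=UTF-8"></head>';
    $utf8 = preg_replace('/<html\b[^>]*>/i', '$0' . $dummyHead, $utf8, 1, $count);
    if ($count === 0) {
        throw new InvalidArgumentException('No opening <html> tag found');
    }

    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($utf8);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Step 3: remove the supplementary <head> again.
    $xpath = new DOMXPath($dom);
    $head  = $xpath->query('/html/head')->item(0);
    if ($head !== null) {
        $head->parentNode->removeChild($head);
    }

    return $dom;
}

// Usage: a UTF-8 page without any charset declaration of its own.
$doc  = loadHtmlWithCharset("<html><body><p>caf\xC3\xA9</p></body></html>", 'UTF-8');
$text = $doc->getElementsByTagName('p')->item(0)->nodeValue;
echo bin2hex($text), "\n";
```

Converting to UTF-8 first keeps the injected meta tag constant, so the declared charset and the actual bytes cannot drift apart.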

The above approach has the following limitations:

  • The HTML markup must not contain binary data.
  • The page must not contain non-ASCII symbols before the opening <html> tag (there may be page comments).
  • It requires manual recognition of the input type: a complete HTML page or an HTML fragment (an HTML fragment has to be wrapped with <html><head>...</head><body>...</body></html> tags to indicate encoding).

How can we recognize page encoding? In many cases it's already known.

One option is to parse the document twice. The first pass gives the recognized encoding in the DOMDocument::$encoding property. This way is usable for small documents since it doesn't produce too much overhead with the additional pass.
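A rough sketch of the two-pass idea is shown below. The function name detectDeclaredCharset() is hypothetical, and the result is only the charset the parser picked up from the page, not proof that the conversion itself succeeded:

```php
<?php
// Hypothetical helper: the first pass only reads the declared charset.
function detectDeclaredCharset(string $html): ?string
{
    $previous = libxml_use_internal_errors(true);
    $probe = new DOMDocument();
    $probe->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // null when no charset was declared or recognized
    return $probe->encoding;
}

$page = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
      . '</head><body><p>text</p></body></html>';

$charset = detectDeclaredCharset($page);
var_dump($charset);
```

The detected name can then be used for the conversion step before the second, real parsing pass.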

The universal and most correct way is to parse the beginning of the document manually using XMLReader or the XML Parser PHP extension (the choice is up to your personal preferences). This approach also provides the ability to check a) if an input string is a complete HTML document or just an HTML fragment; and b) if the document contains comments before the opening <html> tag (and allows you to remove them and later insert them into the loaded document).

I also want to draw your attention to the following characteristics of DOMDocument::loadHTML()'s behavior:

  • If an input is an HTML fragment, it's automatically wrapped by <html><body>...</body></html> (if we've already done it, this feature is not important for us).
  • All tags are automatically converted to lower case.
  • The tag auto-closing algorithm differs from the algorithm used by DOMDocument::loadXML() method in recovery mode (the HTML auto-closing algorithm is less greedy).
  • DOMDocument::loadHTML() checks if a tag is allowed within the current context. E.g. a <head> tag is not allowed within a <body> section.
  • DOMDocument::loadHTML() "knows" about standard HTML entities.
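Some of these characteristics are easy to observe directly. The sketch below feeds loadHTML() an upper-case fragment containing HTML entities; the sample markup and variable names are made up for illustration:

```php
<?php
$previous = libxml_use_internal_errors(true);

// An HTML fragment with an upper-case tag and standard HTML entities.
$dom = new DOMDocument();
$dom->loadHTML('<P>Fish &amp; chips&nbsp;&copy;</P>');

libxml_clear_errors();
libxml_use_internal_errors($previous);

// The fragment was wrapped: the root element is <html>.
$rootName = $dom->documentElement->nodeName;

// The tag name was lower-cased, so it is found as "p" inside <body>.
$p          = $dom->getElementsByTagName('p')->item(0);
$parentName = $p->parentNode->nodeName;

// Entities were resolved to UTF-8 characters (&nbsp; is 0xC2 0xA0, &copy; is 0xC2 0xA9).
$text = $p->nodeValue;

echo $rootName, ' ', $parentName, ' ', bin2hex($text), "\n";
```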

5. Save/dumping operations and encoding

The DOMDocument::saveXML() and DOMDocument::saveHTML() operations use the following rules to define output encoding:

  • The entire XML dump (using DOMDocument::saveXML() method) encoding is defined by the DOMDocument::$encoding property. It's actually the only case when this property is used. Characters which are not included in the specified character set are dumped as character references (&#XXXX;). If encoding is not recognized during parsing or not mentioned during document creation (the DOM extension doesn't allow setting the encoding attribute to null or any non-valid value later), then output will be created in ASCII and all non-ASCII characters will be dumped as character references.
  • Node or XML subtree dumping using the DOMDocument::saveXML($node) method is always performed in UTF-8.
  • HTML document dump encoding is defined by the first /html/head/meta http-equiv Content-type tag (and is not affected by DOMDocument::$encoding). Important note: the Content-type tag is searched in case-sensitive mode, so 'html', 'head' and 'meta' tag names as well as the 'content' attribute name must be in lower case. Characters which are not included in the specified character set (all non-ASCII symbols, if content-type tag is not present or contains non-valid symbols from the libxml point of view), are dumped as character references (&#XXXX;). So we can change the dumped document encoding if necessary (e.g. set it to UTF-8):
    
    // loadHTML() is an instance method, so create the document first
    $dom = new DOMDocument();
    $dom->loadHTML($htmlString);
    ...
    $xpath    = new DOMXPath($dom);
    $metaTags = $xpath->query('/html/head/meta');
    
    // Unfortunately DOMXPath supports only XPath 1.0 and we have to iterate
    // through meta tags instead of selecting the node using
    // '/html/head/meta[lower-case(@http-equiv)="content-type"]'
    // (fn:lower-case() function came with XPath 2.0)
    for ($count = 0; $count < $metaTags->length; $count++) {
        $httpEquivAttribute = $metaTags->item($count)->attributes->getNamedItem('http-equiv');
        if ($httpEquivAttribute !== null
            && strtolower($httpEquivAttribute->value) == 'content-type'
        ) {
            // The fragment must be created by the same document ($dom)
            $fragment = $dom->createDocumentFragment();
            $fragment->appendXML('<meta http-equiv="Content-type" content="text/html; charset=UTF-8"/>');
            $metaTags->item($count)->parentNode->replaceChild($fragment, $metaTags->item($count));
            break;
        }
    }
    // Do nothing if meta tag is not found
    
    Changing the output encoding doesn't affect the actual document content: when loading the document, the browser converts it back according to the value specified in the Content-Type tag.
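The difference between a full XML dump and a subtree dump can be sketched as follows. The example assumes ISO-8859-1 as the document encoding, since it contains "é" but not "€", and the variable names are only illustrative:

```php
<?php
// Assumption: ISO-8859-1 is used because it can represent "é" but not "€".
$dom  = new DOMDocument('1.0', 'ISO-8859-1');
$root = $dom->createElement('root');
$dom->appendChild($root);

// "café €" as UTF-8 bytes.
$root->appendChild($dom->createTextNode("caf\xC3\xA9 \xE2\x82\xAC"));

// Full dump: converted to ISO-8859-1; "€" falls back to a character reference.
$fullDump = $dom->saveXML();

// Subtree dump: stays in UTF-8 even though the encoding property is set.
$nodeDump = $dom->saveXML($root);

echo bin2hex($fullDump), "\n";
echo bin2hex($nodeDump), "\n";
```

When mixing both kinds of dumps in one output, the subtree dump therefore needs an explicit conversion to the target encoding.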