XML and PHP 5

Unless this is your first exposure to PHP, it is probably safe to assume that by now you have seen the PHP 5 release announcements, or have at least heard that XML support has been given a complete overhaul. I am still surprised by how many of the people I talk to are still developing with PHP 4 and actively working with XML, rather than upgrading to PHP 5 to use the new XML tool sets. I guess it might be a bit scary for them to think about such an upgrade. Both ext/xslt and ext/domxml have been removed from the core PHP distribution and replaced with completely new extensions, so all older code relying on either of those extensions must be reviewed, and typically rewritten, in order to upgrade.

What most people don’t realize is that upgrading the code is really not all that difficult. In fact, the benefits of doing so, even for those not yet developing with XML or services, are significant. When an upgrade can provide a developer with simpler tool sets, faster performance and better utilization of system resources, business-wise, it just makes sense to upgrade, as it ultimately affects both productivity and the bottom line for the company. Personally, the benefits from the new XML tools were so great that I went into production on a few company servers, hosting some of our XML based applications, while PHP 5.1 was still in beta.

BACKGROUND

As of the PHP 5.2.x releases, there are a good number of extensions for working with XML data in a variety of ways. These can be broken down into the following categories: tree based, streaming, event based and transformation. Tree based parsers, encompassing ext/dom and ext/simplexml, hold an entire XML document in memory, allowing it to be manipulated in any manner. Streaming APIs, such as ext/xmlreader and ext/xmlwriter, allow XML to be read from or written to PHP streams, resulting in very low memory usage but providing very focused, uni-directional XML support. Event based parsers, such as the familiar ext/xml from PHP 4, view an XML document as a series of events that are fired off as different portions of the document are encountered. Lastly, the XSL extension provides support for transforming an XML document into another document.

The goal for XML support in PHP 5 was not only to provide a solid base of tool sets for working with XML, but also to provide some unity amongst the tool sets themselves. The libxml2 and libxslt libraries, from the GNOME project, were chosen as the foundation for the XML support. Not only are these some of the fastest libraries, but because they work together, it became possible to allow different XML based extensions to inter-operate with little to no additional overhead. For example, under PHP 4, one might build an XML data tree using domxml and perform XSL transformations using the xslt extension. To pass the data along, the XML tree from ext/domxml would need to be serialized to either a string or a file before ext/xslt could use it. Working with XML is already time and memory consuming enough. These additional required steps just compound the problem and ultimately decrease the number of users your hardware can serve at a time. In short, it can cost your company money, not only in the time developers spend trying to optimize code to alleviate the situation, but also in the better, and possibly additional, hardware that the extra overhead most likely requires.

Things are a bit different using XML in PHP 5. In a similar scenario, you could use one of the new native extensions, like DOM or SimpleXML, to manipulate the XML data tree. This tree can then be passed directly to and used by the XSL extension without incurring any additional overhead. There is no serialization nor copying involved; ext/xsl is able to work directly with the XML tree it is given, resulting in a significant improvement in use of system resources as compared to coding a similar task in PHP 4. So now that you have an idea why XML support got a face lift, you might like to know about the different tools available to you.
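The shared tree can be seen even without involving XSL. The following is a minimal sketch, assuming an inline document and a made-up status element chosen purely for illustration: the document is loaded with SimpleXML, changed through DOM via dom_import_simplexml(), and the changes are then read back through SimpleXML without any serialization step in between.

```php
<?php
// Hypothetical inline document, used so the sketch is self-contained.
$sxe = simplexml_load_string(
    '<article><name>XML in PHP 5</name><author>Rob Richards</author></article>'
);

// Obtain a DOM view of the same in-memory tree (no serialization involved).
$dom = dom_import_simplexml($sxe);

// Modify the tree through the DOM API...
$dom->setAttribute('lang', 'en');
$dom->appendChild($dom->ownerDocument->createElement('status', 'draft'));

// ...and read the modifications back through SimpleXML.
$lang   = (string) $sxe['lang'];
$status = (string) $sxe->status;

print "lang: $lang, status: $status\n";
```

The same DOM document, $dom->ownerDocument in this sketch, is what would be handed to the XSL extension for a transformation.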

DOM

The XML revolution in PHP began with ext/dom. The domxml extension in PHP 4 was plagued with a number of problems, one of the worst being that the API was constantly changing until the 4.3.x releases, when the extension finally conformed more closely to the W3C DOM specifications. On top of that, ext/domxml did not officially make it out of experimental status until 4.3.10, a simple fact that should make anyone still actively using domxml think twice. In any event, prior to the 4.3 releases, it was always a question whether upgrading to a newer version of PHP would break existing code relying on domxml. Memory leaks and high memory usage were other problems that seemed to dog ext/domxml. While many of these issues were addressed to some degree over the years, as a whole they were too great to fix while also maintaining backwards compatibility. The extension was therefore re-written from scratch, with all prior problems and issues addressed from the start, and finally emerged as ext/dom in PHP 5. A major plus for the new extension is that from the beginning it has conformed to the W3C DOM specifications, allowing developers coming from other languages to easily start writing DOM code without having to learn a new API.

Earlier on, I had mentioned that it is often not too difficult to migrate code from ext/domxml to use ext/dom. The biggest catch here is that the code must have been written using the PHP 4.3.x W3C DOM conforming functions. Using the document in Listing 1 as the base XML document, take a look at the PHP 4 code in Listing 2, using ext/domxml, and the PHP 5 code in Listing 3, using ext/dom.

Listing 1

<?xml version="1.0"?>
<article>
   <name>XML in PHP 5</name>
   <author>Rob Richards</author>
</article>

Listing 2

<?php
$doc = domxml_open_file('article.xml');
$root = $doc->document_element();
$node = $root->first_child();

while ($node) {
   if (($node->node_type() == XML_ELEMENT_NODE) &&
       ($node->node_name() == 'name')) {
      $content = $node->first_child();
      $output = $content->node_value();
      print "Output: $output\n";
      break;
   }
   $node = $node->next_sibling();
}
?>

Listing 3

<?php
$doc = new DOMDocument();
$doc->load('article.xml');
$root = $doc->documentElement;
$node = $root->firstChild;

while ($node) {
   if (($node->nodeType == XML_ELEMENT_NODE) &&
       ($node->nodeName == 'name')) {
      $content = $node->firstChild;
      $output = $content->nodeValue;
      print "Output: $output\n";
      break;
   }
   $node = $node->nextSibling;
}
?>

Comparing the two listings, you should notice that the code is quite similar. The majority of code I have seen using ext/domxml in fact only uses a handful of functions. Upgrading the code to ext/dom, assuming the W3C based functions were used, typically involves removing the underscore, switching to camelCase, and checking whether the function is now a property. For instance, $node->next_sibling() in ext/domxml becomes $node->nextSibling in ext/dom. I have found that around 75% of the cases I run into fall into this category. Once a developer gets comfortable making these changes, even large ext/domxml based code sets can be upgraded in under a day with the aid of some search/replace tools.
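Because ext/dom follows the W3C DOM API, the standard lookup methods are available as well. As a hedged alternative to the manual sibling walk in Listing 3, the sketch below uses getElementsByTagName() on an inline copy of the document (the inline string is an assumption made so the example is self-contained).

```php
<?php
$doc = new DOMDocument();
// Inline stand-in for article.xml so the sketch runs on its own.
$doc->loadXML(
    '<article><name>XML in PHP 5</name><author>Rob Richards</author></article>'
);

// Fetch all <name> elements in document order and take the first, if any.
$names  = $doc->getElementsByTagName('name');
$output = ($names->length > 0) ? $names->item(0)->nodeValue : null;

print "Output: $output\n";
```

Note that getElementsByTagName() searches all descendants, not just the direct children of the root, so it is not a drop-in replacement when element names repeat at different depths.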

SimpleXML

DOM allows a developer to access and manipulate XML in any way needed, but it comes at a price. It is a large and complex API that requires a developer to really understand all the intricate details of working with XML, and it in fact scares many beginners away from even wanting to work with XML. SimpleXML aims to break through all the XML complexities and provide an intuitive and simple, hence the name, API to work with a document.

The vast majority of people working with XML are really only concerned with elements having simple content and maybe the occasional attribute. Instead of trying to wrap your head around trees, nodes and the plethora of methods just to work with this small subset, SimpleXML takes an easier approach and views a document as an object. Elements are represented as properties and attributes as accessors. Using the document from Listing 1, SimpleXML can perform the same functionality as the DOM code from Listing 3, yet in a much clearer and more compact syntax, as demonstrated in the following example:

<?php
$sxe = simplexml_load_file('article.xml');
print "Output: ".$sxe->name."\n";
?>

What took around 20 lines of code using ext/dom can be performed in 2 lines with SimpleXML.
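Attribute access is just as compact. The sketch below assumes a variant of the document that carries a lang attribute on the root and an id attribute on the author element; neither attribute exists in Listing 1, they are added here only to show the array-style accessor syntax.

```php
<?php
// Hypothetical variant of Listing 1 with attributes added for illustration.
$sxe = simplexml_load_string(
    '<article lang="en">'
    . '<name>XML in PHP 5</name>'
    . '<author id="rr">Rob Richards</author>'
    . '</article>'
);

// Child elements are read as properties, attributes with array syntax.
// Casting to string extracts the text value from the SimpleXMLElement.
$name     = (string) $sxe->name;
$lang     = (string) $sxe['lang'];
$authorId = (string) $sxe->author['id'];

print "$name ($lang), author id: $authorId\n";
```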

This extension is also a blessing for developers consuming REST based services. In only a few lines of code, a service can be called and the results accessed. In the following example, SimpleXML is used to query the Yahoo! Search web search service for the first matching result. Once the call has been made, the title and URL of the match are output.

<?php
$terms = urlencode('php 5 xml new');
$url = 'http://api.search.yahoo.com/WebSearchService/V1/webSearch';
$query = '?appid=demo&query='.$terms.'&results=1';

$serviceurl = $url.$query;

$results = simplexml_load_file($serviceurl);

print $results->Result->Title."\n";
print $results->Result->DisplayUrl."\n";
?>

Which will output:

Zend Developer Zone | XML in PHP 5 - What's New?
www.zend.com/php5/articles/php5-xmlphp.php

XMLReader

Working with XML is typically memory intensive and slow. For these reasons, when only read-only access to a document is required, ext/xml, the event based parser, has usually been a developer's first choice of APIs. Because only portions of a document reside in memory at a time, and events are fired off as the document is read, system resource usage is minimal, and the document can be accessed as it is parsed rather than after the entire document has been parsed. In order to reap these benefits, a developer needs to deal with the limitations of ext/xml: the document cannot be validated as it is parsed, the developer needs to understand how to code using callbacks tied to events and, lastly, there is limited namespace support. The XMLReader extension, in my opinion, is the ultimate replacement for ext/xml. You get all the benefits that ext/xml has to offer, with none of the drawbacks.
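For contrast, here is a rough sketch of what the callback style of ext/xml looks like when listing the element names of a document. The ElementCollector class and the inline document are assumptions made for illustration; the parser functions themselves are the standard ext/xml ones.

```php
<?php
// Hypothetical helper class: its methods are the event callbacks.
class ElementCollector
{
    public $names = array();

    // Called by the parser each time an opening tag is encountered.
    public function start($parser, $name, $attribs)
    {
        $this->names[] = $name;
    }

    // Called for each closing tag; nothing to do in this sketch.
    public function end($parser, $name)
    {
    }
}

$collector = new ElementCollector();

$parser = xml_parser_create();
// Case folding is on by default and would upper-case the element names.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
    $parser,
    array($collector, 'start'),
    array($collector, 'end')
);

// Inline stand-in for article.xml; the final argument marks the last chunk.
xml_parse(
    $parser,
    '<article><name>XML in PHP 5</name><author>Rob Richards</author></article>',
    true
);

print implode("\n", $collector->names) . "\n";
```

The parsing logic ends up spread across handler methods, and any state has to be carried between calls by hand, which is the overhead the pull-style API of XMLReader avoids.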

Working with XMLReader couldn’t be any simpler. There is no need to deal with callbacks and mapping functions. You, as the developer, are in control of the forward movement through the document and of what information gets accessed. Simply set the input to be accessed and tell the reader when and where to move. It is so simple that an entire document can be accessed in only a few lines of code.

<?php 
$reader = new XMLReader();
$reader->open('article.xml');

while ($reader->read()) {
   if ($reader->nodeType == XMLReader::ELEMENT) {
      print $reader->localName."\n";
   }
}
?>

Resulting in the following output:

article
name
author

In addition to its simplicity, it also provides advanced features such as document validation via DTD, RelaxNG and XML Schemas. When performing validation in DOM using XML Schemas, you are forced to first load the entire document and then check the validity. With XMLReader, you have the ability to determine the validity of a document while it is being parsed, giving you the option to stop further parsing in the event it fails along the way.
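As a minimal sketch of validating while parsing, the function below turns on DTD validation and stops reading as soon as the reader reports the document as invalid. The function name validatesAgainstDtd, the inline DTD and the editor element are assumptions made for illustration, and the sketch assumes well-formed input.

```php
<?php
// Hypothetical helper: returns false as soon as a validity error is seen.
function validatesAgainstDtd($xml)
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $reader = new XMLReader();
    $reader->XML($xml);
    // Must be set after the input is assigned and before the first read().
    $reader->setParserProperty(XMLReader::VALIDATE, true);

    $valid = true;
    while ($reader->read()) {
        if (!$reader->isValid()) {
            $valid = false;
            break; // no need to parse the rest of the document
        }
    }

    $reader->close();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $valid;
}

// Inline DTD describing the structure of Listing 1.
$dtd = '<!DOCTYPE article ['
     . '<!ELEMENT article (name, author)>'
     . '<!ELEMENT name (#PCDATA)>'
     . '<!ELEMENT author (#PCDATA)>'
     . ']>';

$good = $dtd . '<article><name>XML in PHP 5</name><author>Rob Richards</author></article>';
// The <editor> element is not declared in the DTD.
$bad  = $dtd . '<article><name>XML in PHP 5</name><editor>nobody</editor></article>';

$goodIsValid = validatesAgainstDtd($good);
$badIsValid  = validatesAgainstDtd($bad);
```

RelaxNG and XML Schema validation follow the same read loop, with the schema assigned through the corresponding XMLReader method instead of the VALIDATE parser property.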

Probably the most advanced feature is the ability to retrieve entire subtrees during parsing. As XMLReader is a streaming parser, normally you only have access to the small piece of the document currently in memory. XMLReader also provides the capability to inter-operate with DOM by allowing the subtree at the current location to be expanded into a DOM tree, allowing DOM operations to be performed, such as appending portions of the document into another DOM based document. The expanded subtree is only a copy so there are limitations to this feature.
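A short sketch of this, assuming a made-up articles document with id attributes: the reader moves forward until it finds the wanted element, expands it into a DOM node, and a deep copy of that node is imported into a separate DOM document.

```php
<?php
// Hypothetical input; in practice this would be a large file or stream.
$xml = '<articles>'
     . '<article id="1"><name>First</name></article>'
     . '<article id="2"><name>Second</name></article>'
     . '</articles>';

$reader = new XMLReader();
$reader->XML($xml);

$target = new DOMDocument();

while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT
        && $reader->localName == 'article'
        && $reader->getAttribute('id') == '2') {
        // expand() returns the current subtree as a DOM node; import a
        // deep copy into the target document before the reader moves on.
        $copy = $target->importNode($reader->expand(), true);
        $target->appendChild($copy);
        break;
    }
}
$reader->close();

print $target->saveXML();
```

The import happens before the reader is advanced or closed, since the expanded node remains tied to the reader.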

XMLWriter

The flip side to XMLReader is XMLWriter. Have you ever wanted a simple and intuitive way to create XML documents that at the same time ensures the document is well-formed? XMLWriter was created for this specific purpose. It is lightweight and streams the document to the desired output destination, thus keeping memory usage low, just like XMLReader.

<?php
$writer = new XMLWriter();
$writer->openURI('php://output');
$writer->startDocument("1.0");
$writer->startElement("example");
$writer->startElement("specchars");
$writer->text('&');
$writer->endDocument();
$writer->flush();
?>

Which will output:

<?xml version="1.0"?>
<example><specchars>&amp;</specchars></example>

From the output you can see that not only did XMLWriter make sure that the document is structurally sound by properly closing all open elements, but it also took care of properly escaping the data. One of the biggest problems I see people running into is the use of the ampersand character “&”. In its raw state, this character is invalid in XML, so it must be escaped as “&amp;”. Compare the text passed to the text() method with what was actually output. XMLWriter escaped the text, ensuring a well-formed document.
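The same escaping applies to attribute values, and the output does not have to go to a stream. The sketch below, with element and attribute names made up for illustration, builds the document in memory with openMemory() and retrieves it as a string with outputMemory().

```php
<?php
$writer = new XMLWriter();
$writer->openMemory();            // buffer the document instead of streaming it
$writer->startDocument('1.0');
$writer->startElement('example');
$writer->writeAttribute('title', 'Q & A');           // raw ampersand in an attribute
$writer->writeElement('specchars', 'Tom & Jerry');   // raw ampersand in text content
$writer->endElement();
$writer->endDocument();

// Retrieve (and flush) the buffered document as a string.
$xml = $writer->outputMemory();

print $xml;
```

In both places the ampersand reaches the output as &amp;amp;, so the resulting string can be loaded by any XML parser.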

XSL

XSLT support, provided by the XSL extension in PHP 5, allows for the transformation of an XML document into another document. Rather than continuing to support two different extensions, as was the case with PHP 4, both Sablotron and domxml, each of which provided some XSLT support, were moved to PECL. As a replacement, the XSL extension was created, providing the extensive functionality found in Sablotron, as well as the interoperability between DOM and XSL and the transformation speed from domxml. All of this is made possible by the use of the libxslt library.

What really sets this extension apart from its predecessors is its ability to work natively with PHP streams and its extensibility: PHP functions can be registered and then called from a stylesheet during a transformation. Using the article.xml file from Listing 1 and a stylesheet, arttrans.xsl, containing

<?xml version="1.0" encoding="iso-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                                xmlns:php="http://php.net/xsl">
   <xsl:output method="text" />
   <xsl:template match="article">
      <xsl:value-of select="php:function('whoami', string(author))" />
   </xsl:template>
   <xsl:template match="/">
      <xsl:apply-templates select="article"/>
   </xsl:template>
</xsl:stylesheet>

The following code illustrates how a PHP function can be called during transformation:

<?php
function whoami($name) {
   return "I am $name";
}

$doc = new DOMDocument();
$doc->load('article.xml');

$xsl = new DOMDocument();
$xsl->load("arttrans.xsl");

$proc = new XSLTProcessor();
$proc->registerPHPFunctions();

$proc->importStylesheet($xsl);
print $proc->transformToXML($doc);
?>

Which will output:

I am Rob Richards

CONCLUSION

A lot has changed from the old PHP 4 days. XML, while not the answer to everything, is no longer something a developer might only occasionally run into, if ever, during their career. It is now commonplace to work with RSS feeds, web services and/or mashups, and thus to work with XML data. While this is possible using PHP 4, it typically comes at a price. The tools available are either difficult to learn or use, wasting a developer's valuable time that could be put to use elsewhere, or system resource intensive, requiring a good amount of hardware to sustain a decent performance level. In all cases, a company’s bottom line is directly affected.

When dealing with XML based applications, not only does it make financial sense to upgrade to PHP 5, but the new and advanced feature set makes working with XML easy and also allows for richer applications. The interoperability between extensions allows you to leverage the simplicity of SimpleXML to access a document and switch to DOM for more complex operations on the same in-memory document. This document could then be passed to XSL, again without needing to make a copy, and transformed. During transformation, PHP functions could be called that might use XMLReader to locate a specific subtree within an extremely large XML document. This subtree might be expanded into a DOM tree and returned to the calling stylesheet for further processing. Operations like this are just not possible in PHP 4, and should make you seriously think about moving to PHP 5 if you are serious about working with XML.