XML and PHP 5

July 31, 2007

Unless this is your first exposure to PHP, it is probably safe to assume that you have seen the PHP 5 release announcements, or have at least heard that XML support was given a complete overhaul. I am still surprised by how many people I talk to are still developing with PHP 4 and actively working with XML, rather than upgrading to PHP 5 to use the new XML tool sets. I guess the thought of such an upgrade might be a bit scary. Both ext/xslt and ext/domxml have been removed from the core PHP distribution and replaced with completely new extensions, so upgrading typically means reviewing and rewriting any older code that relies on either of them.

What most people don’t realize is that upgrading the code is really not all that difficult. In fact, the benefits of doing so, even for those not yet developing with XML or services, are significant. When an upgrade can provide a developer with simpler tool sets, faster performance and better utilization of system resources, business-wise, it just makes sense to upgrade, as it ultimately affects both productivity and the bottom line for the company. Personally, the benefits from the new XML tools were so great that I went into production on a few company servers, hosting some of our XML based applications, while PHP 5.1 was still in beta.

BACKGROUND

As of the PHP 5.2.x releases, there are a good number of extensions for working with XML data in a variety of ways. These can be broken down into the following categories: tree based, streaming, event based and transformation. Tree based parsers, encompassing ext/dom and ext/simplexml, work on an XML document in-memory allowing it to be manipulated in any manner. Streaming APIs, such as ext/xmlreader and ext/xmlwriter, allow for XML to be read or written to/from PHP streams, resulting in very low memory usage but providing very focused and uni-directional XML support. Event based parsers, such as the familiar ext/xml from PHP 4, view an XML document as events that are fired off as different portions of the document are accessed. Lastly, the XSL extension provides support for transforming an XML document to another XML document.

The goal for XML support in PHP 5 was not only to provide a solid base of tool sets for working with XML, but also to provide some unity amongst the tool sets themselves. The libxml2 and libxslt libraries, from the GNOME project, were chosen as the foundation for the XML support. Not only are these some of the fastest libraries, but because they work together, it became possible for different XML based extensions to inter-operate with little to no additional overhead. For example, under PHP 4, one might build an XML data tree using domxml and perform XSL transformations using the xslt extension. To pass the data along, the XML tree from ext/domxml would first need to be serialized to either a string or a file before ext/xslt could use it. Working with XML is already time and memory consuming enough. These additional required steps just compound the problem and ultimately decrease the number of users your hardware can serve at a time. In short, it can cost your company money, not only because of the time developers spend trying to optimize code to alleviate the situation, but also because the additional overhead will most likely lead to a need for better, and possibly additional, hardware.

Things are a bit different using XML in PHP 5. In a similar scenario, you could use one of the new native extensions, like DOM or SimpleXML, to manipulate the XML data tree. This tree can then be passed directly to and used by the XSL extension without incurring any additional overhead. There is no serialization nor copying involved; ext/xsl is able to work directly with the XML tree it is given, resulting in a significant improvement in use of system resources as compared to coding a similar task in PHP 4. So now that you have an idea why XML support got a face lift, you might like to know about the different tools available to you.
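The hand-off described above can be sketched in a few lines. This is a minimal illustration rather than a definitive recipe: the document and the stylesheet are supplied inline as strings purely for the example, and the `lang` attribute is an arbitrary change made to show that both extensions see the same tree.

```php
<?php
// Illustrative document; its content mirrors the article.xml used later on.
$sxe = simplexml_load_string(
    '<article><name>XML in PHP 5</name><author>Rob Richards</author></article>'
);

// Change a value through SimpleXML...
$sxe->name = 'XML and PHP 5';

// ...then switch to DOM. dom_import_simplexml() returns a DOMElement that
// wraps the same underlying node, so nothing is serialized or copied.
$article = dom_import_simplexml($sxe);
$article->setAttribute('lang', 'en');

// The attribute set through DOM is visible from SimpleXML as well.
$lang = (string) $sxe['lang'];

// A small stylesheet (an assumption for this sketch) that renders plain text.
$xsl = new DOMDocument();
$xsl->loadXML(
    '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
  . '<xsl:output method="text"/>'
  . '<xsl:template match="/article">'
  . '<xsl:value-of select="name"/><xsl:text> by </xsl:text><xsl:value-of select="author"/>'
  . '</xsl:template>'
  . '</xsl:stylesheet>'
);

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);

// Hand the in-memory tree straight to ext/xsl.
$result = $proc->transformToXML($article->ownerDocument);
print $result . "\n";
```

The only conversion step is the call to dom_import_simplexml(), which changes the API used to view the tree rather than the tree itself.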

DOM

The XML revolution in PHP began with ext/dom. The domxml extension in PHP 4 was plagued with a number of problems, one of the worst being that the API was constantly changing until the 4.3.x releases, when the extension finally conformed more closely to the W3C DOM specifications. Granted, ext/domxml did not officially make it out of experimental status until 4.3.10, a fact that should make anyone still actively using domxml think twice. In any event, prior to the 4.3 releases, it was always a question whether upgrading to a newer version of PHP would break existing code relying on domxml. Memory leaks and high memory usage were other problems that seemed to dog ext/domxml. While many of these issues were addressed to some degree over the years, taken as a whole they were too great to fix while also maintaining backwards compatibility. The extension was therefore rewritten from scratch, with all prior problems and issues addressed from the start, and finally emerged as ext/dom in PHP 5. A major plus for the new extension is that it has conformed to the W3C DOM specifications from the beginning, allowing developers coming from other languages to start writing DOM code without having to learn a new API.

Earlier on, I had mentioned that it is often not too difficult to migrate code from ext/domxml to use ext/dom. The biggest catch here is that the code must have been written using the PHP 4.3.x W3C DOM conforming functions. Using the document in Listing 1 as the base XML document, take a look at the PHP 4 code in Listing 2, using ext/domxml, and the PHP 5 code in Listing 3, using ext/dom.

Listing 1

<?xml version="1.0"?>
<article>
   <name>XML in PHP 5</name>
   <author>Rob Richards</author>
</article>

Listing 2

<?php
$doc = domxml_open_file('article.xml');
$root = $doc->document_element();
$node = $root->first_child();

while ($node) {
   if (($node->node_type() == XML_ELEMENT_NODE) &&
       ($node->node_name() == 'name')) {
      $content = $node->first_child();
      $output = $content->node_value();
      print "Output: $output\n";
      break;
   }
   $node = $node->next_sibling();
}
?>

Listing 3

<?php
$doc = new DOMDocument();
$doc->load('article.xml');
$root = $doc->documentElement;
$node = $root->firstChild;

while ($node) {
   if (($node->nodeType == XML_ELEMENT_NODE) &&
       ($node->nodeName == 'name')) {
      $content = $node->firstChild;
      $output = $content->nodeValue;
      print "Output: $output\n";
      break;
   }
   $node = $node->nextSibling;
}
?>

Comparing the code from the two listings, you should notice that the code is quite similar. The majority of code I have seen using ext/domxml in fact only uses a handful of functions. Upgrading the code to ext/dom, assuming the W3C based functions were used, typically involves a simple removal of the underscore, a change to camelCase, and a quick check to determine whether the function is now a property. For instance, $node->next_sibling() in ext/domxml becomes $node->nextSibling in ext/dom. I have found that around 75% of the cases I run into fall into this category. Once a developer gets comfortable making these changes, and with the aid of some search/replace tools, even large ext/domxml based code sets can be upgraded in under a day.
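Once the code runs on ext/dom, a manual sibling walk like the one in Listing 3 can often be shortened. The following is a minimal sketch, assuming the Listing 1 document is supplied inline as a string instead of being loaded from article.xml. Note that getElementsByTagName() searches all descendants, so unlike the loop it would also match name elements nested deeper in the tree.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<article><name>XML in PHP 5</name><author>Rob Richards</author></article>'
);

// Collect every <name> element in document order.
$names = $doc->getElementsByTagName('name');

// Take the first match, if there is one.
$output = ($names->length > 0) ? $names->item(0)->nodeValue : null;

print "Output: $output\n";
```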

SimpleXML

DOM allows a developer to access and manipulate XML in any way needed, but it comes at a price. It is a large and complex API that requires a developer to understand many of the intricate details of working with XML, and it in fact scares many beginners away from working with XML at all. SimpleXML aims to break through the XML complexities and provide an intuitive and simple (hence the name) API for working with a document.

The vast majority of people working with XML are really only concerned with elements having simple content and maybe the occasional attribute. Instead of trying to wrap your head around trees, nodes and the plethora of methods just to work with this small subset, SimpleXML takes an easier approach and views a document as an object. Elements are represented as properties and attributes as accessors. Using the document from Listing 1, SimpleXML can perform the same task as the DOM code from Listing 3, yet in a much clearer and more compact syntax, as demonstrated in the following example:

<?php
$sxe = simplexml_load_file('article.xml');
print "Output: ".$sxe->name."\n";
?>

What took around 20 lines of code using ext/dom can be performed in 2 lines with SimpleXML.
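To illustrate "elements as properties and attributes as accessors" a little further, here is a small sketch. The articles document and its id attributes are made up for this example and are not part of Listing 1. Child elements with the same name can be iterated with foreach, and attributes are read with array syntax.

```php
<?php
// Hypothetical document with repeated elements and an attribute.
$sxe = simplexml_load_string(
    '<articles>'
  . '<article id="1"><name>XML in PHP 5</name></article>'
  . '<article id="2"><name>SimpleXML Basics</name></article>'
  . '</articles>'
);

$lines = array();
foreach ($sxe->article as $article) {
    // $article['id'] reads the attribute, $article->name the child element.
    // The casts turn the SimpleXMLElement objects into plain strings.
    $lines[] = (string) $article['id'] . ': ' . (string) $article->name;
}

print implode("\n", $lines) . "\n";
```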

This extension is also a blessing for developers consuming REST based services. In only a few lines of code a service can be called and the results accessed. In the following example, SimpleXML is used to query the Yahoo! Search web search service for the first matching result. Once the call has been made, the title and URL of the match are output.

<?php
$terms = urlencode('php 5 xml new');
$url = 'http://api.search.yahoo.com/WebSearchService/V1/webSearch';
$query = '?appid=demo&query='.$terms.'&results=1';

$serviceurl = $url.$query;

$results = simplexml_load_file($serviceurl);

print $results->Result->Title."\n";
print $results->Result->DisplayUrl."\n";
?>

Which will output:

Zend Developer Zone | XML in PHP 5 - What's New?
www.zend.com/php5/articles/php5-xmlphp.php

XMLReader

Working with XML is typically memory intensive and slow. For these reasons, when only read-only access to a document is required, ext/xml, the event based parser, has usually been a developer's first choice of API. Because only portions of a document reside in memory at a time, and events are fired off as the document is read, system resource usage is minimal, while the document can be accessed as it is parsed rather than after the entire document has been parsed. In order to reap these benefits, a developer needs to deal with the limitations of ext/xml: the document cannot be validated as it is parsed, the developer needs to understand how to code using callbacks tied to events, and there is only limited namespace support. The XMLReader extension, in my opinion, is the ultimate replacement for ext/xml. You get the same benefits that ext/xml has to offer, but none of the drawbacks.

Working with XMLReader couldn’t be any simpler. There is no need to deal with callbacks and mapping functions. You as the developer are in control of the forward movement through the document and of what information gets accessed. Simply set the input to be accessed and tell the reader when and where to move. It is so simple that an entire document can be accessed in only a few lines of code.

<?php 
$reader = new XMLReader();
$reader->open('article.xml');

while ($reader->read()) {
   if ($reader->nodeType == XMLReader::ELEMENT) {
      print $reader->localName."\n";
   }
}
?>

Resulting in the following output:

article
name
author

In addition to its simplicity, it also provides advanced features such as document validation via DTD, RelaxNG and XML Schemas. When performing validation in DOM using XML Schemas, you are forced to first load the entire document and then check the validity. With XMLReader, you have the ability to determine the validity of a document while it is being parsed, giving you the option to stop further parsing in the event it fails along the way.
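Validation during parsing might look like the following sketch, which uses DTD validation because the DTD can be embedded in the document itself. The helper name validatesAgainstDtd() and the DTD are assumptions made for this example, and the helper assumes its input is well-formed. For RelaxNG or XML Schema, XMLReader offers setRelaxNGSchema() and setSchema() instead of the VALIDATE parser property.

```php
<?php
// Hypothetical helper for this sketch (not part of ext/xmlreader): returns
// true if $xml is valid against the DTD declared in its own DOCTYPE.
function validatesAgainstDtd($xml)
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $reader = new XMLReader();
    $reader->XML($xml);
    // Must be set after the input is assigned and before the first read().
    $reader->setParserProperty(XMLReader::VALIDATE, true);

    while ($reader->read()) {
        if (!$reader->isValid()) {
            break; // stop parsing as soon as a validity error shows up
        }
    }
    // Some errors only surface on the final read(), so check once more.
    $valid = $reader->isValid();

    $reader->close();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $valid;
}

// An internal DTD requiring both a name and an author.
$dtd = '<!DOCTYPE article ['
     . '<!ELEMENT article (name, author)>'
     . '<!ELEMENT name (#PCDATA)>'
     . '<!ELEMENT author (#PCDATA)>'
     . ']>';

$complete   = $dtd . '<article><name>XML in PHP 5</name><author>Rob Richards</author></article>';
$incomplete = $dtd . '<article><name>XML in PHP 5</name></article>';

var_dump(validatesAgainstDtd($complete));
var_dump(validatesAgainstDtd($incomplete));
```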

Probably the most advanced feature is the ability to retrieve entire subtrees during parsing. As XMLReader is a streaming parser, normally you only have access to the small piece of the document currently in memory. XMLReader also provides the capability to inter-operate with DOM by allowing the subtree at the current location to be expanded into a DOM tree, allowing DOM operations to be performed, such as appending portions of the document into another DOM based document. The expanded subtree is only a copy so there are limitations to this feature.
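A minimal sketch of this inter-operation follows. The articles document and the selected wrapper element are assumptions made for the example. expand() returns a copy of the current subtree as a DOM node, and importNode() is used to bring that copy into the target document before appending it.

```php
<?php
// Hypothetical source document, supplied inline for the example.
$xml = '<articles>'
     . '<article><name>A</name></article>'
     . '<article><name>B</name></article>'
     . '</articles>';

$reader = new XMLReader();
$reader->XML($xml);

// Target DOM document that collects the expanded subtrees.
$target = new DOMDocument();
$target->appendChild($target->createElement('selected'));

while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->localName == 'article') {
        // Expand the subtree at the current position into a DOM node...
        $subtree = $reader->expand();
        // ...and import a deep copy of it into the target document.
        $target->documentElement->appendChild($target->importNode($subtree, true));
    }
}
$reader->close();

$result = $target->saveXML($target->documentElement);
print $result . "\n";
```

Only the subtrees that are actually expanded are built as DOM trees, so the rest of the document is still processed in streaming fashion.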

XMLWriter

The flip side to XMLReader is XMLWriter. Have you ever wanted a simple and intuitive way to create XML documents that at the same time ensures the document is well formed? XMLWriter was created for this specific purpose. It is lightweight and will stream the document to the desired output destination, thus keeping memory usage low just like XMLReader.

<?php
$writer = new XMLWriter();
$writer->openURI('php://output');
$writer->startDocument("1.0");
$writer->startElement("example");
$writer->startElement("specchars");
$writer->text('&');
$writer->endDocument();
$writer->flush();
?>

<?xml version="1.0"?>
<example><specchars>&amp;</specchars></example>

From the output you can see that XMLWriter not only made sure the document is structurally sound by properly closing all open elements, but also took care of properly escaping the data. One of the biggest problems I see people run into is the use of the ampersand character “&”. In its raw state, this character is invalid in XML, so it must be escaped as “&amp;”. Compare the text passed to the text() method with what was actually output. XMLWriter escaped the text, ensuring a well-formed document.
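The same escaping applies to attribute values. The sketch below, using made-up values and an in-memory buffer instead of php://output, writes a fragment and then reads it back with SimpleXML to confirm that the original, unescaped strings survive the round trip.

```php
<?php
// Example values containing characters that need escaping in XML.
$title = 'Tom & Jerry';
$href  = 'search.php?q=xml&page=2';

$writer = new XMLWriter();
$writer->openMemory();                  // buffer in memory rather than streaming
$writer->startElement('link');
$writer->writeAttribute('href', $href); // attribute values are escaped too
$writer->text($title);
$writer->endElement();
$xml = $writer->outputMemory();

print $xml . "\n";

// Parsing the fragment again yields the original strings.
$sxe = simplexml_load_string($xml);
print $sxe['href'] . "\n";
print $sxe . "\n";
```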

XSL

XSLT support, provided by the XSL extension in PHP 5, allows for the transformation of an XML document into another document. PHP 4 had two different extensions offering XSLT support: the Sablotron based ext/xslt, and domxml, which provided some XSLT support of its own. Rather than continuing to support both, they were moved to PECL. As a replacement, the XSL extension was created, providing the extensive functionality found in Sablotron, as well as the interoperability between DOM and XSL and the transformation speed from domxml. All of this is made possible by the libxslt library.

What really sets this extension apart from its predecessors is its ability to work natively with PHP streams, and its extensibility: PHP functions can be registered and then called from a stylesheet during a transformation. Consider the article.xml file from Listing 1 and the following stylesheet, saved as arttrans.xsl:

<?xml version="1.0" encoding="iso-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                                xmlns:php="http://php.net/xsl">
   <xsl:output method="text" />
   <xsl:template match="article">
      <xsl:value-of select="php:function('whoami', string(author))" />
   </xsl:template>
   <xsl:template match="/">
      <xsl:apply-templates select="article"/>
   </xsl:template>
</xsl:stylesheet>

The following code illustrates how a PHP function can be called during transformation:

<?php
function whoami($name) {
   return "I am $name";
}

$doc = new DOMDocument();
$doc->load('article.xml');

$xsl = new DOMDocument();
$xsl->load("arttrans.xsl");

$proc = new XSLTProcessor();
$proc->registerPHPFunctions();

$proc->importStylesheet($xsl);
print $proc->transformToXML($doc);
?>

Which will output:

I am Rob Richards
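Calling registerPHPFunctions() without arguments exposes every PHP function to the stylesheet. The method also accepts a function name (or an array of names) to limit what a stylesheet may call. The following self-contained variant is a sketch under that assumption, with the document and stylesheet supplied inline rather than loaded from article.xml and arttrans.xsl:

```php
<?php
function whoami($name) {
    return "I am $name";
}

$doc = new DOMDocument();
$doc->loadXML(
    '<article><name>XML in PHP 5</name><author>Rob Richards</author></article>'
);

$xsl = new DOMDocument();
$xsl->loadXML(
    '<xsl:stylesheet version="1.0"'
  . ' xmlns:xsl="http://www.w3.org/1999/XSL/Transform"'
  . ' xmlns:php="http://php.net/xsl">'
  . '<xsl:output method="text"/>'
  . '<xsl:template match="/article">'
  . '<xsl:value-of select="php:function(\'whoami\', string(author))"/>'
  . '</xsl:template>'
  . '</xsl:stylesheet>'
);

$proc = new XSLTProcessor();
// Only whoami() may be called from the stylesheet.
$proc->registerPHPFunctions('whoami');
$proc->importStylesheet($xsl);

$result = $proc->transformToXML($doc);
print $result . "\n";
```

Restricting the callable functions is a sensible default whenever stylesheets come from a source you do not fully control.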

CONCLUSION

A lot has changed since the old PHP 4 days. XML, while not the answer to everything, is no longer something a developer might only occasionally run into, if ever, during a career. It is now commonplace to work with RSS feeds, web services and/or mashups, and thus with XML data. While this is possible using PHP 4, it typically comes at a price. The tools available are either difficult to learn or use, wasting a developer's valuable time that could be put to use elsewhere, or system resource intensive, requiring a good amount of hardware to sustain a decent performance level. In all cases, a company’s bottom line is directly affected.

When dealing with XML based applications, not only does it make financial sense to upgrade to PHP 5, but the new and advanced feature set makes working with XML easy and also allows for richer applications. The interoperability between extensions allows you to leverage the simplicity of SimpleXML to access a document and switch to DOM for more complex operations on the same in-memory document. This document could then be passed to XSL, again without needing to make a copy, and transformed. During transformation, PHP functions could be called that might use XMLReader to locate a specific subtree within an extremely large XML document. This subtree might be expanded into a DOM tree and returned to the calling stylesheet for further processing. Operations like this are simply not possible in PHP 4, and should make you think seriously about moving to PHP 5 if you are serious about working with XML.

6 Responses to “XML and PHP 5”

  1. khanrashed110 Says:

    You make a valid point but the reason why I haven’t switched over to PHP 5 is because PHP 6 is already around the corner and is no doubt going to bring about some new changes which I will have to learn about. So I would spend a great deal of time and money learning PHP 5 and then converting my code and as soon as it is done, there will be a similar article to this one somewhere on the Internet telling me that I should ditch PHP 5 and go for PHP 6. To be honest I’m coming to a stage where I don’t enjoy having to learn more things so I’m not enthusiastic about having to learn about the new additions made to PHP 5/6 (I know it’s surprising for a developer to say this). I just spent a whole week going through a simple lesson about XML Data binding (http://www.liquid-technologies.com/Tutorials/XMLDataBinding/Xml-Data-Binding-Tutorial.aspx) which would normally have taken me a day to learn when I was younger. It is because of this reason that I decided to skip a step by not upgrading to PHP 5 and just upgrading to PHP 6 when it becomes more established.

  2. _____anonymous_____ Says:

    I have only thanks. Was stuck, and this totally helped me rewrite PHP4 to 5! Thanks a bunch for having taken the time to put this together.

  3. rrichards Says:

    truethermo:
    That is probably due to the article formatting. It actually outputs &amp; not just &.
    Stu:
    Moving from DOMXML to DOM really does not take that long. I have migrated code bases that large in roughly a day’s worth of work (this was migrating from OOP usage of domxml). You are correct in the fact that testing is time consuming. This, however, would be the case in any significant upgrade of PHP. I haven’t worked for anyone who never underwent a decent QA cycle even when upgrading minor versions.

    I’m not going to argue your points on XSL, as you pretty much are right on the money there. I would say that the amount of changes required to fix/alter stylesheets really depends upon some of the syntax used within. In many cases, though definitely not all, none is required.

    Michelangelo and all:
    Of course there are barriers preventing and/or slowing people from moving from 4 to 5. For those where time is the factor, the cost / benefits needs to be examined. Time spent upfront upgrading can often lead to lower costs down the line (faster development time, fewer hardware resources, etc..). This all depends upon the situation, so I wouldn’t rule it out just because of time without thinking about the other factors.

    Rob

  4. michelangelovandam Says:

    Hi there,

    I want to comment on the question mentioned in the beginning of the article. The reason I stick with PHP4 is that
    1. My hosting provider doesn’t provide PHP 5 (yet).
    2. To convert application libraries takes time, and time is a sparse luxury.

    For my customers I use the full power of PHP 5 and its new XML extensions and I love it. Using these extensions with Zend Framework makes building web applications a piece of cake.

    But now with the EOL announcement of PHP4, we’ll have to upgrade. So if someone has an excellent idea for doubling time per day… let me know, I can use it.

    Yours sincerely,

    Michelangelo van Dam

  5. stuherbert Says:

    This article overlooks the significant cost of upgrading any large (100,000 lines or more) XML & XSL heavy application from PHP 4 to PHP 5 :(

    As the article says, moving from DOMXML to the new DOM extension isn’t difficult. What the article fails to mention is that the main costs come in the amount of time it takes to perform and test the upgrade, plus the need to maintain two separate versions of your application for a while (not everyone of your customers will move to PHP 5 at exactly the same time!), and also the lost money that comes from developers not working on chargeable items because they are performing the upgrade.

    Switching from Sablotron to libxslt is a small change on the PHP side, but a larger change on the XSL side, as Sablotron and libxslt are not 100% compatible with the XSL that they interpret.

    The good news is that, by making libxslt available in PHP 4.3 several years ago, the PHP developers have given businesses every opportunity to minimise their upgrade costs. Folks who find themselves with a lot of XSL to convert from Sablotron to libxslt between now and the end of the year only have themselves to blame for those costs!

    Best regards,
    Stu

  6. truethermo Says:

    The article writes about how the XMLWriter outputs &amp; for &, but the example output is still a simple ampersand.