Advanced Simplicity
• Namespaces
• Searching, Splitting, Recursing
• Edge Conditions
Summary
About the Author
Introduction
"Simplicity of character is no hindrance to the subtlety of intellect."
- John Morley
When people ask me "What is SimpleXML?" I often quip, "XML is the solution to all your problems; SimpleXML ensures it isn't the root of your problems!"
Those of you who have parsed XML with PHP4, or are currently dealing with XML parsing in PHP4, know that it can be very painful to handle documents with any degree of complexity. You either need to use the SAX approach and hand-write a parser for every document, or you need to use the DOM extension, which (in addition to its tendency to crash, leak and generally misbehave under heavy usage) involves the pain of processing documents using an API designed for a heavily object-oriented language and targeted at supporting every single one of XML's idiosyncrasies.
Consider the following small XML
snippet, which describes a small collection of books.
The document has a root node of library, with
a direct child of shelf, which classifies the books as fiction.
The shelf has two children labelled
book: "Of Mice and Men" by John Steinbeck and
"Harry Potter and the Philosopher's Stone" by J.K. Rowling.
<?xml version="1.0"?>
<library>
  <shelf id="fiction">
    <book>
      <title>Of Mice and Men</title>
      <author>John Steinbeck</author>
    </book>
    <book>
      <title>Harry Potter and the Philosopher's Stone</title>
      <author>J.K. Rowling</author>
    </book>
  </shelf>
</library>
The document itself is simple enough: you can see the structure very clearly, and you can understand the path you need to follow to access that information.
Now, before we get into why SimpleXML will change your life, let's first look at how one would parse this document using DOM:
<?php
$doc = new DOMDocument();
$doc->load('library.xml');
$library = $doc->documentElement;
$shelves = $library->childNodes;
foreach ($shelves as $shelf) {
    if ($shelf instanceof DOMElement) {
        process_shelf($shelf);
    }
}
function process_shelf($shelf)
{
    printf("Shelf %s\n", $shelf->getAttribute('id'));
    $books = $shelf->childNodes;
    foreach ($books as $book) {
        if ($book instanceof DOMElement) {
            process_book($book);
        }
    }
}
function process_book($book)
{
    foreach ($book->childNodes as $child) {
        if (! ($child instanceof DOMElement)) {
            continue;
        }
        foreach ($child->childNodes as $element) {
            $content = trim($element->nodeValue);
            switch ($child->tagName) {
                case 'title':
                    printf("Title: %s\n", $content);
                    break;
                case 'author':
                    printf("Author: %s\n", $content);
                    break;
            }
        }
    }
}
?>
As you can see, it takes 47 lines of well-crafted PHP code, with no error checking, to walk the XML file and print out a list of its books. With error checking, comments and the other things you might add in the real world, it could easily take 70-80 lines of code to parse this straightforward, simple XML document.
Contrast the example above with the following piece of code that uses the SimpleXML extension to access the same document, and print out the exact same information.
<?php
$library = simplexml_load_file('library.xml');
foreach ($library->shelf as $shelf) {
    printf("Shelf %s\n", $shelf['id']);
    foreach ($shelf->book as $book) {
        printf("Title: %s\n", $book->title);
        printf("Author: %s\n", $book->author);
    }
}
?>
With SimpleXML, element names are automatically mapped to properties
on an object, and this happens recursively. Attributes are mapped to
array-style accesses on the element, as in $shelf['id']. All of this
happens "on-demand," using Zend Engine 2's new object overloading
features. SimpleXML's "low-fat" approach to XML parsing reduced the
code size of this example from 47 lines of code to a mere 10.
Furthermore, the code is considerably more readable:
instead of using statements like foreach($child->childNodes as $element)
to access the element node of an XML child, you simply reference it by name.
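The mapping described above can be tried without a file on disk. The following is a minimal sketch, assuming the library document is held in a string; the variable names are illustrative and not part of the original example.

```php
<?php
// The library document from above, inlined so the sketch is self-contained.
$xml = <<<XML
<?xml version="1.0"?>
<library>
  <shelf id="fiction">
    <book>
      <title>Of Mice and Men</title>
      <author>John Steinbeck</author>
    </book>
    <book>
      <title>Harry Potter and the Philosopher's Stone</title>
      <author>J.K. Rowling</author>
    </book>
  </shelf>
</library>
XML;

$library = simplexml_load_string($xml);

// Child elements are reached as properties; repeated elements can be indexed.
$firstTitle = (string) $library->shelf->book[0]->title;

// Attributes are reached with array syntax on the element.
$shelfId = (string) $library->shelf['id'];

// Property access returns SimpleXMLElement objects, so cast to string
// when a plain value is needed; count() reports how many siblings matched.
$bookCount = count($library->shelf->book);

printf("Shelf %s holds %d books, starting with %s\n", $shelfId, $bookCount, $firstTitle);
```

The explicit (string) casts matter mostly when values are compared or stored; printf() and echo convert the objects on their own.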
Advanced Simplicity
In a perfect world all XML documents, and the information you needed to extract from them, would be as basic as the example given above. In fact this is true in many cases: configuration files, basic data export, and basic serialization all require parsing capabilities no greater than the above example. There are, however, some cases where the basic functionality listed above simply isn't suitable.
Namespaces
One issue that SimpleXML encountered was XML namespaces. XML documents allow you to hide tags away into a labelled section called a namespace. SimpleXML originally solved namespaces by simply adding another level of indirection:
<?xml version="1.0"?>
<entries xmlns:blog="http://www.edwardbear.org/serendipity/">
  <blog:entry>
    <blog:name>RPROF - Regular Expression Profiler</blog:name>
  </blog:entry>
  <blog:entry>
    <blog:name>Advanced PHP Programming</blog:name>
  </blog:entry>
</entries>
To print out the names of all the different blog entries you
could write the following code:
<?php
$entries = simplexml_load_file('syndic.xml');
foreach ($entries->blog->entry as $entry) {
    printf("%s\n", $entry->name);
}
?>
This approach is flawed, however: the namespace prefix (here, blog)
is just a simple alias with no particular relevance. The significant
portion of a namespace is the
URL (http://www.edwardbear.org/serendipity/),
which is what people who parse XML documents should rely
upon.
Therefore, the approach SimpleXML takes to supporting multiple
namespaces is not to add any changes to the way you access properties,
but rather to give you two methods: attributes() and children(). The
children() method returns all the children of an XML node in a given
namespace, and attributes() does the same for a node's attributes. If
no namespace is passed to children(), all the elements in the global
namespace are returned.
The example given above is properly parsed with the following bit of code:
<?php
$entries = simplexml_load_file('syndic.xml');
foreach ($entries->children('http://www.edwardbear.org/serendipity/') as $entry) {
    printf("%s\n", $entry->name);
}
?>
Note: You may also pass the qualified name to the
children() or attributes() method so they will check for that as well,
but this is not recommended.
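To show children() and attributes() side by side, here is a small sketch that works on an inline copy of the feed. The blog:id attributes are hypothetical additions made for illustration, and the sketch asks for the namespace explicitly at each level to keep every step visible.

```php
<?php
// Namespace URI taken from the example document above.
$ns = 'http://www.edwardbear.org/serendipity/';

// Inline copy of the feed; the blog:id attributes are hypothetical.
$xml = <<<XML
<?xml version="1.0"?>
<entries xmlns:blog="http://www.edwardbear.org/serendipity/">
  <blog:entry blog:id="1">
    <blog:name>RPROF - Regular Expression Profiler</blog:name>
  </blog:entry>
  <blog:entry blog:id="2">
    <blog:name>Advanced PHP Programming</blog:name>
  </blog:entry>
</entries>
XML;

$entries = simplexml_load_string($xml);

$names = [];
$ids = [];
foreach ($entries->children($ns) as $entry) {
    // Elements in the namespace, selected by URI rather than by prefix.
    $names[] = (string) $entry->children($ns)->name;
    // Attributes in the same namespace.
    $ids[] = (string) $entry->attributes($ns)->id;
}

// The prefix form: the second argument marks the first one as a prefix.
// It ties the code to the alias chosen by the document author.
$viaPrefix = count($entries->children('blog', true));

print_r($names);
print_r($ids);
```

Selecting by URI keeps working if the feed author renames the blog prefix, which is the reason the prefix form is discouraged.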
Searching, Splitting, Recursing
The other area where SimpleXML did not really address the needs of people developing XML applications was searching: while it provided a nice way to algorithmically process a document, it didn't provide any features for performing common searches and accesses. For example, how does one access all descendants of a given node? How can you search a document, and find a tag and a value that both match a given condition? There are many common operations on XML documents that are a pain to write by hand, and desperately need simplification.
As a solution to this problem, SimpleXML doesn't re-invent the
wheel, but instead provides the xpath() method, which allows you to perform W3C
standard XPath queries on an XML document. A problem like getting all
descendants of a given node turns into a short, highly optimized XPath
query such as .//*. While the full scope of XPath is well
outside the scope of this document, anyone
serious about processing XML should learn the XPath language,
which is as important to XML as regular
expressions are to plain text.
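As a rough illustration of the two questions raised above (all descendants of a node, and a tag plus value that both match), the following sketch runs two queries against an inline copy of the library document. The query strings are ordinary XPath 1.0; the variable names are illustrative.

```php
<?php
$xml = <<<XML
<?xml version="1.0"?>
<library>
  <shelf id="fiction">
    <book>
      <title>Of Mice and Men</title>
      <author>John Steinbeck</author>
    </book>
    <book>
      <title>Harry Potter and the Philosopher's Stone</title>
      <author>J.K. Rowling</author>
    </book>
  </shelf>
</library>
XML;

$library = simplexml_load_string($xml);

// Every descendant element of the first shelf, whatever its name.
// The leading dot makes the query relative to the node xpath() is called on.
$descendants = $library->shelf[0]->xpath('.//*');

// Titles of the books whose author element has a given value.
$titles = $library->xpath('//book[author="John Steinbeck"]/title');

printf("%d descendant elements\n", count($descendants));
foreach ($titles as $title) {
    printf("Match: %s\n", $title);
}
```

xpath() returns a plain PHP array of SimpleXMLElement objects, so the results can be counted, indexed and iterated like any other array.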
Edge Conditions
While SimpleXML is a great tool for processing XML, its simplicity does come with a few drawbacks. Most notable among these is that processing mixed XML and text content with SimpleXML is very hard. For example, consider the following XML:
<?xml version="1.0"?>
<flaw>
<blurb>
This <italic>is</italic> some sample <bold>text</bold> where
SimpleXML <underline>will</underline> not behave well.
</blurb>
</flaw>
Accessing $document->blurb with print_r() or
var_dump() would return an element iterator that
contained the contents of italic, bold, and underline.
It would not, however, return the text surrounding those elements. This is because when given
the choice between mixed elements and contents, SimpleXML will always choose to return the
elements, and ignore the contents, of a particular tag.
SimpleXML has two solutions to this problem built into the library.
Firstly, a method called asXML() is provided, which
will take the given node and serialize its contents, as well as the
contents of all its children, to either a file or a string.
With the example above, you would call
$document->blurb->asXML() and it would return the full contents
of the blurb node in a format suitable for printing or
further processing.
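A minimal sketch of the asXML() route, assuming the flawed document is loaded from a string instead of a file:

```php
<?php
$xml = <<<XML
<?xml version="1.0"?>
<flaw>
<blurb>
This <italic>is</italic> some sample <bold>text</bold> where
SimpleXML <underline>will</underline> not behave well.
</blurb>
</flaw>
XML;

$document = simplexml_load_string($xml);

// Property access reaches the child elements of blurb...
$italic = (string) $document->blurb->italic;

// ...while asXML() with no argument returns the whole node, surrounding
// text included, as a string. Passing a filename writes to that file instead.
$markup = $document->blurb->asXML();

echo $markup, "\n";
```

The returned string still contains the blurb tags and the inline markup, so it suits further processing (for example with a second parser) rather than direct display.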
The second solution is to bypass SimpleXML for certain portions of your document. One of the explicit design goals of PHP5's XML support was to allow all extensions to interoperate at a minimal cost. Since LibXML2 is the lingua franca of all XML extensions, DOM and SimpleXML objects can be exchanged with zero copies. It's just a different way of viewing the same underlying object! By this method, the DOM extension can "import" SimpleXML objects and use them as DOM objects, and vice versa. When you need to use a DOM feature you can, and when you need SimpleXML's ease of processing, you have that too.
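The exchange between the two extensions goes through dom_import_simplexml() and simplexml_import_dom(). The sketch below uses a shortened, hypothetical version of the blurb document and walks its mixed content through DOM; the lang attribute is added only to show that both views share one tree.

```php
<?php
// Shortened version of the mixed-content example, for illustration only.
$document = simplexml_load_string(
    '<flaw><blurb>This <italic>is</italic> some sample <bold>text</bold>.</blurb></flaw>'
);

// View the SimpleXML node as a DOMElement.
$blurb = dom_import_simplexml($document->blurb);

// DOM exposes text nodes and elements alike, in document order.
$parts = [];
foreach ($blurb->childNodes as $node) {
    if ($node instanceof DOMText) {
        $parts[] = 'text:' . $node->nodeValue;
    } elseif ($node instanceof DOMElement) {
        $parts[] = $node->tagName . ':' . $node->textContent;
    }
}

// A change made through DOM is visible through SimpleXML,
// because both objects refer to the same underlying node.
$blurb->setAttribute('lang', 'en');
$lang = (string) $document->blurb['lang'];

// And back again: a DOM node viewed as a SimpleXML element.
$again = simplexml_import_dom($blurb);
$italic = (string) $again->italic;

print_r($parts);
```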
Summary
PHP5's new XML support was designed as a coherent set of APIs to process and manipulate XML. This includes the DOM extension, which provides all you'll ever need for handling XML, the SAX API for streaming XML parsing, XSLT for XML transformations and SimpleXML when you need to do anything else.
Be fruitful and multiply!
About the Author
Sterling Hughes is a PHP core developer and the chief instigator of the SimpleXML extension for PHP 5. His earlier contributions include the ADT, cURL, XSLT, and Mono extensions. He works as a freelance Web developer, creating dynamic Web applications in PHP, C and Perl, and is also the co-author of PHP Developer's Cookbook.
Sterling can be contacted at sterling@apache.org.

Comments
$node->{"namespace:tag"}
As this would make SimpleXml brilliant to use. Currently I end up with similar amounts of code to DOM when namespaces are involved.
This would have several uses...
Just a thought :>)
When trying to convert this using simplexml_load_string(...) I am getting false/null as result although the string contains definitely legal XML.
I suspect one needs to fiddle with the optional namespace parameter here but the documentation is severely lacking. Could some kind soul feel tempted to explain the optional "namespace" and "is_prefix" parameters of the simplexml_load_string(...) and simplexml_load_file(...) functions? What exactly are these arguments meant for and/or what do they control? What types does one have to pass here? Could someone maybe provide an example?
Michael
<?php
$soap_request_string = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope" xmlns:ns1="urn:Gateway_Proxy" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:enc="http://www.w3.org/2003/05/soap-encoding">
<env:Body>
<ns1:make_proxy_payment env:encodingStyle="http://www.w3.org/2003/05/soap-encoding">
<payment_id>61ecc268-1cd0-f468</payment_id>
<payment_amount>15495</payment_amount>
<callback_query_string>&amp;payment_id=61ecc268-1cd0-f468</callback_query_string>
<transaction_note>Order from Student Library Fees with Payment Id: 61ecc268-1cd0-f468</transaction_note>
</ns1:make_proxy_payment>
</env:Body>
</env:Envelope>
XML;
$xml_element = new SimpleXMLElement($soap_request_string, 0, false, 'http://www.w3.org/2003/05/soap-envelope');
$name_spaces = $xml_element->getNamespaces(true);
//print_r($name_spaces);
foreach ($xml_element->children($name_spaces['env']) as $body)
{
//printf("%s<br />", $body->getName());
foreach ($body->children($name_spaces['ns1']) as $function)
{
printf("%s<br />", $function->getName());
foreach ($function->children() as $parameters)
{
printf("%s => \"%s\"<br />", $parameters->getName(), $parameters);
}
}
}
?>
Example: Let's say that default document's namespace is Z.
In this xml:
<document xmlns:blog="http://www.foobar.com/blog">
<blog:entry author="John">Hello world</blog:entry>
</document>
I'd expect that attribute "author" has the namespace "http://www.foobar.com/blog". However, it has the document's default Z namespace. I need to change this document to:
<document xmlns:blog="http://www.foobar.com/blog">
<blog:entry blog:author="John">Hello world</blog:entry>
</document>
But it doesn't seem clear to me. I think that default attribute's namespace should be the one from the tag, and not for the whole document.
Regards,
--
Andres P. Ferrando
http://www.pruna.com.ar
instead of it this had worked fine:
foreach ($xml->row as $row){
foreach ($row->xpath("*") as $node){
echo($node->getName());
echo ": ";
echo($node);
echo "\n";
}
}