Scrape Screens with zend-dom

Even in this day-and-age of readily available APIs and RSS/Atom feeds, many sites offer none of them. How do you get at the data in those cases? Through the ancient internet art of screen scraping.

The problem then becomes: how do you get at the data you need in a pile of HTML soup? You could use regular expressions or any of the various string functions in PHP. All of these are easily subject to error, though, and often require some convoluted code to get at the data of interest.

Alternately, you could treat the HTML as XML, and use the DOM extension, which is typically built-in to PHP. Doing so, however, requires more than a passing familiarity with XPath, which is something of a black art.

If you use JavaScript libraries or write CSS fairly often, you may be familiar with CSS selectors, which allow you to target either specific nodes or groups of nodes within an HTML document. These are generally rather intuitive:

jQuery('section.slide h2').each(function (node) {
  alert(node.textContent);
});

What if you could do that with PHP?

Introducing zend-dom

zend-dom provides CSS selector capabilities for PHP, via the Zend\Dom\Query class, including:

  • element types (h2, span, etc.)
  • class attributes (.error, .next, etc.)
  • element identifiers (#nav, #main, etc.)
  • arbitrary element attributes (div[onclick="foo"]), including word matches (div[role~="navigation"]) and substring matches (div[role*="complement"])
  • descendents (div .foo span)

While it does not implement the full spectrum of CSS selectors, it does provide enough to generally allow you to get at the information you need within a page.

Example: retrieving a navigation list

As an example, let’s fetch the navigation list from the Zend\Dom\Query documentation page itself:

use Zend\Dom\Query;

$html = file_get_contents('https://docs.zendframework.com/zend-dom/query/');
$query = new Query($html);
$results = $query->execute('ul.bs-sidenav li a');

printf("Received %d results:\n", count($results));
foreach ($results as $result) {
    printf("- [%s](%s)\n", $result->getAttribute('href'), $result->textContent);
}

The above queries for ul.bs-sidenav li a — in other words, all links within list items of the sidenav unordered list.

When you execute() a query, you are returned a Zend\Dom\NodeList instance, which decorates a DOMNodeList in order to provide features such as Countable, and access to the original query and document. In the example above, we count() the results, and then loop over them.

Each item in the list is a DOMNode, giving you access to any attributes, the text content, and any child elements. In our case, we access the href attribute (the link target), and report the text content (the link text).

The results are:

Received 3 results:
- [#querying-html-and-xml-documents](Querying HTML and XML Documents)
- [#theory-of-operation](Theory of Operation)
- [#methods-available](Methods Available)

Other uses

Another use case is for testing. When you have classes that return HTML, or if you want to execute requests and test the generated output, you often don’t want to test exact contents, but rather look for specific data or fragments within the document.

We provide these capabilities for zend-mvc applications via the zend-test component, which provides a number of CSS selector assertions for use in querying the content returned in your MVC responses. Having these capabilities allows testing for dynamic content as well as static content, providing a number of vectors for ensuring application quality.

Start scraping!

While this post was rather brief, we hope you can appreciate the powerful capabilities of this component! We have used this functionality in a variety of ways, from testing applications to creating feeds based on content differences in web pages, to finding and retrieving image URIs from pages.

Get more information from the zend-dom documentation.

This article originally appeared on the Zend Framework blog at https://framework.zend.com/blog/2017-02-28-zend-dom.html.