Indexing Web Content with PHP and SWISH-E


A Crowded Marketplace

In a previous article, I demonstrated how to build a simple email search system in PHP using two indexing tools, Sphinx and the Zend Framework’s Lucene implementation. While these are undoubtedly two of the most popular tools available, open source is all about choice…and so it should come as no surprise that a number of alternatives exist, many of them equaling Sphinx and Zend_Lucene in sophistication and speed.

This article deals with one such alternative, SWISH-E aka the Simple Web Indexing System for Humans – Enhanced. As the name suggests, SWISH-E is particularly good at indexing Web content, be it in text, HTML, XML, PDF or DOC format. If you’re trying to add a full-text search engine to your Web site, but don’t really want to spend too much time on configuration and data processing, this might just be the thing you’re looking for. Come on in, and find out more!

Building Blocks

Let’s start at the beginning. What’s SWISH-E all about, anyway?

Like Sphinx and Zend_Lucene, SWISH-E is an open source project that is capable of indexing a wide variety of data sources, and then permitting fast, focused searches on the indexed data. It is available in both binary executable format for Windows platforms and source format for *NIX platforms. It can directly read and index data in plain text, XML and HTML formats, and it comes with a full-featured search utility that allows Boolean and wildcard searches, phrase searches and context searches.

This “format flexibility” can significantly reduce the time spent putting the indexer to work on a document collection. There are a couple of reasons for this. First, a large number of open source tools already exist on the *NIX platform to extract sensible textual content from non-ASCII formats like PDF and DOC, and the SWISH-E indexer can directly read the output of these tools, without requiring any special configuration. Second, the SWISH-E release package comes with a variety of scripts that can easily be extended to support other custom formats, as well as full-fledged examples of how to apply the indexer to real-world situations.

The engine does have a couple of limitations. It currently doesn’t support multi-byte character sets, and it may not always be as fast as Sphinx at performing complex searches. If multi-byte support or raw search speed is essential for your application, you’ll probably be better off with either Sphinx or Zend_Lucene.

SWISH-E support in PHP comes through PECL’s ext/swish extension, which is maintained by Wez Furlong and Antony Dovgal, and provides an object-oriented API for accessing SWISH-E indices. Although this extension doesn’t support creating indices (SWISH-E includes a command-line utility for this), it can still be used for executing different types of queries and presenting the results in various formats.

This article assumes that your development system has the following tools and libraries in place:

Of these, the first two components are required by the SWISH-E build process, and the remainder are needed to run the examples in this article. In case you don’t already have these, you can download and install them from their official sites using the links above; installation instructions are included in the respective source code archives.

Once all this is in place, download the SWISH-E source archive (v2.4.7) and compile the SWISH-E binaries and libraries for your system. On a *NIX system, here’s how:

By default, SWISH-E should be installed to your /usr/local/* directory tree.

Once this is done, download, compile and install the latest revision of the PHP SWISH-E extension (v0.4.0-beta) using the PECL automated installer, as below:

You can also accomplish this manually, by downloading the SWISH-E extension source code and installing it with the phpize command:

Whichever method you choose, you should end up with a loadable PHP module in your PHP extension directory. You should now enable the extension in the php.ini configuration file, restart your Web server, and check that the extension is enabled with a quick call to phpinfo():
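If you’d rather check from the command line, a small sketch like the one below reports whether the extension is available. It assumes the module was enabled with a php.ini line along the lines of extension=swish.so (the exact file name may differ on your platform), and it relies only on PHP’s standard extension_loaded() and class_exists() functions. The helper name swish_available() is made up for this example.

```php
<?php
// Report whether the PECL swish extension is usable in this PHP build.
// swish_available() is a hypothetical helper name used for illustration.
function swish_available()
{
    return extension_loaded('swish') && class_exists('Swish');
}

echo swish_available()
    ? "swish extension is loaded\n"
    : "swish extension is NOT loaded; check php.ini\n";
```

Run it with the same PHP binary and php.ini that your Web server uses, since the command-line and server configurations can differ.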

Manual Labor

Let’s start with a basic example that illustrates SWISH-E usage from the command line. SWISH-E’s command-line indexer can read source data from either the local file system (“fs”), an HTTP connection (“http”) or from a custom program (“prog”). The specific details of which files or Web sites should be indexed may be specified either on the command line or, more conventionally, in a configuration file.

To illustrate how this works, consider a simple example: indexing a local copy of the PHP manual. In your development directory, create a simple configuration file named swish-e.conf, and fill it with the following directives:
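A minimal sketch of such a configuration file might look like the following. The ./php-manual path is a placeholder for wherever you unpacked the manual, and the stemming mode name is an assumption based on the SWISH-E 2.4 documentation; check the SWISH-CONFIG page for the modes your build supports.

```
# swish-e.conf: index a local copy of the PHP manual
# (./php-manual is a placeholder; point it at your own copy)
IndexDir ./php-manual

# Only index HTML files
IndexOnly .html .htm

# Store words in their stemmed form
FuzzyIndexingMode Stemming_en1
```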

Here, the IndexDir directive tells SWISH-E the root directory to start indexing, while the IndexOnly directive specifies which file types should be indexed. It’s also possible to use regular expression patterns with the FileMatch and FileRules directives, to specify additional files to be included or excluded. Finally, the FuzzyIndexingMode directive specifies the type of index to create; in this case, an index with words normalized to their “stem” form so that searches for “migrate”, “migrating” and “migration” all return the same result.

To create an index using this configuration, invoke the SWISH-E command-line indexer with the -c option to name the configuration file and the -S option to select the access method (for example, swish-e -c swish-e.conf -S fs):

Here’s an example of what you should see as the indexer runs:

The output of the indexing process is two files, named index.swish-e and index.swish-e.prop and saved to the current working directory by default. Once indexing is complete, you can search these generated index files by specifying a search query with the -w command-line parameter. Here’s an example, which searches the index for the string “i18n”:

Here’s an example of the output:

As this output illustrates, the search utility automatically ranks and orders matches before returning them, with the highest-ranked matches appearing at the top. The path, title and file size of each matched document are also included in the result set.

While a full discussion of SWISH-E search syntax is beyond the scope of this article (see the SWISH-E documentation for details), it’s worthwhile demonstrating how logical operators such as AND, OR and NOT can be used for more precise search queries. Here’s an example of searching for documents containing both “apache” and “windows”:

Here’s what the output might look like:

And here’s an example searching for documents containing “apache” or “iis”:

Here’s what the output might look like:

Next, let’s do the same thing with PHP.

Match Point

The SWISH-E PECL extension provides a full-featured client API that allows developers to query a SWISH-E index and retrieve a list of matches, in much the same way as the command-line client does. These results can then be formatted to present detailed information for each matched document. Here’s an example:
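A minimal sketch of such a script is shown below. It assumes an index file named index.swish-e in the current directory and uses the extension’s documented Swish, SwishResults and SwishException classes; the script name search.php and the format_hit() helper are assumptions made for illustration.

```php
<?php
// search.php: minimal command-line search client (sketch).
// Usage: php search.php "apache AND windows"

// Format one match as a single line of text (hypothetical helper).
function format_hit($rank, $path, $title, $size)
{
    return sprintf('%4d  %s  "%s"  (%d bytes)', $rank, $path, $title, $size);
}

// Read the query term from the command line, if one was given.
$query = isset($argv[1]) ? $argv[1] : '';

if ($query !== '' && extension_loaded('swish')) {
    try {
        // index.swish-e is assumed to sit in the current directory.
        $swish   = new Swish('index.swish-e');
        $results = $swish->query($query);

        echo 'Found ', $results->hits, " match(es)\n";

        // Walk the result set; each item is a SwishResult object.
        while ($result = $results->nextResult()) {
            echo format_hit(
                $result->swishrank,
                $result->swishdocpath,
                $result->swishtitle,
                $result->swishdocsize
            ), "\n";
        }
    } catch (SwishException $e) {
        echo 'Search failed: ', $e->getMessage(), "\n";
    }
}
```

Wrapping the calls in a try/catch block is a deliberate choice here: the extension reports problems such as a missing index file or a malformed query by throwing a SwishException.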

Try this out with a simple command line query, as below:

Here’s an example of the output:

This script begins by initializing a new Swish() object, with the index file name as constructor argument, and then retrieving the query term from the command line using the special $argv array. The query is then executed with the query() method. The return value of this method is a SwishResults object, whose properties hold result and index statistics. Here’s an example of what this object looks like:

It’s now possible to iterate over this SwishResults object and retrieve the individual matches as SwishResult objects. Each of these objects, in turn, exposes properties corresponding to the various properties of the source document, such as the document title, rank, URL, last modified time and size in bytes. Here’s an example of what one such SwishResult object looks like:

Be Prepared

The PECL extension also includes support for prepared search objects, which come in handy when you need to perform multiple queries within the same script. Using a prepared object is typically more efficient than executing the query() method multiple times. To illustrate how this works, consider the following rewrite of the previous script:
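One way this could look is sketched below: a single SwishSearch object is prepared once and then executed with several different query strings. The index file name and the list of queries are placeholders chosen for this example.

```php
<?php
// Run several queries through one prepared SwishSearch object (sketch).
// The index name and the query list are placeholders.
$queries = array('apache AND windows', 'apache OR iis', 'i18n');

if (extension_loaded('swish')) {
    try {
        $swish  = new Swish('index.swish-e');
        $search = $swish->prepare();   // no query bound yet

        foreach ($queries as $query) {
            // The same search object is reused for every query.
            $results = $search->execute($query);
            printf("%-22s %d match(es)\n", $query, $results->hits);
        }
    } catch (SwishException $e) {
        echo 'Search failed: ', $e->getMessage(), "\n";
    }
}
```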

Another advantage of using a prepared search object is that it becomes possible to exert some measure of control over the result set generated, via the following three methods exposed by the search object:

  • The setLimit() method allows you to specify an allowed range of values for the various properties of a match. Only those matches falling within the specified range are included in the result set. For example, you could specify that the result set should include only those documents that were modified between June 1 and June 30, 2009.
  • The setSort() method allows you to define the property by which results should be sorted. For example, you could re-sort the result set by path or document size, in ascending or descending order.
  • The setStructure() method, which is only applicable to HTML documents, allows you to control which parts of an HTML document are searched. For example, you could specify that only the document title and meta-information should be searched, rather than the entire body. A number of flags can be used with this method, as listed below (note that flags can be combined using PHP’s bitwise operators):
    • Swish::IN_TITLE – search the <title> element
    • Swish::IN_HEAD – search the <head> element
    • Swish::IN_BODY – search the <body> element
    • Swish::IN_COMMENTS – search <!-- --> comment elements
    • Swish::IN_HEADER – search <h1>, <h2>… header elements
    • Swish::IN_EMPHASIZED – search <em> elements
    • Swish::IN_META – search <meta> elements
    • Swish::IN_ALL – search the entire document (default)

Here’s an example which demonstrates these methods in action, by restricting the results to documents that were modified in the last 24 hours and whose title begins with a letter between ‘e’ and ‘o’, and then further sorting the result by document size, in descending order:
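A sketch of how this might be written follows. It assumes the standard ‘swishlastmodified’, ‘swishtitle’ and ‘swishdocsize’ properties are present in the index, and it uses the extension’s setLimit() and setSort() methods; the modified_window() helper is a made-up name. How titles that sit exactly on the ‘e’ and ‘o’ boundaries are treated depends on SWISH-E’s own string comparison, so treat the title range as approximate.

```php
<?php
// Build the [low, high] timestamp pair for "modified in the last N hours".
// Values are returned as strings because setLimit() takes string bounds.
function modified_window($hours, $now)
{
    return array((string) ($now - $hours * 3600), (string) $now);
}

$query = isset($argv[1]) ? $argv[1] : '';

if ($query !== '' && extension_loaded('swish')) {
    try {
        $swish  = new Swish('index.swish-e');
        $search = $swish->prepare($query);

        // Limit 1: last-modified time within the past 24 hours.
        list($low, $high) = modified_window(24, time());
        $search->setLimit('swishlastmodified', $low, $high);

        // Limit 2: title between 'e' and 'o' (ANDed with limit 1).
        $search->setLimit('swishtitle', 'e', 'o');

        // Largest documents first.
        $search->setSort('swishdocsize desc');

        $results = $search->execute();
        while ($result = $results->nextResult()) {
            printf("%8d  %s\n", $result->swishdocsize, $result->swishdocpath);
        }
    } catch (SwishException $e) {
        echo 'Search failed: ', $e->getMessage(), "\n";
    }
}
```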

Here’s an example of the output:

Note that if you call setLimit() more than once, the limits specified are combined together using Boolean AND. You can turn things back to their default values by calling resetLimit() as needed.

This next example restricts searches only to the document title:

Here’s an example of the output:

Spider, Spider, On The Wall

One of SWISH-E’s unique features is its built-in “spidering” capability, which significantly reduces the complexity involved in indexing and searching Web site content. While it is possible to use the indexer to directly spider a Web site, a more efficient solution is to use the included Perl script and pipe the output of this script to the indexer through the “prog” access method.

To illustrate how this works, specify the root URL of the Web site to be spidered in the configuration file, as shown below (note that you can spider multiple sites simply by specifying multiple space-separated URLs):
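A configuration along these lines should do the job. The URL is a placeholder, and spider.pl is the spider script that ships with SWISH-E; you may need to give its full path if the indexer can’t find it.

```
# swish-e.conf: spider a Web site through the "prog" access method
# spider.pl ships with SWISH-E; use a full path if it is not found
IndexDir spider.pl

# "default" selects the spider's built-in settings; the URL is a placeholder
SwishProgParameters default http://www.example.com/
```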

Here, the IndexDir directive tells the indexer the name of the program to use for spidering, while the SwishProgParameters directive specifies additional parameters to be passed to the spider. Note the special “default” parameter, which tells the script to use its default settings.

Save these changes, and then run the indexer from the command line, remembering to specify the “prog” access method with -S prog:

Here’s an example of what you will see as indexing takes place:

One of the advantages of using the default settings when spidering Web sites is that if the spider encounters binary files, such as PDF or DOC files, it will automatically read those files and pipe their content to the indexer as well. This is an easy way to include these non-ASCII formats in your index. Note, however, that this only works if the requisite translator program (for example, pdftotext for PDF files or catdoc for DOC files) is already installed on the system.

Once you’ve got the index created, it’s not very difficult to build a PHP-based Web interface to search it for specified terms. Here’s what the code looks like:
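The sketch below shows one way to put such a page together. It assumes the index file is named index.swish-e and lives alongside the script, and that the spider stored each document’s URL in the ‘swishdocpath’ property; the h() and render_hit() helpers and the “q” form field are names invented for this example.

```php
<?php
// Minimal Web search form (sketch).

// Escape text for safe inclusion in HTML. ISO-8859-1 is used because
// SWISH-E works with single-byte character sets.
function h($text)
{
    return htmlspecialchars((string) $text, ENT_QUOTES, 'ISO-8859-1');
}

// Render one match as a hyperlinked list item; fall back to the URL
// when the document has no title.
function render_hit($url, $title)
{
    $label = ($title !== '' && $title !== null) ? $title : $url;
    return '<li><a href="' . h($url) . '">' . h($label) . '</a></li>';
}

$q = isset($_GET['q']) ? trim($_GET['q']) : '';

echo '<form method="get" action="">';
echo '<input type="text" name="q" value="', h($q), '" />';
echo '<input type="submit" value="Search" />';
echo '</form>';

if ($q !== '' && extension_loaded('swish')) {
    try {
        $swish   = new Swish('index.swish-e');
        $results = $swish->query($q);

        echo '<p>', (int) $results->hits, ' result(s)</p><ul>';
        while ($result = $results->nextResult()) {
            echo render_hit($result->swishdocpath, $result->swishtitle);
        }
        echo '</ul>';
    } catch (SwishException $e) {
        echo '<p>Search failed: ', h($e->getMessage()), '</p>';
    }
}
```

Every value that comes from the user or from the index is escaped before it is printed, since page titles and URLs collected by a spider are not under your control.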

Nothing too complicated here. The script sets up a simple input form for search terms and, on submission, creates and executes a query on the index using the SWISH-E API. As before, the SwishResults object is processed in a loop, and the result set is formatted and presented as an HTML list. Since the indexer stores both the document title and URL in the index, it’s quite easy to hyperlink each result to its original source URL as well.

Here’s an example of what the result might look like:

What’s In A Name?

Like Sphinx and Zend_Lucene, SWISH-E also lets you define your own index fields, and use these to further filter search results. These fields, called “metanames” in SWISH-E lingo, are specified in the indexer’s configuration file, and correspond to element names in the XML or HTML source documents.

Typically, to add metanames to a SWISH-E index, you would specify them as a space-separated list in the indexer’s configuration file, as below:
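For the three elements discussed next, the directive would look something like this:

```
# Index these XML elements as separate, searchable metanames
MetaNames id name genre
```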

Assuming your XML document looks something like this, the indexer will then separately index the contents of the <id>, <name> and <genre> elements:

You can then use these metanames within search queries, by prefixing the search term with the metaname and an equals sign. For example, the query genre=dance searches for documents containing the string “dance” in the <genre> element.

You can continue to use logical operators and wildcards as usual with metanames, as below:

To illustrate how this works in practice, let’s consider a slightly more complex example: indexing email. Email messages are a good example of how metaname indexing can help in search result filtering, because they contain a number of structured attributes, such as the recipient name, sender name, subject and date. I’ll assume here that you have an email message archive stored in the mbox format (the default format used by many email clients, including Mozilla Thunderbird), and that you already have the PEAR Mail_Mbox and Mail_mimeDecode packages installed.

To begin, let’s create a PHP script that will parse a set of mailboxes, and return the messages contained therein as structured XML:

The first part of this script includes the necessary class files, and retrieves the location of the mailbox directory from the command-line arguments supplied to the script via the special $argv array. A RecursiveDirectoryIterator then takes care of iterating over this directory and retrieving its contents.

Not all the files in the mailbox directory are necessarily mailboxes. Mozilla Thunderbird stores some mailbox data in .dat files, while each mailbox also has its own summary file. These files are filtered out from processing, as are certain specified mailboxes like Trash and Drafts. The remaining mailbox files are then passed to the Mail_Mbox class, which provides methods to parse their contents, count the number of messages and iterate over the message collection. The Mail_mimeDecode class provides additional utility methods to retrieve individual message headers such as From, To and Subject.

Each message retrieved in this manner is represented as an XML document, with the individual fields of each message represented as XML elements. This document is printed to the standard output device, prefixed with three headers specifying the document path, size in bytes and last modified time. These headers are used by the indexer, and correspond to the ‘swishdocpath’, ‘swishdocsize’ and ‘swishlastmodified’ properties respectively. Make sure that the document size specified in the Content-Length header is accurate, as the indexer uses this to determine the start and end of the document content, and will fail with errors if it’s not able to locate the next set of headers at the specified ending byte value.
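To make the header format concrete, here is a sketch of just the output step (not the full mailbox parser). It wraps a set of message fields in XML and prefixes the Path-Name, Content-Length and Last-Mtime headers that the “prog” access method reads. The function names, the <message> root element, the sample data, and every element name other than <title> and <h_subject> are assumptions made for illustration.

```php
<?php
// Escape a value for use as XML element content (single-byte safe).
function xml_text($value)
{
    return htmlspecialchars((string) $value, ENT_QUOTES, 'ISO-8859-1');
}

// Wrap one message in the header block the "prog" method expects.
// $fields maps element names to values, e.g. array('title' => '...').
function swish_prog_document($path, $mtime, array $fields)
{
    $xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<message>\n";
    foreach ($fields as $name => $value) {
        $xml .= "  <$name>" . xml_text($value) . "</$name>\n";
    }
    $xml .= "</message>\n";

    // Content-Length must be the exact byte count of the XML body;
    // strlen() counts bytes, which is what the indexer needs.
    return "Path-Name: $path\n"
         . 'Content-Length: ' . strlen($xml) . "\n"
         . "Last-Mtime: $mtime\n"
         . "\n"
         . $xml;
}

// Made-up sample message, purely to show the output shape.
echo swish_prog_document('Travel/42', 1246000000, array(
    'title'     => 'Trip to London',
    'h_subject' => 'Trip to London',
    'body'      => 'Tickets & hotel are booked.',
));
```

Computing the length from the finished XML string, after escaping, keeps the header and the body in step even when a message contains characters such as & or <.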

This script can be run at the command-line by passing it the name of the mailbox directory as argument:

Here’s an example of one such XML document generated by this script:

Thus, the output of the previous script is a series of XML documents, representing the messages to be indexed. The indexer can read and index this output, once it’s correctly configured. Let’s do that next, by updating the indexer’s configuration file with the following directives:
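A configuration along the following lines would fit the description. The script name mail2xml.php, the mailbox path, and the metanames other than title and h_subject are placeholders invented for this sketch; the script must also be executable (for example, via a #! line) for the indexer to run it. XML* selects the libxml2-based parser, so use XML instead if your build lacks libxml2.

```
# swish-e.conf: index the XML stream produced by the mailbox script
# (mail2xml.php and the mailbox path are placeholder names)
IndexDir ./mail2xml.php
SwishProgParameters /path/to/mail

# Treat every incoming document as XML
IndexContents XML* .xml
DefaultContents XML*

# Elements to index as metanames (mostly assumed names)
MetaNames title h_from h_to h_subject h_date mailbox

# Use <title> as swishtitle and keep 40 characters of the subject
PropertyNameAlias swishtitle title
StoreDescription XML* <h_subject> 40
```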

As explained earlier, the MetaNames directive specifies the XML elements to index. The PropertyNameAlias directive tells the indexer to use the <title> element for the internal ‘swishtitle’ property; this property is commonly displayed in the search results generated by SWISH-E. Similarly, the StoreDescription directive tells the indexer to store 40 characters of the <h_subject> XML element as the internal ‘swishdescription’ property. The IndexContents and DefaultContents directives tell the indexer to treat the source documents as XML.

Save these changes, and then run the indexer from the command line, remembering to specify the “prog” access method with -S prog:

Here’s an example of what you will see as indexing takes place:

Once the index has been created, you can run a few test queries on it. For example, here’s the result of a search for messages containing the word “recipe” in their subject line:

And here’s an example of a search for messages containing the word “london” in the message body, with the search limited to the “Travel” mailbox:

Mail Run

It’s now also quite easy to build a PHP-based search interface to this index. Here’s some illustrative code:

This script creates an input form, with fields for the user to enter date, recipient and subject search terms. It then formulates this input into a metaname-based query string, prepares a search object, and executes the query. The results are displayed as a list, sorted by date.
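The query-building step is the interesting part, so here is a sketch that concentrates on it and leaves the HTML form out. The form field names, the metanames (h_date, h_to, h_subject) and the build_meta_query() helper are all assumptions; substitute whatever names your own index uses.

```php
<?php
// Turn form input into a SWISH-E metaname query such as
//   h_to=(alice) AND h_subject=(recipe)
// $map pairs form field names with metanames; both are assumptions.
function build_meta_query(array $input, array $map)
{
    $clauses = array();
    foreach ($map as $field => $metaname) {
        $value = isset($input[$field]) ? $input[$field] : '';
        // Strip characters with special meaning in the query syntax,
        // then collapse any runs of whitespace left behind.
        $value = preg_replace('/[()="]+/', ' ', $value);
        $value = trim(preg_replace('/\s+/', ' ', $value));
        if ($value !== '') {
            $clauses[] = $metaname . '=(' . $value . ')';
        }
    }
    return implode(' AND ', $clauses);
}

$map   = array('date' => 'h_date', 'to' => 'h_to', 'subject' => 'h_subject');
$query = build_meta_query($_GET, $map);

if ($query !== '' && extension_loaded('swish')) {
    try {
        $swish  = new Swish('index.swish-e');
        $search = $swish->prepare($query);
        $search->setSort('swishlastmodified desc');   // newest first

        $results = $search->execute();
        while ($result = $results->nextResult()) {
            echo gmdate('Y-m-d', $result->swishlastmodified), '  ',
                 htmlspecialchars($result->swishtitle, ENT_QUOTES, 'ISO-8859-1'),
                 "<br />\n";
        }
    } catch (SwishException $e) {
        echo 'Search failed: ', htmlspecialchars($e->getMessage()), "\n";
    }
}
```

Note that words such as “and”, “or” and “not” typed into a field are still treated as operators by SWISH-E; a production form would probably want to filter or quote them.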

Here’s what the search form looks like:

And here’s an example of the search results generated by this script:

As these examples illustrate, SWISH-E provides a convenient, easy to configure toolkit for indexing and searching content, be it in text, HTML or XML format. SWISH-E’s built-in support for Web spidering, its flexible XML input format, and its support for third-party converters can all significantly reduce the time and effort required to create full-text indices for a Web site or application. And when it comes to searching these indices, PECL’s ext/swish provides a full-featured API that can be used to build both simple and complex search interfaces.

I hope you found this article interesting, and that it encouraged you to give SWISH-E a try the next time you’re in the market for a quick-and-dirty full-text search system. Until next time…happy coding!