Indexing Email Messages with PHP, Zend Lucene and Sphinx

Pack Rat

I tend to be a bit of a pack rat when it comes to email. I like to keep most of the email I receive, apart from the obvious spam, on the off-chance I might need it at some time in the future. The only downside: searching for messages matching specific keywords is often a time-consuming and frustrating process, prone to errors and subject to the limitations of each mail client’s search engine.

By coincidence, I recently tried Sphinx, the open-source indexing system, in another project and thought it would be an interesting experiment to use it to index my email messages. And that’s where this article comes in. Over the next few pages, I’ll run you through the steps I took to index a large email collection with Sphinx. I’ll also try the same thing with another very common text search engine, the Zend Framework’s Lucene implementation, in order to compare the two approaches. So come on in and let’s get started!

Putting The Pieces Together

I’ll assume that you already have an Apache/PHP/MySQL development environment, and jump right into installing the pieces necessary for the scripts in this article. To begin, download and install the PEAR Mail_Mbox (v0.5.1) and Mail_mimeDecode (v1.5.0) packages (and any dependencies) using the PEAR automated installer, as shown below:

shell> pear install Mail_Mbox-beta
shell> pear install Mail_mimeDecode

These packages make it possible to easily parse mailboxes that are stored in the mbox format (the default format used by many email clients, including Mozilla Thunderbird).

The Zend Framework includes a port of the Apache Lucene project that can be used to add full-text search capabilities to any PHP application. This port, known as Zend_Search_Lucene, is included in v1.6 and later of the Zend Framework, and it can be used both to index various document types (including text, HTML and some Microsoft Office 2007 formats) and to perform different types of search queries on the indexed data.

Since Zend_Search_Lucene is implemented as a set of PHP classes, you don’t need to recompile PHP to add support for it. To get it working, simply download and install the Zend Framework (v1.8.4) using the standard installation instructions, making sure to add the final location of the /library directory to your PHP include path.

Sphinx is an open-source full-text indexer and search engine that is in use on some of the Internet’s most content-rich sites, including craigslist.org, mysql.com and wikimapia.org. It integrates directly with MySQL and PostgreSQL, but it can also index other sources of data so long as they are encoded in a specified XML format. It is released as both a set of client API libraries and a set of binary utilities that can be used to perform indexing and search operations.

Sphinx support in PHP comes through PECL’s ext/sphinx extension, which is maintained by Antony Dovgal, and provides an object-oriented API for accessing the Sphinx client API. Although this extension doesn’t support creation of Sphinx indexes (you need the Sphinx utilities for that), it still allows you to execute different types of search queries, group results, and adjust index weights and filters.

To get started with ext/sphinx, first download the Sphinx source archive (v0.9.8.1) and compile the Sphinx binaries and libraries for your system. Assuming you’re on a *NIX system, here’s how:

shell> tar -xzvf sphinx-0.9.8.1.tar.gz
shell> cd sphinx-0.9.8.1
shell> ./configure
shell> make
shell> make install
shell> cd api/libsphinxclient
shell> ./configure
shell> make
shell> make install

Once this is done, download, compile and install the latest revision of the PHP Sphinx extension using the PECL automated installer, as below:

shell> pecl install sphinx

You can also accomplish this manually, by downloading the Sphinx extension source code and installing it with the phpize command:

shell> tar -xzvf sphinx-1.0.0.tar.gz
shell> cd sphinx-1.0.0/
shell> phpize
shell> ./configure
shell> make
shell> make install

Whichever method you choose, you should end up with a loadable PHP module named sphinx.so in your PHP extension directory. You should now enable the extension in the php.ini configuration file, restart your Web server, and check that the extension is enabled with a quick call to phpinfo().

With all the pieces in place, let’s get started with Zend_Search_Lucene. One disclaimer before we start: I’m not an expert in either Lucene or Sphinx, and the examples that follow are based on my own (limited) experience with both tools. It’s quite possible there’s a “better way” of doing the tasks outlined below, so you should research both tools in detail before using them in production applications.

Index Page

There are two components to indexing data with both Zend_Search_Lucene and Sphinx: the indexer and the search engine. The indexer processes a collection of documents and builds an index of their contents; this index can then be searched for matches using the search engine. A document may itself be broken down further into fields; these fields are how the indexer knows what to index.

Zend_Search_Lucene allows users precise control over how each field of a document should be treated. The two basic parameters here are indexing and storage: indexed fields can be used in searches, while stored fields can be displayed in search results. So, for example, Keyword fields are not tokenized, but are indexed and stored as-is within the index, while Text fields are tokenized, indexed and stored. UnStored fields are indexed but not stored, while UnIndexed fields are stored but not indexed. When making a determination as to which field types to use in your Zend_Search_Lucene index, it’s important to have a clear idea of which fields you’ll be using as search criteria, and which fields you plan to display in search results.
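The choice between the tokenized field types can be reduced to the two questions above. Here is a minimal sketch of that decision; pickFieldType() is a hypothetical helper of my own, not part of the Zend Framework, and it only covers the tokenized text case (it ignores Keyword and Binary fields):

```php
<?php
// Hypothetical helper: suggest a Zend_Search_Lucene_Field type for a piece
// of tokenized text, based on how the field will be used.
function pickFieldType($searchable, $displayable) {
  if ($searchable && $displayable) {
    return 'Text';       // tokenized, indexed and stored
  }
  if ($searchable) {
    return 'UnStored';   // tokenized and indexed, but not stored
  }
  if ($displayable) {
    return 'UnIndexed';  // stored, but not indexed
  }
  return null;           // neither searched nor displayed: leave it out
}

// a message body is searched but never displayed in the result list
echo pickFieldType(true, false) . "\n";
```

Under these assumptions, headers such as the subject line map to Text, the message body to UnStored and the mailbox name to UnIndexed.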

Let’s look at how this plays out in a PHP script:

<?php
// turn off execution time limit
ini_set('max_execution_time', 0);

// include class files
include_once 'Mail/Mbox.php';
include_once 'Mail/mimeDecode.php';
include_once 'Zend/Search/Lucene.php';

// set index directory
$index = Zend_Search_Lucene::create('/tmp/index');

// set mail directory path from CLI argument
$mailDir = $argv[1];

// set array for mbox name search/replace
$search = array(
  $mailDir,
  '.sbd',
);

try {
  // recursively process mailbox directory
  $iterator  = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($mailDir)
  );
  foreach ($iterator as $key => $value) {
    // exclude certain file types
    if (preg_match('/\.(msf|dat|html)$/', $key)) {
      continue;
    }

    // exclude certain folders
    if (preg_match('/(Trash|Drafts)$/m', $key)) {
      continue;
    }

    // get folder name
    $folder = str_replace($search, '', $key);

    // open mailbox file
    $mbox = new Mail_Mbox($key);
    $mbox->open();

    // get message count
    // iterate over messages
    $count = $mbox->size();
    echo "$folder - $count \n";
    for($x=0; $x<$count; $x++) {
      // retrieve headers and body
      $message = $mbox->get($x);
      $decode = new Mail_mimeDecode($message, "\r\n");
      $structure = $decode->decode(array('include_bodies' => true));
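      // default to an empty string when a header or the body is missing
      $headers = $structure->headers;
      $subject = isset($headers['subject']) ? $headers['subject'] : '';
      $from = isset($headers['from']) ? $headers['from'] : '';
      $to = isset($headers['to']) ? $headers['to'] : '';
      $cc = isset($headers['cc']) ? $headers['cc'] : '';
      $date = isset($headers['date']) ? $headers['date'] : '';
      $body = isset($structure->body) ? $structure->body : '';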

      // create new document in index
      $doc = new Zend_Search_Lucene_Document();

      // index and store header fields
      $doc->addField(Zend_Search_Lucene_Field::UnIndexed('mbox', $folder));
      $doc->addField(Zend_Search_Lucene_Field::Text('h_date', $date));
      $doc->addField(Zend_Search_Lucene_Field::Text('h_subject', $subject));
      $doc->addField(Zend_Search_Lucene_Field::Text('h_from', $from));
      $doc->addField(Zend_Search_Lucene_Field::Text('h_to', $to));
      $doc->addField(Zend_Search_Lucene_Field::Text('h_cc', $cc));

      // index body
      $doc->addField(Zend_Search_Lucene_Field::UnStored('body', $body));

      // save result to index
      $index->addDocument($doc);
    }

    // close mailbox file
    $mbox->close();
  }
} catch (Exception $e) {
  die('ERROR: ' . $e->getMessage());
}
?>

The first part of this script includes the necessary class files, and sets the directory location for the index files via a call to Zend_Search_Lucene::create(). The mailbox directory is specified as a command-line argument, and retrieved from the special $argv array at run-time. A RecursiveDirectoryIterator then takes care of iterating over this directory and retrieving its contents.

Not all the files in the mailbox directory are mailboxes. Mozilla Thunderbird stores some mailbox data in .dat files, while each mailbox also has its own summary file. These files are filtered out from processing, as are certain specified mailboxes like Trash and Drafts. The remaining mailbox files are then passed to the Mail_Mbox class, which provides methods to parse their contents, count the number of messages and iterate over the message collection. The Mail_mimeDecode class provides additional utility methods to retrieve individual message headers such as From, To and Subject.
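The filtering step can also be pulled out into a small function, which makes it easier to check against a few sample paths. This is a sketch only: isMailboxFile() is a hypothetical helper name, and the extension list is the one used in this article rather than a complete list of Thunderbird’s auxiliary files. Note that the dot is escaped, so that only real file extensions are matched:

```php
<?php
// Hypothetical helper: decide whether a path returned by the directory
// iterator should be treated as an mbox file.
function isMailboxFile($path) {
  // skip summary files (.msf), data files (.dat) and HTML files
  if (preg_match('/\.(msf|dat|html)$/i', $path)) {
    return false;
  }
  // skip the Trash and Drafts mailboxes
  if (preg_match('/(Trash|Drafts)$/', $path)) {
    return false;
  }
  return true;
}
```

With this in place, the loop body could begin with a single check along the lines of if (!isMailboxFile($key)) { continue; }.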

Each message retrieved in this manner is represented as a Zend_Search_Lucene_Document object, and the individual fields of each message represented as Zend_Search_Lucene_Field objects. These fields are indexed and added to the document with the Zend_Search_Lucene_Document::addField() method, and the final document is then added to the index with the Zend_Search_Lucene::addDocument() method.

This script can be run at the command-line by passing it the name of the mailbox directory as an argument:

shell> php lucene-create-index.php /mnt/mail/local

As the index is being built, the script prints the name and message count of each mailbox it processes. Once the process is complete, the generated index files can be found in the index directory (/tmp/index in this example).

Searching For Answers

Indexing is only half the picture, though – you still need a way to query the index for messages matching specific criteria and display these matching results. Zend_Search_Lucene comes with a full-featured search implementation that allows you to do just this.

Here’s a simple example of a script that queries the index created in the previous step:

<?php
// include class files
include_once 'Zend/Search/Lucene.php';

// open index
$index = Zend_Search_Lucene::open('/tmp/index');

// get query from command-line
$queryStr = $argv[1];

// run query
// iterate over list of matches
$hits = $index->find(
  Zend_Search_Lucene_Search_QueryParser::parse($queryStr)
);
echo count($hits) . " hit(s) \n\n";
foreach ($hits as $hit) {
  echo 'Mailbox: ' . $hit->mbox . ", ";
  echo 'Score: ' . $hit->score . "\n";
  echo 'From: ' . $hit->h_from . "\n";
  echo 'To: ' . $hit->h_to . "\n";
  echo 'Subject: ' . $hit->h_subject . "\n";
  echo 'Date: ' . $hit->h_date . "\n";
  echo  "\n";
}
?>

This script first opens a handle to the index created earlier using the Zend_Search_Lucene::open() method. It then reads the search query from the command line and uses the Zend_Search_Lucene object’s find() method to scan the index for matching documents. For each match found, it displays the mailbox name, score, sender and recipient, subject and date – more than enough information to locate the actual message in the mailbox.

Here’s an example of using this script to find all messages containing the string “book”:

shell> php lucene-search-index.php book

When passed multiple search terms, Zend_Search_Lucene will perform a Boolean OR query, returning documents which match any of the terms. Here’s an example, which returns all messages containing the string “river” or “prawn”:

shell> php lucene-search-index.php "river prawn"

To perform a Boolean AND query, separate the query terms with the AND keyword, as below:

shell> php lucene-search-index.php "river AND prawn"

To exclude certain terms from the search query, use the NOT keyword, as below:

shell> php lucene-search-index.php "prawn AND NOT tiger"

By default, Zend_Search_Lucene will search all fields for a match to the query term. You can restrict the search to specific fields by naming them in the query string. Here’s an example which searches for all messages from “roger”:

shell> php lucene-search-index.php h_from:roger

And here’s another one which searches for all messages sent to “harry” on any day other than Thursday and containing the word “place” in their subject line:

shell> php lucene-search-index.php "h_subject:place AND h_to:harry AND NOT h_date:thu"
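If your application collects search criteria from a form rather than a raw query string, field-restricted queries like the ones above can be assembled in code. The following is a rough sketch: buildFieldQuery() is a hypothetical helper, and it assumes each term is a single plain word with no characters that are special to the Lucene query syntax (it performs no escaping):

```php
<?php
// Hypothetical helper: build a Zend_Search_Lucene query string from
// field => term pairs. Terms in $must are joined with AND; terms in
// $mustNot are appended as AND NOT clauses.
function buildFieldQuery($must, $mustNot = array()) {
  $clauses = array();
  foreach ($must as $field => $term) {
    $clauses[] = $field . ':' . $term;
  }
  foreach ($mustNot as $field => $term) {
    $clauses[] = 'NOT ' . $field . ':' . $term;
  }
  return implode(' AND ', $clauses);
}

// prints: h_subject:place AND h_to:harry AND NOT h_date:thu
echo buildFieldQuery(
  array('h_subject' => 'place', 'h_to' => 'harry'),
  array('h_date' => 'thu')
) . "\n";
```

The resulting string can be passed to the search script in exactly the same way as a hand-written query.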

Now, let’s try doing the same thing with Sphinx.

Meeting The Sphinx

Like Zend_Search_Lucene, Sphinx creates its indexes by working its way through a collection of documents, tokenizing and indexing the fields within them. However, unlike Zend_Search_Lucene, Sphinx offers a client API only for search operations and not for index creation operations. Index creation is accomplished using the Sphinx command-line indexer.

The Sphinx indexer comes with built-in support for ‘mysql’ and ‘postgresql’ data sources, and by far the most common way to feed documents to it is via an SQL query: the indexer performs the query, treats each record in the result set as a document, and indexes the corresponding fields. For content that doesn’t come from a database (think email messages), the Sphinx indexer also supports XML, receiving and indexing documents encoded in a particular XML format via the ‘xmlpipe2’ data source. Here’s an example of what this XML format looks like:

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
  <sphinx:document id="1">
    <item>Wing of bat</item>
    <qty>23</qty>
    <price>2.99</price>
    <note>Yum yum</note>
  </sphinx:document>
  <sphinx:document id="2">
    ...
  </sphinx:document>
  ...
</sphinx:docset>

Each document has a unique non-zero numeric identifier, and is composed of both fields and attributes. Fields are indexed for full-text search but not stored in the index, while attributes are stored but not indexed for full-text search. Searches are performed against fields, and attributes are used for filtering or grouping the search results. Currently, Sphinx only permits numeric attributes; however, support for string attributes has been added to Sphinx v0.9.10.

Based on the above, it should be clear that there are a couple of problems you’re likely to encounter when using Sphinx to index email content:

  • Sphinx requires each document to have a unique numeric ID. However, email messages in mailbox format do not have unique numeric IDs.
  • Sphinx currently allows only numeric attributes to be stored in Sphinx indexes and returned in search results. Key message headers, such as sender, recipient and subject, are strings and therefore cannot be stored as Sphinx document attributes or displayed in Sphinx search results.

To solve the above problems, the Sphinx mailing list suggests the use of a database table to “tie together” email messages and Sphinx document IDs. Under this approach, the database stores a copy of the key headers of each message (not the body) under a unique record ID; this ID is the same ID fed to Sphinx when indexing the message. When a Sphinx search is performed, it returns a list of matching document IDs; a simple SQL query can then be used to look up these IDs in the database and retrieve the corresponding message headers.

X Marks The Spot

The first step is to create the database table that will store message headers. Start up the MySQL command-line client and use the following SQL to create the table:

CREATE TABLE IF NOT EXISTS mail (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  mbox varchar(255) NOT NULL,
  h_from varchar(255) NOT NULL,
  h_to varchar(255) NOT NULL,
  h_subject varchar(255) NOT NULL,
  h_date varchar(255) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;

Sphinx’s indexing and searching behaviour is controlled via the Sphinx configuration file. A fully-commented version of this file is included in the Sphinx distribution and installed to /usr/local/etc/sphinx.conf by default. This file defines the data source, the index characteristics, and the configuration parameters for the Sphinx search daemon (searchd). Open this file in your text editor and fill it with the following directives:

#############################################################################
## data source definition
#############################################################################

source src
{
  type = xmlpipe2
  xmlpipe_command = php /usr/local/apache2/htdocs/sphinx-create-index.php /mnt/mail/local/

  # xmlpipe2 field declaration
  xmlpipe_field = h_from
  xmlpipe_field = h_to
  xmlpipe_field = h_cc
  xmlpipe_field = h_subject
  xmlpipe_field = h_date
  xmlpipe_field = mbox
  xmlpipe_field = body
}

#############################################################################
## index definition
#############################################################################

index mail
{
  # document source(s) to index
  source = src

  # index files path and file name, without extension
  path = /tmp/sphinx/index

  # a list of morphology preprocessors to apply
  morphology = none

  # charset encoding type
  charset_type = utf-8

  # whether to strip HTML tags from incoming documents
  html_strip = 1
}

#############################################################################
## searchd settings
#############################################################################

searchd
{
  # searchd TCP port number
  port = 3312

  # log file, searchd run info is logged here
  log = /usr/local/var/log/searchd.log

  # query log file, all search queries are logged here
  query_log = /usr/local/var/log/query.log

  # client read timeout, seconds
  read_timeout = 5

  # maximum amount of children to fork (concurrent searches to run)
  max_children = 30

  # PID file, searchd process ID file name
  pid_file = /tmp/searchd.pid

  # max amount of matches the daemon ever keeps in RAM, per-index
  max_matches = 1000

  # seamless rotate, prevents rotate stalls if precaching huge datasets
  seamless_rotate = 1

  # whether to forcibly preopen all indexes on startup
  preopen_indexes = 0

  # whether to unlink .old index copies on successful rotation.
  unlink_old = 1
}

# --eof--

Here’s a quick explanation of the main sections of this file:

  • The 'source' section defines the data source – in this case, XML data generated by the 'xmlpipe_command' and piped through the 'xmlpipe2' data source. The fields to be indexed by Sphinx are also specified in this section using 'xmlpipe_field' declarations.
  • The 'index' section defines the characteristics of the Sphinx index that will be built. This includes the index path on the local file system, the character set, stopwords, exclusions, boundary characters, and many other attributes that allow precise control over the indexer’s behaviour (the Sphinx documentation has a complete list).
  • The 'searchd' section controls the behaviour of the Sphinx search server, used when performing searches. This includes information on the server address, port, log file and memory consumption. Most of the time, if your server is running on the same machine, you can leave this section at its default values.

Here’s the PHP script that takes care of parsing the mailboxes and representing them in XML such that they can be indexed by Sphinx. It also takes care of inserting each message’s headers into the MySQL database.

<?php
// turn off execution time limit
ini_set('max_execution_time', 0);
include_once 'Mail/Mbox.php';
include_once 'Mail/mimeDecode.php';

// set mail directory path
$mailDir = $argv[1];

// set array for mbox name search/replace
$search = array(
  $mailDir,
  '.sbd',
);

try {
  // open MySQL connection
  $connection = mysqli_connect('localhost', 'test', 'secret', 'test') or die ("ERROR: Cannot connect");

  // create prepared statement
  $sql = "INSERT INTO mail (id, mbox, h_from, h_to, h_subject, h_date) VALUES (?, ?, ?, ?, ?, ?)";
  $stmt = mysqli_prepare($connection, $sql);

  // bind parameters to statement
  mysqli_stmt_bind_param($stmt, 'isssss', $ctr, $folder, $from, $to, $subject, $date);

  // create XML document
  $dom = new DOMDocument('1.0', 'utf-8');

  // create root element
  $root = $dom->createElement('sphinx:docset');
  $dom->appendChild($root);

  // recursively process mailbox directory
  $iterator  = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($mailDir)
  );  

  $ctr = 1;
  foreach ($iterator as $key => $value) {
    // exclude certain file types
    if (preg_match('/\.(msf|dat|html)$/', $key)) {
      continue;
    }

    // exclude certain folders
    if (preg_match('/(Trash|Drafts)$/m', $key)) {
      continue;
    }

    // get folder name
    $folder = str_replace($search, '', $key);

    // open mailbox file
    $mbox = new Mail_Mbox($key);
    $mbox->open();

    // get message count
    // iterate over messages
    $count = $mbox->size();
    for($x=0; $x<$count; $x++) {
      // retrieve headers and body
      $message = $mbox->get($x);
      $decode = new Mail_mimeDecode($message, "\r\n");
      $structure = $decode->decode(array('include_bodies' => true));
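      // default to an empty string when a header or the body is missing
      $headers = $structure->headers;
      $subject = isset($headers['subject']) ? $headers['subject'] : '';
      $from = isset($headers['from']) ? $headers['from'] : '';
      $to = isset($headers['to']) ? $headers['to'] : '';
      $cc = isset($headers['cc']) ? $headers['cc'] : '';
      $date = isset($headers['date']) ? $headers['date'] : '';
      $body = isset($structure->body) ? $structure->body : '';
      $body = str_replace('&nbsp;', '', $body);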

      // save metadata to MySQL
      mysqli_stmt_execute($stmt);

      // add nodes to XML output
      $document = $dom->createElement('sphinx:document');
      $document->setAttribute('id', $ctr);
      $root->appendChild($document);
      $elemMbox = $dom->createElement('mbox', htmlspecialchars($folder, ENT_QUOTES, 'UTF-8'));
      $document->appendChild($elemMbox);
      $elemFrom = $dom->createElement('h_from', htmlspecialchars(utf8_encode($from), ENT_QUOTES, 'UTF-8'));
      $document->appendChild($elemFrom);
      $elemTo = $dom->createElement('h_to', htmlspecialchars(utf8_encode($to), ENT_QUOTES, 'UTF-8'));
      $document->appendChild($elemTo);
      $elemCc = $dom->createElement('h_cc', htmlspecialchars(utf8_encode($cc), ENT_QUOTES, 'UTF-8'));
      $document->appendChild($elemCc);
      $elemDate = $dom->createElement('h_date', htmlspecialchars($date, ENT_QUOTES, 'UTF-8'));
      $document->appendChild($elemDate);
      $elemSubject = $dom->createElement('h_subject', htmlspecialchars(utf8_encode($subject), ENT_QUOTES, 'UTF-8'));
      $document->appendChild($elemSubject);
      $elemBody = $dom->createElement('body');
      $cdataBody = $dom->createCDATASection(utf8_encode($body));
      $document->appendChild($elemBody);
      $elemBody->appendChild($cdataBody);

      // increment document counter
      $ctr++;
    }

    // close mailbox file
    $mbox->close();
  }

  // close MySQL connection
  mysqli_stmt_close($stmt);
  mysqli_close($connection);  

  // dump XML output
  echo $dom->saveXML();
} catch (Exception $e) {
  die('ERROR: ' . $e->getMessage());
}
?>

The first part of this script includes the necessary class files, reads the mailbox directory from the command line, and initializes a RecursiveDirectoryIterator to iterate over the directory tree and retrieve its contents. As before, unnecessary mailboxes and files are filtered out, and the Mail_Mbox and Mail_mimeDecode classes are then used to retrieve the messages in each mailbox.
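One limitation worth noting: both indexing scripts read only the top-level body of each decoded message, so multipart messages (for example, mail with attachments or an HTML alternative) may contribute little or no body text to the index. Here is a hedged sketch of one way around this. collectText() is a hypothetical helper, and it assumes the decoded structure exposes the parts, ctype_primary and body properties, which is my understanding of what Mail_mimeDecode::decode() returns when 'include_bodies' is enabled:

```php
<?php
// Hypothetical helper: collect the text of a decoded message, descending
// into multipart containers and ignoring non-text parts.
function collectText($part) {
  $text = '';

  // a multipart container: gather text from each child part
  if (!empty($part->parts)) {
    foreach ($part->parts as $child) {
      $text .= collectText($child);
    }
    return $text;
  }

  // a leaf part: keep it only if it is some kind of text
  $primary = isset($part->ctype_primary) ? strtolower($part->ctype_primary) : 'text';
  if ($primary == 'text' && isset($part->body)) {
    $text .= $part->body . "\n";
  }
  return $text;
}
```

In the scripts in this article, the assignment to $body could then be replaced with a call such as $body = collectText($structure);.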

Each message retrieved in this manner is represented as a <sphinx:document> element, with child elements corresponding to the individual message headers and body. PHP’s DOM extension is used to dynamically generate these XML elements, and combine them all into a single well-formed XML document, suitable for feeding to the Sphinx indexer via the xmlpipe2 data source. PHP’s MySQLi extension takes care of opening a connection to the database and inserting message data into the database using a prepared statement. Each message is assigned a unique numeric ID in the XML; this ID corresponds to the message’s record ID in the MySQL database.

For large message archives, the above process can become very memory-intensive, so consider indexing the mailboxes in batches (always remembering to update the starting document ID to ensure there is no duplication of records) and/or first writing the XML output to a file and then piping the file contents to xmlpipe2. You could also consider generating the XML using PHP’s xmlwriter extension, as shown in this Sphinx example script.
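To give a rough idea of what the XMLWriter approach might look like, here is a minimal sketch. It is not the Sphinx example script mentioned above: writeDocset(), the $emit callback and the layout of the $messages array are names I have made up for illustration. Each document is handed to the callback as soon as it is complete, and the function returns the next unused document ID so that it can be passed in as the starting ID for the next batch:

```php
<?php
// Sketch only: stream an xmlpipe2 document set with XMLWriter.
// $messages is an array of field => value arrays, $emit is a callback
// that receives chunks of XML, $startId is the first document ID to use.
function writeDocset($messages, $emit, $startId = 1) {
  $writer = new XMLWriter();
  $writer->openMemory();
  $writer->startDocument('1.0', 'UTF-8');
  $writer->startElement('sphinx:docset');

  $id = $startId;
  foreach ($messages as $fields) {
    $writer->startElement('sphinx:document');
    $writer->writeAttribute('id', (string) $id);
    foreach ($fields as $name => $value) {
      // writeElement() escapes characters such as & and <
      $writer->writeElement($name, (string) $value);
    }
    $writer->endElement();

    // hand over what has been buffered so far and empty the buffer,
    // so memory use does not grow with the size of the archive
    call_user_func($emit, $writer->outputMemory(true));
    $id++;
  }

  $writer->endElement();
  $writer->endDocument();
  call_user_func($emit, $writer->outputMemory(true));

  // next unused document ID, for use as $startId of the following batch
  return $id;
}
```

When the script is run through xmlpipe_command, the callback could simply echo each chunk it receives.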

All that’s left now is to start up the Sphinx indexer and get it started with indexing the email archive:

shell> /usr/local/bin/indexer mail

You can test your index with the search command-line utility:

shell> /usr/local/bin/search prague

Notice that Sphinx only returns the document IDs of all matching documents – you still need to go to the database to get the message headers.

While a full discussion of Sphinx search syntax is beyond the scope of this article (see the Sphinx documentation for details), it’s worthwhile mentioning the -b option, which switches the search utility into Boolean mode and allows you to use logical operators such as AND, OR and NOT in search queries. Here’s an example of searching for documents containing both “prague” and “london”:

shell> /usr/local/bin/search -b "prague & london"

And here’s an example of searching for documents containing “prague” or “london”:

shell> /usr/local/bin/search -b "prague | london"

Now, let’s try doing the same thing with PHP.

Match Point

When it comes to querying your index from an application, Sphinx offers a client-server architecture. Queries are sent to the Sphinx search daemon (searchd), which takes care of searching the index for matches and returning results to the requesting client.

To see this in action, drop to your shell and start up the search daemon:

shell> /usr/local/bin/searchd

The Sphinx search daemon should start up using the configuration directives specified in the sphinx.conf configuration file.

The Sphinx PECL extension provides a full-featured client API that allows developers to send queries to the Sphinx search daemon and retrieve responses, in much the same way as the command-line client does. These results can then be integrated with an SQL query to present detailed information for each matched document. Here’s an example:

<?php
try {
  // create Sphinx client
  $sphx = new SphinxClient();

  // set server and search parameters
  $sphx->setServer('localhost', 3312);
  $sphx->setLimits(0, 200, 1000);
  $sphx->setMatchMode(SPH_MATCH_ANY);

  // get and run query from command-line
  $queryStr = $argv[1];
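  $result = $sphx->query($queryStr, 'mail');
  if ($result === false) {
    die('ERROR: ' . $sphx->getLastError());
  }
  echo $result['total_found'] . " hit(s) \n\n";

  // stop here if there are no matches to look up
  if (empty($result['matches'])) {
    exit;
  }

  // get document IDs
  $ids = implode(',', array_keys($result['matches']));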

  // open MySQL connection
  $connection = mysqli_connect('localhost', 'test', 'secret', 'test') or die ("ERROR: Cannot connect");

  // read records matching document IDs
  $sql = "SELECT id, mbox, h_from, h_to, h_subject, h_date FROM mail WHERE id IN ($ids)";
  $rs = mysqli_query($connection, $sql) or die ("ERROR: " . mysqli_error($connection) . " (query was $sql)");
  // print records
  if (mysqli_num_rows($rs) > 0) {
   $ctr = 1;
   while($hit = mysqli_fetch_object($rs)) {
    echo $ctr . '. Mailbox: ' . $hit->mbox . ", ";
    echo 'From: ' . $hit->h_from . "\n";
    echo 'To: ' . $hit->h_to . "\n";
    echo 'Subject: ' . $hit->h_subject . "\n";
    echo 'Date: ' . $hit->h_date . "\n";
    echo  "\n";
    $ctr++;
   }
  }

  // close MySQL connection
  mysqli_free_result($rs);
  mysqli_close($connection);
} catch (Exception $e) {
  die('ERROR: ' . $e->getMessage());
}
?>

This script begins by initializing a new SphinxClient() object, and then uses the object’s setServer() method to point it to the running searchd instance. The setLimits() method works like a LIMIT clause to specify the offset and number of results to be returned (here, the first 200 results), and the setMatchMode() method specifies how matching should occur. Valid values for setMatchMode() include:

  • SPH_MATCH_ALL – match all query terms
  • SPH_MATCH_ANY – match any query terms
  • SPH_MATCH_PHRASE – match query term as a phrase
  • SPH_MATCH_BOOLEAN – match query term in Boolean mode
  • SPH_MATCH_EXTENDED – match query term in extended mode
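Since setLimits() takes an offset and a count, paging through results is a matter of simple arithmetic. Here is a small sketch of that calculation. pageLimits() is a hypothetical helper, and the default cap of 1000 is taken from the max_matches value used in this article’s configuration; it assumes that Sphinx will not return results beyond that cap:

```php
<?php
// Hypothetical helper: turn a 1-based page number into the offset and
// limit arguments for SphinxClient::setLimits(), staying within
// $maxMatches. Returns null when the page lies beyond the cap.
function pageLimits($page, $perPage, $maxMatches = 1000) {
  $page = max(1, (int) $page);
  $offset = ($page - 1) * $perPage;
  if ($offset >= $maxMatches) {
    return null;
  }
  $limit = min($perPage, $maxMatches - $offset);
  return array($offset, $limit);
}
```

The returned pair could then be passed on with something like $sphx->setLimits($limits[0], $limits[1], 1000).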

The query term itself is retrieved from the command line using the special $argv array, and used to execute a query with the query() method. The return value of this method is an array containing result statistics and a set of matches (as defined by the setLimits() method). Here’s an example of what this array looks like:

Array
(
    [error] =>
    [warning] =>
    [status] => 0
    [fields] => Array
        (
            [0] => h_from
            [1] => h_to
            [2] => h_cc
            [3] => h_subject
            [4] => h_date
            [5] => mbox
            [6] => body
        )

    [attrs] => Array
        (
        )

    [matches] => Array
        (
            [22] => Array
                (
                    [weight] => 1
                    [attrs] => Array
                        (
                        )

                )

            [24] => Array
                (
                    [weight] => 1
                    [attrs] => Array
                        (
                        )

                )

             ...

    [total] => 156
    [total_found] => 156
    [time] => 0
    [words] => Array
        (
            [book] => Array
                (
                    [docs] => 156
                    [hits] => 351
                )

        )
)

It’s now reasonably easy to iterate over this array, build a list of the matching document IDs, and interpolate them into an SQL query that retrieves the mailbox name, sender and recipient, subject and date from the MySQL database – more than enough information to locate the actual message.
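One detail to keep in mind when joining the two result sets: a WHERE ... IN (...) clause does not guarantee that rows come back in the order Sphinx ranked them. The sketch below shows one way to build the ID list defensively and to restore the Sphinx order afterwards. Both function names are hypothetical, and the rows are assumed to be associative arrays with an 'id' key (as returned by, say, mysqli_fetch_assoc()):

```php
<?php
// Build a comma-separated list of integer IDs for use in an IN (...)
// clause. Returns an empty string when there are no matches.
function matchIdList($result) {
  if (empty($result['matches'])) {
    return '';
  }
  return implode(',', array_map('intval', array_keys($result['matches'])));
}

// Re-order database rows so that they follow the order in which
// Sphinx returned the matching document IDs.
function orderRowsByMatches($rows, $result) {
  $matches = isset($result['matches']) ? $result['matches'] : array();

  // index the rows by their ID for quick lookup
  $byId = array();
  foreach ($rows as $row) {
    $byId[(int) $row['id']] = $row;
  }

  // walk the matches in Sphinx order, skipping IDs missing from the database
  $ordered = array();
  foreach (array_keys($matches) as $id) {
    if (isset($byId[$id])) {
      $ordered[] = $byId[$id];
    }
  }
  return $ordered;
}
```

An empty ID list is a signal to skip the SQL query altogether, since IN () with no values is not valid SQL.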

To see this in action, try running a search for “book”, as below:

shell> php sphinx-search-index.php book

To perform a Boolean search, change the argument passed to setMatchMode() to SPH_MATCH_BOOLEAN, and then try a query like this:

shell> php sphinx-search-index.php "prague & london"

As these examples illustrate, both Sphinx and Zend_Search_Lucene make it possible to create full-text indexes of textual content, and then build applications to query these indexes and display matching results. Try them out the next time you have a project that requires powerful search capabilities, and see what you think!

Published: August 19th, 2009 at 4:58
Categories: Zend Framework

4 comments to “Indexing Email Messages with PHP, Zend Lucene and Sphinx”

Apache Solr (http://lucene.apache.org/solr) is a search index server similar to sphinx with a lot of great features – sorry I can’t give a full comparison to sphinx since I’ve never used sphinx myself. It runs as a web application inside of a servlet container such as tomcat.

And of course, the shameless plug for the Solr PHP client that I’m involved with: http://code.google.com/p/solr-php-client

Hi Vikram,
Thanks for the great post.
I posted a short note about your post on my blog http://pro100pro.com/using-sphinx-lucene-with-zendframework-for-parse-emails

Does it work with Joomla?

It would be interesting if you could comment on the relative performance differences between the two. For instance, did Lucene use up loads of memory and take a long time to return hits – or was it as fast as Solr? How many emails/documents were in the index?