Implementing a Stemming Analyzer for Zend_Search_Lucene

The Zend implementation of Lucene provides a powerful tool set for those looking to implement a Google-like search for their PHP web application. One of the requirements in creating a Google-like search with Zend is the creation of a stemming, stop word filtering, lower-casing analyzer.

This article will briefly discuss the basic role of an analyzer in the Lucene API, my implementation of a new "StandardAnalyzer" for the Zend_Search_Lucene component of the Zend Framework, the inner workings of this analyzer, and its basic usage.

An analyzer in Lucene is an object that helps shape documents before they are inserted into the search index. It is also used to shape queries. For example, suppose a given Wiki decided to use Lucene as its search engine. The Wiki could use the Lucene API to index its documents. In the process of indexing, Lucene would run each document through the analyzer, which can serve a wide range of useful functions.

As an example, in Java's implementation of Lucene there is an analyzer called "StandardAnalyzer". This analyzer takes a document, stems its words, filters out stop words, and converts the remaining words to lowercase. A programmer looking to implement a full-text search would be wise to use this analyzer, because it tends to produce more relevant results. During the indexing process, each document, or in this case each Wiki article, would be placed in the index in its analyzed form (assuming the field is set to be tokenized). So a sentence such as:

Knuth has been called the father of the analysis of algorithms, contributing to the development of, and systematizing formal mathematical techniques for, the rigorous analysis of the computational complexity of algorithms, and in the process popularizing asymptotic notation.

Would end up inside the index as:

knuth ha been call father analysi algorithm contribut develop systemat formal mathemat techniqu rigor analysi comput complex algorithm process popular asymptot notat

Note how words such as "the" and "and" were filtered out, and how words such as "development" have been stemmed to their roots. This is especially useful when it comes to searching. Suppose a user searches for "analyzing computation". If the programmer had not used a stemming analyzer, the search would result in zero hits: to a non-stemming analyzer, the word "analyzing" is not the same as "analysis", and "computation" is not the same as "computational". Yet the query is clearly relevant to the sentence.

Performing a modification such as this is the major role of an analyzer. Each analyzer usually serves a different purpose, and makes use of different filters available to it.
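To make the three steps concrete, here is a minimal, self-contained sketch in plain PHP. It does not use Zend or the StandardAnalyzer, and the function names (crudeStem, analyze) and the short stop word list are made up for illustration. In particular, crudeStem() is a toy suffix stripper, not the Porter algorithm, so its output will not match the Knuth example above.

```php
<?php
// Toy analysis pipeline: tokenize, lower-case, drop stop words, strip suffixes.
// All names here are hypothetical; this is NOT the StandardAnalyzer code.

function crudeStem(string $word): string
{
    // Strip a few common English suffixes, longest first,
    // but only if at least three letters of stem remain.
    foreach (['ations', 'ation', 'ing', 'es', 's'] as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

function analyze(string $text, array $stopWords): array
{
    // Lower-case first, then split on anything that is not a letter or digit.
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $terms = [];
    foreach ($tokens as $token) {
        if (in_array($token, $stopWords, true)) {
            continue; // stop words never reach the index
        }
        $terms[] = crudeStem($token);
    }
    return $terms;
}

$stopWords = ['the', 'and', 'of', 'to', 'for', 'in', 'a'];

// Document text and query text go through the same function.
echo implode(' ', analyze('The Algorithms and the Computations', $stopWords)), "\n"; // algorithm comput
echo implode(' ', analyze('Computing', $stopWords)), "\n";                           // comput
```

Because the document and the query are reduced by the same function, "Computations" and "Computing" end up as the same term, which is what allows a search to match.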

Zend's Lucene port comes with a set of basic analyzers and filters. There are tools for performing the lower-casing and stop word filtering needed for a good search. A PHP version of the StandardAnalyzer is not currently packaged with Zend_Search_Lucene. The rest of this article will discuss my own implementation of one, where to get it, and how to use it.

My implementation of a PHP standard analyzer can be downloaded from the StandardAnalyzer project page. This analyzer is for the English language and it performs the following functions:

  • Word stemming
  • Stop word filtering
  • Lower-casing

The Donald Knuth example above was generated using this analyzer.

The files linked above contain a sample project using my PHP "StandardAnalyzer" (named after its Lucene counterpart). The project is set up in this fashion:

To get this project running, you will want to place the Zend Framework in this folder as well, and have the entire project in your development environment. So you will have something like:

Where it can be accessed by your web server.

The StandardAnalyzer sits alongside the Zend folder, unlike the analyzers already present in Zend/Search/Lucene. This is mainly to leave the framework untouched, and also to leave any modification of the framework up to the fine folks at Zend.

If you access this project through your browser, you will see:

Search Result

The index page just executed a hard-coded search for the word "algorithm" over the index in the data folder (see line 23 of index.php). If you open up index.php to examine what is happening, you will find that the example project is basic and fairly linear in form. If for some reason your index is corrupted (or you simply want to build a new one), you can uncomment line 20:

which will rebuild an index using Zend in the /data folder when the script is run again.

By changing the query $q on line 23, you can search for different words in the index. Try searching for something like "POPULAR" instead. Knuth's hit still comes up. If you type in "Wikipedia" or "Wiki*" (a wildcard search), you will get all five documents in the index, since they all contain the word "Wikipedia". Of course, there are only five documents in the index, so this example project doesn't exactly show Lucene in all of its glory.

A major part of understanding how the StandardAnalyzer works is actually seeing what is inside your Lucene index. This can be accomplished with a very handy Java tool called Luke. You can get a copy of Luke at its project site. Once you start Luke (it's a Java executable), you can open the index. The index is the folder holding the index files, so you want to choose the "data" folder in the Open dialog.

Once the index is open within Luke, you can click on the Documents tab. If you browse to document number 3 and click the "Reconstruct & Edit" button, you will be shown a panel which lets you see the tokenized and untokenized (unanalyzed) versions of that field in the index. The tokenized text gives a very good idea of how the analyzer breaks down, or "shapes", the document.

Lucene Luke

So let's get into the usage of the StandardAnalyzer. In any project where you plan to use the StandardAnalyzer, it is assumed you will be using Zend. Your require_once statements should include the following two lines:

Before you write any code involving search or index building, you will want to set the default Zend Lucene analyzer to the StandardAnalyzer. This can be done with:

After that statement, all index building and index querying uses the StandardAnalyzer. It is important that the same analyzer is used for both of these tasks. If an analyzer that simply lower-cases words is used to search over an index with stemmed words, you can't expect too much success.
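As a rough sketch of the whole flow, the snippet below sets a default analyzer once and then builds and queries an index with it. The Zend calls (setDefault, create, addDocument, commit, find) are part of Zend_Search_Lucene in Zend Framework 1.x. The include path and class name of the analyzer (StandardAnalyzer/Analyzer/Standard/English.php and StandardAnalyzer_Analyzer_Standard_English) are assumptions on my part, as is the 'example-index' folder and the 'body' field name, so check them against the project's Readme.txt before relying on them.

```php
<?php
// Sketch only: assumes Zend Framework 1.x and the StandardAnalyzer folder
// are both on the include path.
require_once 'Zend/Search/Lucene.php';
// ASSUMED path and class name for the analyzer; verify against Readme.txt.
require_once 'StandardAnalyzer/Analyzer/Standard/English.php';

// Set the default analyzer once, before any indexing or searching.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new StandardAnalyzer_Analyzer_Standard_English()
);

// Build a small index; 'example-index' is a hypothetical folder name.
$index = Zend_Search_Lucene::create('example-index');

$doc = new Zend_Search_Lucene_Document();
// Text() fields are stored, indexed and tokenized, so the analyzer runs on them.
$doc->addField(Zend_Search_Lucene_Field::Text(
    'body',
    'Knuth has been called the father of the analysis of algorithms.'
));
$index->addDocument($doc);
$index->commit();

// The query string is parsed with the same default analyzer set above.
$hits = $index->find('algorithm');
foreach ($hits as $hit) {
    echo $hit->score, ' ', $hit->body, "\n";
}
```

The point of the sketch is the ordering: the analyzer is set before the index is created, and nothing changes it between indexing and searching.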

After that step, you are free to experiment and rock the search world with your new analyzer! Take a look at the Readme.txt file in the project, which will give a few more tips on implementation.

Note: This analyzer makes use of some of the pre-existing token filters provided by Zend, and also a 'PorterStemming' class written by Richard Heyes, which is based on the Porter Stemming Algorithm.
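For readers curious how such pieces fit together, here is a hedged sketch of a stemming token filter chained after Zend's own lower-case and stop word filters. It is not the code from the StandardAnalyzer project. The class name Example_TokenFilter_Stem is hypothetical, and the stemmer is passed in as a callable because the stemming class's method names are not shown in this article. The placeholder stemmer below only strips a trailing "s"; in practice you would pass in a real Porter stemmer.

```php
<?php
// Sketch only: assumes Zend Framework 1.x is on the include path.
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Analysis/Analyzer/Common/Text.php';
require_once 'Zend/Search/Lucene/Analysis/TokenFilter/LowerCase.php';
require_once 'Zend/Search/Lucene/Analysis/TokenFilter/StopWords.php';

// Hypothetical filter; wraps any callable that maps a word to its stem.
class Example_TokenFilter_Stem extends Zend_Search_Lucene_Analysis_TokenFilter
{
    private $_stemmer;

    public function __construct($stemmer)
    {
        $this->_stemmer = $stemmer;
    }

    public function normalize(Zend_Search_Lucene_Analysis_Token $srcToken)
    {
        $stemmed = call_user_func($this->_stemmer, $srcToken->getTermText());

        // Keep the original offsets and position so highlighting
        // and phrase queries still line up with the source text.
        $newToken = new Zend_Search_Lucene_Analysis_Token(
            $stemmed,
            $srcToken->getStartOffset(),
            $srcToken->getEndOffset()
        );
        $newToken->setPositionIncrement($srcToken->getPositionIncrement());

        return $newToken;
    }
}

// Placeholder stemmer (NOT Porter): strips a single trailing "s".
$stemmer = function ($word) {
    return preg_replace('/s$/', '', $word);
};

// Filters run in the order they are added: lower-case, stop words, stem.
$analyzer = new Zend_Search_Lucene_Analysis_Analyzer_Common_Text();
$analyzer->addFilter(new Zend_Search_Lucene_Analysis_TokenFilter_LowerCase());
$analyzer->addFilter(new Zend_Search_Lucene_Analysis_TokenFilter_StopWords(
    array('the', 'and', 'of')
));
$analyzer->addFilter(new Example_TokenFilter_Stem($stemmer));

// Print the terms the analyzer would hand to the index.
foreach ($analyzer->tokenize('The Analysis of Algorithms') as $token) {
    echo $token->getTermText(), "\n";
}
```

Lower-casing runs first so that the stop word list only needs lowercase entries, and stemming runs last so that stop words are matched in their original form rather than as stems.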

The project can be downloaded from the PHP Standard analyzer project page. I also have a blog post on this topic.