Roll Your Own Search Engine with Zend_Search_Lucene


Introduction

On several occasions while developing database-driven web applications, I've been approached by clients who want Google-style search implemented at the last minute of the development cycle. Usually this leads to a canned script that crawls the website, or a hacked-up search function that uses the database but returns either too many results or none at all. On top of that, the queries it runs are too numerous or too slow.

Until now, most developers have been forced to use relational databases to power search, install extra component packages, or seek out other non-PHP solutions. The problem with searching through a relational database, for example with MySQL's full-text indexing, is that scalability problems crop up as your search criteria become more complicated.

One of the features that sets the Zend Framework apart from the others is the inclusion of a decent search module. Zend_Search_Lucene is a PHP port of the Apache Lucene project, a full-text search engine framework. Zend_Search_Lucene promises a simple way to add search functionality to an application without requiring additional PHP extensions or even a database.

Zend_Search_Lucene overcomes the usual limitations of relational databases with features such as fast indexing, ranked result sets, a powerful but simple query syntax, and the ability to index multiple fields. Better still, a Zend_Search_Lucene index can live happily alongside your relational database to provide fast searching without storing all of your data twice. In this tutorial, I'll show you how to use Zend_Search_Lucene to index and search some RSS feeds.

Creating the Index

Before you can search your data, you have to create an index. One advantage of a Zend_Search_Lucene index is that it is binary compatible with Lucene version 1.4 and above, meaning you can create indexes with other ports of Lucene and search them with Zend_Search_Lucene. A Zend_Search_Lucene index is made up of Documents--think database records--which have Fields, and Fields hold the content that is searched.

In this example, we want to grab a few PHP-related RSS feeds and index their contents so we can search them. To make things easy, I'll also use the Zend_Feed module so I don't have to deal with all that low-level XML business.

Note: This tutorial uses the 0.1.2 tagged release of the Zend Framework.

<?php

require_once 'Zend/Feed.php';
require_once 'Zend/Search/Lucene.php';

function sanitize($input) {
	return htmlentities(strip_tags( $input ));
}

//create the index
$index = new Zend_Search_Lucene('/tmp/feeds_index', true);

$feeds = Array('http://feeds.feedburner.com/ZendDeveloperZone',
				'http://www.planet-php.net/rss/',
				'http://www.sitepoint.com/blogs/category/php/feed/',
				);

//grab each feed
foreach ($feeds as $feed) {

	$channel = Zend_Feed::import($feed);
	
	echo $channel->title()."\n";
	
	// index each item
	foreach ($channel->items as $item) {
		if ($item->link() && $item->title() && $item->description()) {
            
			$doc = new Zend_Search_Lucene_Document();
		        
			$doc->addField(Zend_Search_Lucene_Field::Keyword('link', 
				sanitize($item->link())));

			$doc->addField(Zend_Search_Lucene_Field::Text('title', 
				sanitize($item->title())));

			$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', 
				sanitize($item->description())));

			echo "\tAdding: ".$item->title()."\n";
			$index->addDocument($doc);
		}
	}
}
$index->commit();

echo $index->count()." Documents indexed.\n";

?>   

The first step after we include the framework files is to actually create the Zend_Search_Lucene object and specify the location to store it. The second parameter indicates that we want to create a fresh index:

//create the index
$index = new Zend_Search_Lucene('/tmp/feeds_index', true);

Next, we specify the RSS feeds we are interested in and fetch them in a loop. Within each feed, we loop through the articles and index each one as a separate Zend_Search_Lucene document. Here is the feed fetching and looping code once again, so you can tell the feed processing apart from the indexing. Note that in these code examples I've omitted most error checking for the sake of clarity.

$feeds = Array('http://feeds.feedburner.com/ZendDeveloperZone',
				'http://www.planet-php.net/rss/',
				'http://www.sitepoint.com/blogs/category/php/feed/',
				);


//grab each feed
foreach ($feeds as $feed) {

	$channel = Zend_Feed::import($feed);
	
	echo $channel->title()."\n";
	
	// index each item
	foreach ($channel->items as $item) {
		if ($item->link() && $item->title() && $item->description()) {
            
			// create and index a Zend_Search_Lucene document

		}
	}
}
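
Since the listings leave out error checking, here is one way the feed-fetching step could be made more forgiving. This is only a sketch, not the code used in this tutorial: importFeeds() is a hypothetical helper, and it assumes that a failed import is reported by throwing an exception.

```php
<?php

/**
 * Hypothetical helper: calls $importer once per URL and collects the
 * results, keyed by URL. A feed whose import throws is reported and skipped
 * so that one unreachable feed does not abort the whole run.
 */
function importFeeds(array $urls, $importer) {
	$channels = array();
	foreach ($urls as $url) {
		try {
			$channels[$url] = call_user_func($importer, $url);
		} catch (Exception $e) {
			echo "Skipping $url: " . $e->getMessage() . "\n";
		}
	}
	return $channels;
}

// With the framework loaded, the importer would be the static method
// used in the listing above:
// $channels = importFeeds($feeds, array('Zend_Feed', 'import'));
```

The importer is passed in as a callback so the loop can be tried out with a stand-in function before pointing it at real feeds.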

To add a document to our index, we create the document object and specify content for the document's fields. Zend_Search_Lucene provides different ways to analyze and store fields depending on how we need to search them and return the results. In this example, for each RSS item, we want to index the link, title, and description.

$doc = new Zend_Search_Lucene_Document();
	
$doc->addField(Zend_Search_Lucene_Field::Keyword('link', 
	sanitize($item->link())));

$doc->addField(Zend_Search_Lucene_Field::Text('title', 
	sanitize($item->title())));

$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', 
	sanitize($item->description())));

echo "\tAdding: ".$item->title()."\n";
$index->addDocument($doc);

Note: So I cheated a bit. Zend_Search_Lucene does not support non-ASCII characters at the moment, so my sanitize() function passes everything through htmlentities(). I'm also stripping tags because I'm interested in the textual content and not its presentation.
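
To see what sanitize() actually produces, here is a small standalone sketch. The sample strings are made up for illustration. sanitizeUtf8() is a hypothetical variant, not part of the tutorial's code, that assumes the feed text is UTF-8 and names the charset explicitly instead of relying on PHP's configured default.

```php
<?php

// Same function as in the indexing script.
function sanitize($input) {
	return htmlentities(strip_tags($input));
}

// Hypothetical variant: assumes UTF-8 input and says so explicitly.
function sanitizeUtf8($input) {
	return htmlentities(strip_tags($input), ENT_QUOTES, 'UTF-8');
}

// Tags are removed first, then special characters become entities.
echo sanitize('<b>Zend</b> & friends'), "\n";    // prints: Zend &amp; friends

// A UTF-8 "e acute" comes out as a plain ASCII entity name.
echo sanitizeUtf8("<p>caf\xC3\xA9</p>"), "\n";   // prints: caf&eacute;
```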

You'll notice that I've indexed each field using a different static method of the Zend_Search_Lucene_Field class. Let's look at the available field types and how they differ:

Field type   Stored?   Indexed?   Tokenized?   Binary?
Keyword      yes       yes        no           no
UnIndexed    yes       no         no           no
Binary       yes       no         no           yes
Text         yes       yes        yes          no
UnStored     no        yes        yes          no

Keyword fields are stored and indexed, meaning I can search them as well as display them in my search results. They are not split up into separate words by tokenization. My link field is a good candidate for a Keyword because I might want to search articles by link URL, and I definitely want to display the link in the search results since the link is serving as my external identifier for the document. Enumerated database fields usually translate well to Keyword fields in Zend_Search_Lucene.

It's usually a good idea to store an identifier for each document that can be used as a lookup mechanism in the search results. For this example, it makes sense to use the RSS item's link. If we were building an index from an existing relational database, we would want to store the primary key of the record, and if we were indexing a file system we would probably want to store the path to the file.

UnIndexed fields are not searchable, but they are returned with search hits. Database timestamps, primary keys, file system paths, and other external identifiers are good candidates for UnIndexed fields.

Binary fields are not tokenized or indexed, but are stored for retrieval with search hits. They can be used to store any data encoded as a binary string, such as an image icon.

Text fields are stored, indexed, and tokenized. Text fields are appropriate for storing information like subjects and titles that need to be searchable as well as returned with search results. In my example, the title of each RSS article is indexed as a Text field.

UnStored fields are tokenized and indexed, but not stored in the index. Large amounts of text are best indexed using this type of field. Storing data creates a larger index on disk, so if you need to search but not redisplay the data, use an UnStored field. In my example, the RSS description--the main body of text--is indexed as an UnStored field. UnStored fields are particularly practical when using a Zend_Search_Lucene index in combination with a relational database. You can index large data fields as UnStored fields for searching, and retrieve them from your relational database by using a separate field as an identifier.
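
Here is a rough sketch of that database pairing. It is an illustration under assumptions, not tested code: the $rows array stands in for the result of a database query, and the index path and the record_id field name are made up. The field is named record_id rather than id on the assumption that a search hit already exposes id and score properties of its own.

```php
<?php

require_once 'Zend/Search/Lucene.php';

// Hypothetical rows, standing in for records fetched from a database.
$rows = array(
	array('id' => 1, 'title' => 'Framework preview',
		'body' => 'The full text of an article about the framework.'),
	array('id' => 2, 'title' => 'Search basics',
		'body' => 'The full text of an article about indexing and search.'),
);

// Hypothetical index location; true creates a fresh index.
$index = new Zend_Search_Lucene('/tmp/articles_index', true);

foreach ($rows as $row) {
	$doc = new Zend_Search_Lucene_Document();

	// stored but not searchable: only used to find the row again
	$doc->addField(Zend_Search_Lucene_Field::UnIndexed('record_id', (string) $row['id']));

	// stored and searchable: shown in the result list
	$doc->addField(Zend_Search_Lucene_Field::Text('title', $row['title']));

	// searchable but not stored: the body stays in the database
	$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $row['body']));

	$index->addDocument($doc);
}
$index->commit();

foreach ($index->find('framework') as $hit) {
	// the body is not available here; fetch it from the database by record_id
	echo $hit->record_id . ': ' . $hit->title . "\n";
}
```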

It's also important to note that we named the field to store the description 'contents'. This is no accident. This is the field name that Zend_Search_Lucene will search by default. Internal discussion with the Framework development team is leading to the idea that Zend_Search_Lucene may break away from the Lucene norm and implement a simple way to search all fields instead of just the 'contents' field.
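
Until such a feature exists, one workaround is to fold every piece of text you want searched by default into the 'contents' field at indexing time. A minimal sketch, where combineForDefaultField() is a hypothetical helper and not part of the framework:

```php
<?php

/**
 * Hypothetical helper: joins several pieces of text into one string
 * suitable for the default 'contents' field. Empty pieces are dropped.
 */
function combineForDefaultField(array $parts) {
	$trimmed = array_map('trim', $parts);
	$nonEmpty = array_filter($trimmed, 'strlen');
	return implode(' ', $nonEmpty);
}

echo combineForDefaultField(array('Zend Framework 0.1.2', '', ' A new preview release. ')), "\n";
// prints: Zend Framework 0.1.2 A new preview release.

// In the indexing loop this could replace the description-only field:
// $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
//     combineForDefaultField(array($title, $description))));
```

The title can still be indexed separately as a Text field, so a query like title:zend keeps working.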

Searching the Index

Now that we have created a Zend_Search_Lucene index, let's put it to use by performing some searches. You can implement search on an index in just a couple dozen lines of code:

<?php

require_once 'Zend/Search/Lucene.php';

//open the index
$index = new Zend_Search_Lucene('/tmp/feeds_index');

$query = 'framework';

$hits = $index->find($query);

echo "Index contains ".$index->count()." documents.\n\n";

echo "Search for '".$query."' returned " .count($hits). " hits\n\n";

foreach ($hits as $hit) {
	echo $hit->title."\n";
	echo "\tScore: ".sprintf('%.2f', $hit->score)."\n";
	echo "\t".$hit->link."\n\n";
}

?>

Could it be any easier? We include the library, open our index, search for a term, and iterate through the result set. Note that since we used the default case-insensitive text analyzer to build the index, the search query should be lowercase.
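
If the query comes from a user rather than from a hard-coded string, it can be lowercased before it reaches find(). A minimal sketch, where normalizeQuery() is a hypothetical helper:

```php
<?php

// Hypothetical helper: trims and lowercases raw user input.
function normalizeQuery($input) {
	return strtolower(trim($input));
}

echo normalizeQuery('  Zend Framework '), "\n";   // prints: zend framework

// Usage with the index opened in the search script:
// $hits = $index->find(normalizeQuery($userInput));
```

This lowercases the whole string, field names included, which is harmless here because the field names in this index are already lowercase.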

The Zend_Search_Lucene query format is powerful but simple. It's a snap to specify multiple query terms with a special syntax.

To search our RSS index for articles that must contain the word 'framework' in the 'contents' field:

$query = '+framework';

For articles with 'Zend' in the title:

$query = 'title:zend';

For articles containing the word 'framework' but without the word 'Zend' in the title:

$query = 'framework -title:zend';
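
Query strings like these can also be assembled from separate term lists, for example from the fields of a search form. The sketch below uses a hypothetical buildQuery() helper and only the three forms shown above: a bare term, a '+' prefix for required terms, and a '-' prefix for excluded terms.

```php
<?php

/**
 * Hypothetical helper: builds a query string from lists of terms.
 * A term may be written as 'field:word' to target a field other than
 * the default 'contents' field. All terms are lowercased.
 */
function buildQuery(array $optional, array $required = array(), array $excluded = array()) {
	$parts = array();
	foreach ($optional as $term) {
		$parts[] = strtolower($term);
	}
	foreach ($required as $term) {
		$parts[] = '+' . strtolower($term);
	}
	foreach ($excluded as $term) {
		$parts[] = '-' . strtolower($term);
	}
	return implode(' ', $parts);
}

echo buildQuery(array('framework'), array(), array('title:Zend')), "\n";
// prints: framework -title:zend
```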

Conclusion

In these simple examples, we have seen that the Zend_Search_Lucene module provides an easy way to add customized search functionality to any PHP application without a dependence on external software packages. As the Zend_Search_Lucene module matures, it will no doubt prove to be a prized component of the Zend Framework. In future articles I hope to explore advanced indexing and search capabilities of Zend_Search_Lucene, and put the module through some real-life benchmarks using large data sets, comparing indexing and search performance against some other currently popular methods.
