Introduction
On several occasions while developing database-driven web applications, I’ve been
approached by clients who want Google-style search implemented at the last minute
of the development cycle. Usually this leads to a canned script that
crawls the website, or a hacked-up search function that uses the database but
returns either too many results or none at all. On top of that, the queries
it performs are too numerous or too slow.
Until now, most developers have been forced to use relational databases to
power search, install extra component packages, or seek out other non-PHP solutions.
The problem with relying on a relational database feature such as MySQL’s full-text indexing
is that scalability problems crop up as your search criteria become more complicated.
One of the features that sets the Zend Framework apart from the others is the
inclusion of a decent search module. Zend_Search_Lucene is a PHP port of the
Apache Lucene project, a full-text search engine framework. Zend_Search_Lucene
promises a simple way to add search functionality to an application without
requiring additional PHP extensions or even a database.
Zend_Search_Lucene overcomes the usual limitations of relational databases with features
such as fast indexing, ranked result sets, a powerful but simple query syntax,
and the ability to index multiple fields. Better still, a Zend_Search_Lucene index can
live happily alongside your relational database to provide fast searching but
without duplicating the effort of storing all of your data twice. In this tutorial,
I’ll show you how to use Zend_Search_Lucene to index and search some RSS feeds.
Creating the Index
Before you can search your data, you have to create an index. One advantage
of a Zend_Search_Lucene index is that it is binary compatible with Lucene version 1.4
and above, meaning you can create indexes with other
ports of Lucene and search them with Zend_Search_Lucene. A Zend_Search_Lucene index is made up
of Documents (think database records), which have Fields, and Fields hold the
content that is searched.
In this example, we want to grab a few PHP-related RSS feeds and index their
contents so we can search them. To make things easy, I’ll also use the Zend_Feed
module so I don’t have to deal with all that low-level XML business.
Note: This tutorial uses the 0.1.2 tagged release of the Zend Framework
<?php
require_once 'Zend/Feed.php';
require_once 'Zend/Search/Lucene.php';

function sanitize($input) {
    return htmlentities(strip_tags($input));
}

//create the index
$index = new Zend_Search_Lucene('/tmp/feeds_index', true);

$feeds = Array('http://feeds.feedburner.com/ZendDeveloperZone',
               'http://www.planet-php.net/rss/',
               'http://www.sitepoint.com/blogs/category/php/feed/',
               );

//grab each feed
foreach ($feeds as $feed) {
    $channel = Zend_Feed::import($feed);
    echo $channel->title()."\n";

    // index each item
    foreach ($channel->items as $item) {
        if ($item->link() && $item->title() && $item->description()) {
            $doc = new Zend_Search_Lucene_Document();
            $doc->addField(Zend_Search_Lucene_Field::Keyword('link',
                sanitize($item->link())));
            $doc->addField(Zend_Search_Lucene_Field::Text('title',
                sanitize($item->title())));
            $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                sanitize($item->description())));
            echo "\tAdding: ".$item->title()."\n";
            $index->addDocument($doc);
        }
    }
}

$index->commit();
echo $index->count()." Documents indexed.\n";
?>
The first step after we include the framework files is to actually create the
Zend_Search_Lucene object and specify the location to store it. The second parameter indicates
that we want to create a fresh index:
//create the index
$index = new Zend_Search_Lucene('/tmp/feeds_index', true);
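If the indexing script runs more than once, passing true every time starts a fresh index on each run. Here is a minimal sketch of deciding the flag at run time. The helper name index_needs_creating() is my own invention, and the check assumes that a created index directory contains a file named "segments", which is what readers report seeing with this release rather than a documented guarantee.

```php
<?php
// Hypothetical helper (not part of the framework): decide whether the
// index still has to be created. Assumes an existing index directory
// contains a file named "segments".
function index_needs_creating($indexPath)
{
    return !file_exists($indexPath . '/segments');
}

$indexPath = '/tmp/feeds_index';
$create = index_needs_creating($indexPath);
echo $create ? "create a fresh index\n" : "open the existing index\n";

// With the framework loaded, the flag would be passed straight through:
// $index = new Zend_Search_Lucene($indexPath, $create);
```

The constructor call is left commented out so the sketch runs without the framework installed.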
Next, we specify the RSS feeds we are interested in and fetch them in a loop.
Then, for each feed, we loop through the articles and index each one as a separate
Zend_Search_Lucene document. Here is the feed fetching and looping code once
again, so you can distinguish the feed processing from the indexing. Note that
in these code examples I’ve omitted most error checking for the sake of clarity.
$feeds = Array('http://feeds.feedburner.com/ZendDeveloperZone',
               'http://www.planet-php.net/rss/',
               'http://www.sitepoint.com/blogs/category/php/feed/',
               );

//grab each feed
foreach ($feeds as $feed) {
    $channel = Zend_Feed::import($feed);
    echo $channel->title()."\n";

    // index each item
    foreach ($channel->items as $item) {
        if ($item->link() && $item->title() && $item->description()) {
            //Create and index a Zend_Search_Lucene document
        }
    }
}
To add a document to our index, we create the document object and specify content
for the document’s fields. Zend_Search_Lucene provides different ways to analyze and store
fields depending on how we need to search them and return the results. In this
example, for each RSS item, we want to index the link, title, and description.
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Keyword('link',
    sanitize($item->link())));
$doc->addField(Zend_Search_Lucene_Field::Text('title',
    sanitize($item->title())));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
    sanitize($item->description())));
echo "\tAdding: ".$item->title()."\n";
$index->addDocument($doc);
Note: So I cheated a bit. Zend_Search_Lucene does not support non-ASCII characters
at the moment, so my sanitize() function passes everything through htmlentities().
I’m also stripping tags because I’m interested in the textual content and not
its presentation.
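One thing to watch with strip_tags(): it removes tags without leaving anything in their place, so two words separated only by markup get glued together ("birds<br>fly" becomes "birdsfly"). Here is a rough sketch of a variant that replaces each tag with a space instead. The name sanitize_spaced() is hypothetical, the tag pattern is deliberately naive, and I'm assuming the feed text is UTF-8.

```php
<?php
// Hypothetical variant of sanitize(): replace each tag with a space so
// that words separated only by markup stay separate words.
// Assumes UTF-8 input; the tag regex is a simple sketch, not an HTML parser.
function sanitize_spaced($input)
{
    $text = preg_replace('/<[^>]*>/', ' ', $input);   // every tag becomes a space
    $text = trim(preg_replace('/\s+/', ' ', $text));  // collapse runs of whitespace
    return htmlentities($text, ENT_QUOTES, 'UTF-8');  // encode non-ASCII as entities
}

echo sanitize_spaced('birds<br>fly'), "\n"; // prints "birds fly"
```

You could use it in place of sanitize() for the description field, where embedded markup is most likely.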
You’ll notice that I’ve indexed each field using a different static method
of the Zend_Search_Lucene_Field class. Let’s look at the available field types
and how they differ:
| Field type | Value stored? | Indexed? | Tokenized? | Binary? |
|---|---|---|---|---|
| Keyword | yes | yes | no | no |
| UnIndexed | yes | no | no | no |
| Binary | yes | no | no | yes |
| Text | yes | yes | yes | no |
| UnStored | no | yes | yes | no |
Keyword fields are stored and indexed, meaning I can search
them as well as display them back in my search results. They are not split up
into separate words by tokenization. My link field is a good candidate for a
Keyword because I might want to search articles by link URL, and I definitely
want to display the link in the search results since the link is serving as
my external identifier for the document. Enumerated database fields usually
translate well to Keyword fields in Zend_Search_Lucene.
It’s usually a good idea to store an identifier for each document that can
be used as a lookup mechanism in the search results. For this example, it makes
sense to use the RSS item’s link. If we were building an index from an existing
relational database, we would want to store the primary key of the record, and
if we were indexing a file system we would probably want to store the path to
the file.
UnIndexed fields are not searchable, but they are returned
with search hits. Database timestamps, primary keys, file system paths, and
other external identifiers are good candidates for UnIndexed fields.
Binary fields are not tokenized or indexed, but are stored
for retrieval with search hits. They can be used to store any data encoded as
a binary string, such as an image icon.
Text fields are stored, indexed, and tokenized. Text fields
are appropriate for storing information like subjects and titles that need to
be searchable as well as returned with search results. In my example, the title
of each RSS article is indexed as a Text field.
UnStored fields are tokenized and indexed, but not stored
in the index. Large amounts of text are best indexed using this type of field.
Storing data creates a larger index on disk, so if you need to search but not
redisplay the data, use an UnStored field. In my example, the RSS description
(the main body of text) is indexed as an UnStored field. UnStored fields are particularly
practical when using a Zend_Search_Lucene index in combination with a relational database.
You can index large data fields with UnStored fields for searching, and retrieve
them from your relational database by using a separate field as an identifier.
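To make the choice between field types a little more mechanical, the table above can be written down as data. This is only an illustrative sketch: the $fieldTypes array mirrors the table, and field_type_for() is a made-up helper, not something the framework provides.

```php
<?php
// The field-type table as data. Each entry lists, in order:
// value stored?, indexed?, tokenized?, binary?
$fieldTypes = array(
    'Keyword'   => array(true,  true,  false, false),
    'UnIndexed' => array(true,  false, false, false),
    'Binary'    => array(true,  false, false, true),
    'Text'      => array(true,  true,  true,  false),
    'UnStored'  => array(false, true,  true,  false),
);

// Hypothetical helper: return the name of the field type that matches
// the given requirements, or null if no type in the table matches.
function field_type_for(array $fieldTypes, $stored, $indexed, $tokenized, $binary = false)
{
    $wanted = array($stored, $indexed, $tokenized, $binary);
    $name = array_search($wanted, $fieldTypes, true); // strict comparison
    return $name === false ? null : $name;
}

// Searchable body text that is fetched from the database for display:
echo field_type_for($fieldTypes, false, true, true), "\n";  // prints "UnStored"
// An identifier that is searched and displayed as a single value:
echo field_type_for($fieldTypes, true, true, false), "\n";  // prints "Keyword"
```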
It’s also important to note that we named the field to store the description
‘contents’. This is no accident. This is the field name that Zend_Search_Lucene will search
by default. Internal discussion with the Framework development team is leading
to the idea that Zend_Search_Lucene may break away from the Lucene norm and implement a
simple way to search all fields instead of just the ‘contents’ field.
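Until something like that exists, one workaround is to concatenate every piece of text you want searchable by default and index the result under ‘contents’. The sketch below shows only the string handling. combined_contents() is a hypothetical helper, and the commented-out line assumes the $doc object and sanitize() function from the indexing script.

```php
<?php
// Hypothetical helper: join the pieces of text that should all be
// searchable by default into one string for the 'contents' field.
// Empty or whitespace-only pieces are dropped.
function combined_contents(array $parts)
{
    $parts = array_filter(array_map('trim', $parts), 'strlen');
    return implode(' ', $parts);
}

$contents = combined_contents(array(
    'Zend Framework released',         // e.g. the item title
    '',                                // a missing piece is skipped
    ' A new preview release is out. ', // e.g. the item description
));
echo $contents, "\n";

// With the framework loaded, the combined text could be indexed as:
// $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', sanitize($contents)));
```

The trade-off is that a match in the title can no longer be told apart from a match in the body, so it still makes sense to index the title as its own Text field as well.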
Searching the Index
Now that we have created a Zend_Search_Lucene index, let’s put it to use by performing
some searches. You can implement search on an index in just a couple dozen lines
of code:
<?php
require_once 'Zend/Search/Lucene.php';

//open the index
$index = new Zend_Search_Lucene('/tmp/feeds_index');

$query = 'framework';
$hits = $index->find($query);

echo "Index contains ".$index->count()." documents.\n\n";
echo "Search for '".$query."' returned " .count($hits). " hits\n\n";

foreach ($hits as $hit) {
    echo $hit->title."\n";
    echo "\tScore: ".sprintf('%.2f', $hit->score)."\n";
    echo "\t".$hit->link."\n\n";
}
?>
Could it be any easier? We include the library, open our index, search for a
term, and iterate through the result set. Note that since we used
the default case-insensitive text analyzer to build the index, the search query
should be lowercase.
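Since users will not type their queries in lowercase for you, it is worth normalizing the input before handing it to find(). Here is a minimal sketch; normalize_query() is a hypothetical helper, and it assumes plain term queries where lowercasing the whole string is harmless.

```php
<?php
// Hypothetical helper: trim and lowercase a user-supplied search string
// so that it matches terms produced by the case-insensitive analyzer.
// Assumes simple term queries with nothing case-sensitive in them.
function normalize_query($input)
{
    return strtolower(trim($input));
}

echo normalize_query('  Framework '), "\n"; // prints "framework"

// With the index from the search script open, this would be used as:
// $hits = $index->find(normalize_query($userInput));
```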
The Zend_Search_Lucene query format is powerful but simple. It’s a snap to specify multiple
query terms with a special syntax.
To search our RSS index for articles that must contain the word ‘framework’
in the ‘contents’ field:
$query = '+framework';
For articles with ‘Zend’ in the title:
$query = 'title:zend';
For articles containing the word ‘framework’ but without the word ‘Zend’
in the title:
$query = 'framework -title:zend';
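If the query comes from a search form with separate inputs, you can assemble strings like these in code. The following is just a sketch that covers the three forms shown above (required term, field-qualified term, excluded term). build_query() and its clause format are my own invention, not part of Zend_Search_Lucene.

```php
<?php
// Hypothetical helper: build a query string from a list of clauses.
// Each clause is array(term, field or null, sign), where sign is
// '+' (required), '-' (excluded) or '' (optional).
function build_query(array $clauses)
{
    $parts = array();
    foreach ($clauses as $clause) {
        list($term, $field, $sign) = $clause;
        $prefix = ($field === null) ? '' : $field . ':';
        // Terms are lowercased to match the case-insensitive analyzer.
        $parts[] = $sign . $prefix . strtolower($term);
    }
    return implode(' ', $parts);
}

echo build_query(array(
    array('framework', null, ''),
    array('Zend', 'title', '-'),
)), "\n"; // prints "framework -title:zend"
```

The resulting string can be passed to $index->find() exactly like the hand-written queries.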
Conclusion
In these simple examples, we have seen that the Zend_Search_Lucene module provides an
easy way to add customized search functionality to any PHP application without
a dependence on external software packages. As the Zend_Search_Lucene module matures, it
will no doubt prove to be a prized component of the Zend Framework. In future
articles I hope to explore advanced indexing and search capabilities of Zend_Search_Lucene,
and put the module through some real-life benchmarks using large data sets,
comparing indexing and search performance against some other currently popular
methods.




March 29, 2006 at 1:18 am
Current (0.1.2) version of Zend Framework works only with single-byte latin encoding?
March 29, 2006 at 9:28 am
This came up on the mailing list this morning. No resolution has been proposed yet. It is however, a known limitation and it is understood that it needs to be high on the list of features to add.
=C=
p.s Great article John!
March 30, 2006 at 3:57 pm
Great article!
UTF-8 indexing is a must have feature! Otherwise, it would be useless for other languages than English.
March 31, 2006 at 1:22 am
Hi,
is it possible to read the index file(s) created by the dotlucene port with the php-lucene-port?
March 31, 2006 at 3:10 am
Yes. As I mentioned in the article, “it is binary compatible Lucene version 1.4 and above, meaning you can create indexes with other ports of Lucene and search them with Zend_Search_Lucene.”
March 31, 2006 at 6:44 am
I read this part just after posting the comment. Next time I will read more carefully
March 31, 2006 at 4:01 pm
Actually, I should have said WILL be compatible. If you are interested in using other Lucene implementations, keep an eye on the mailing list for compatibility updates.
April 4, 2006 at 3:07 am
When we can expect Zend_Search_Lucene to support UTF-8 ?
April 17, 2006 at 1:39 pm
I am having trouble getting zend_search to work.
Basically, I am inserting records into mysql and indexing records using zsearch.
Here are the steps->
1. insert first record:
inserted into mySQL
indexed (segments file and _0.cfs file created)
2. insert second record:
inserted into mySQL
indexed (segments file updated and _1.cfs file created).
*** in the segments file, _0 has been replaced with _1 ***
3. insert third record:
inserted into mySQL
** I get a Zend_Search_Exception -> "Error!: failed to open stream: No such file or directory"
for first record, I create a new index using ->
$index = new ZSearch('data/my_index', true);
for other records, I open the index using ->
$index = new ZSearch('data/my_index');
Question: I think the problem is that a new document is not inserted into the segments file(it is getting replaced).
How do I fix this?
- Thanks.
***********************************************
os: Windows 2000(localhost)
php5, mySQL 5.0.18
April 20, 2006 at 1:05 pm
Hello,
Don’t know if it’s the right place to put this, but i have a few questions about the Zend search engine.. It seems to be working
very well so far, yet i’m frustrated about a couple of features i can’t get to get working – or maybe aren’t they implemented yet?
They are the following :
* score calculation :
(total number of terms contained in a doc), any idea?
– boost factors for fields and documents don’t seem to be stored in the index, or aren’t properly loaded at search time,
and/or aren’t used in the score calculation formula…
– the lengthNorm(fieldName, numTerms) method doesn’t seem to be called at any time, i’d be glad to add that factor myself, but
can’t catch a glimpse of numTerms value
* query types :
– is it possible to mix term / multiterm queries together with phrase queries, like in searches like "my phrase" + word1 + word2?
– how to search for a phrase in docs that contain several fields? (title + contents for instance)
Any suggestion will be more than welcome!
Thanks in advance,
–darma5
PS. About unsupported non-ASCII characters, I also used the htmlentities function before providing the index with any data for storage,
but it’s important to "clean" your text before indexing it, in any other language than english!
Thus I suggest : html_entity_decode + transform special chars (é=>e, è=>e, etc.) + strtolower your data on entering your analyzer’s
‘tokenize’ method. It’s too late to do that in your TokenFilter since ‘&’ and ‘;’ will break all your words that contain special characters
prior to the TokenFilter’s ‘normalize’ method… (don’t forget to apply the same filters to the searched terms)
April 21, 2006 at 7:27 am
As the search module is currently a moving target, please submit support questions to the framework mailing list
June 26, 2006 at 7:19 am
hi
i tried zend lucene search and it’s working fine.but when i search for numbers inside a text it won’t work..any idea where i am missing..i am giving a smal coding below;
<?
require_once 'Zend/Search/Lucene.php';
$index = new Zend_Search_Lucene('index', true);
$doc = new Zend_Search_Lucene_Document();
$data = "ann is 5 years old";
$doc->addField(Zend_Search_Lucene_Field::Text('contents', $data));
$index->addDocument($doc);
$index->commit();
?>
<?
$index = new Zend_Search_Lucene('index');
echo "Index contains {$index->count()} documents.\n";
$search = "ann"; //--- INSTEAD OF THIS IF I GIVE 5 IT WILL RETURN ZERO RESULTS
$hits = $index->find($search);
foreach ($hits as $hit)
{
    echo str_repeat('-', 80) . "–<br>";
    echo 'ID: ' . $hit->id ."<br>";
    echo 'Score: ' . sprintf('%.2f', $hit->score) ."<br>";
    $document = $hit->getDocument();
    echo $document->getFieldValue('contents');
}
?>
How can i search for number 5 inside the text.
July 14, 2006 at 7:28 am
Hi everyone!
Im having problems trying to get this script works. When i run the script its creates 2 files ( segments of 20kb and deletable of 4kb) and then stop. No errors shows but when i hexaedit the created files they are always empty. It happens with every script i run to create an index. I cant figure out where the problem is !!! Please tell me what could be wrong !!! Thanks in advance.
Facundo.
July 22, 2006 at 3:00 am
Hi all,
When I run upon this search class, I recalled what I missed while writing those inneficient sql search queries. But there is always a but
$index = new Zend_Search_Lucene('test-index',true);
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('test_id','1'));
$doc->addField(Zend_Search_Lucene_Field::Text('test_name','something'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('test_description','something other word'));
$index->addDocument($doc);
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('test_id','2'));
$doc->addField(Zend_Search_Lucene_Field::Text('test_name','other bear'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('test_description','something other creative'));
$index->addDocument($doc);
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('test_id','3'));
$doc->addField(Zend_Search_Lucene_Field::Text('test_name','phprulez'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('test_description','one php rules them all'));
$index->addDocument($doc);
$index->commit();
echo $index->count()." Documents indexed.\n";
$index = new Zend_Search_Lucene('test-index');
$query = 'something';
$hits = $index->find($query);
foreach ($hits as $hit) {
    echo $hit->score;
    echo $hit->test_name;
    echo $hit->test_id;
}
So when run this code I got no results. I checked index and documents with luke and they seem to be ok. Then after consulting help files I noticed that for some reason (default searchable) field name ‘contents’ is used. When I replaced ‘test_description’ with ‘contents’ I have got some results but only for searches with words for that mentioned field. I would like to make search on ‘test_name’ AND ‘test_description’, thats why I created it like Text field. I tried query builder API and that doesnt helped neither. It seems to me that it espects ‘contents’ field and willing to search only it. Any ideas?
August 7, 2006 at 8:11 am
Hi I started to use Lucene and I have a few problems.
- Only get results for contents (Unstored) field
- Title and category (Text) fields are ignored.
- When I open index files with luke I get "read past EOF" error, but I can browse indexes and search them
- Browse by term shows terms for title field As: "something wierd that this comment editor recognizes as XHTML tags so I can’t post it here" (question marks, part of some terms, boxes and non english alphabet characters)
PHP is 5.1.4 Zend core for oracle
Zend framework is 0.1.5
Any ideas? Is there some limitations with field length or something like that? What is "read past EOF"? Any pdf and/or Word parsers for Zend Search lucene
August 9, 2006 at 9:43 am
Hi there,
i would like to know whether Zend search is capable of indexing and searching .txt,.doc,.pdf,.xml files?
which are stored in the server.
pls do reply..
thanx a lot
..scan
September 26, 2006 at 9:25 am
Hello everybody,
I tested the example of this page and it works fine on my pc with the wampserver.
When I upload the data to my provider an try to search in the index I get an Error like this:
Parse error: syntax error, unexpected T_STRING, expecting T_OLD_FUNCTION or T_FUNCTION or T_VAR or ‘}’ in /home/sn/public_html/lucenetestphp/Zend/Search/Lucene.php on line 69
I have no idea what the reason fpr this problem is.
Greetings
December 15, 2006 at 1:50 am
http://www.ctrick.com, Hi I,m moises from panamna, I think that article is so good
December 31, 2006 at 2:59 pm
The default analyzer will treat all non alpha characters as "white space", skipping all numbers, turning words containing numbers into seperate words. For example Kuro5hin.org becomes Kuro hin org , so searching for Kuro5hin will return 0 hits.
In the Zend Framework manual, section 15 has an example of a custom analyzer that will treat words with digits as one term: http://framework.zend.com/manual/en/zend.search.extending.html#zend.search.extending.analysis
March 9, 2007 at 3:21 am
i read it from error logs file
any idea ?
[Fri Mar 9 11:11:07 2007] [error] PHP Warning: require_once(Zend/Search/Lucene.php) [<a href='function.require-once'>function.require-once</a>]: failed to open stream: No such file or directory in /home/tenfoon/public_html/PosterFeed1/lucene.php on line 5
[Fri Mar 9 11:11:07 2007] [error] PHP Fatal error: require_once() [<a href='function.require'>function.require</a>]: Failed opening required ‘Zend/Search/Lucene.php’ (include_path=’.:/usr/local/lib/php:/usr/local/lib/php/library’) in /home/tenfoon/public_html/PosterFeed1/lucene.php on line 5
September 28, 2007 at 1:11 am
I’m a newbie, so this might be obvious to many, but for other newb’s like me:
–If you use strip_tags() from the example, and have text like (w/o quotes) "birds<br>fly", you’ll end up with "birdsfly" in your search. "Birds" won’t be found. Code your own tag replacement.
–I had the same problem as a lot of others where it’s only searching the title field. Therefore I concatinated all fields I wanted to search to the title, like: title . " " . body . " " . tags. And used Unstored. Granted, I’m using a relational database, so the only thing I need to store is the key to reference the record in the database.
Best wishes!
October 31, 2007 at 6:05 pm
is it possible to customize the Zend_Search_Lucene scoring algorithm so it works more like this script:
http://www.iamcal.com/publish/articles/php/search/
each one of my index entries only contains on average 5-6 words. the default behavior of Zend_Search_Lucene has a very hard time with small amounts of text. For example, if i have index entries of:
Red Hook Pilsner
Samuel Adams Pilsner
Sierra Nevada Pilsner
Widmer Pilsner
and i search for "Red Hook Pilsner" chances are i’ll get one of the other entries like "Widmer Pilsner" as my first result. i just want the engine to work based on the principle of mysql RLIKE matches.
is this possible?
Tim
December 10, 2008 at 4:56 pm
I’ve been using Lucene (java) for many years so I thought I would give it a try in PHP. Unfortunately I seem to be getting totally random results.
I’m basically using lucene to index the ‘name’ column of a database table i.e. i store the pk as a keyword and the name as an unstored field in the document. e.g.
$doc->addField(Zend_Search_Lucene_Field::Keyword('id', $clipart['id']));
$doc->addField(Zend_Search_Lucene_Field::Text('name', $clipart['name']));
log:
Dec 10 16:46:02 [info] indexing 282 : gift7
Dec 10 16:46:02 [info] indexing 283 : gift8
Dec 10 16:46:02 [info] indexing 284 : gift9
Dec 10 16:46:03 [info] indexing 416 : hat
Then I search against the index using a query of "name:hat"
but I get
Dec 10 16:46:53 search hits [debug] 282
I would expect to get a hit back for 416 (hat)
Has anyone else experienced something like this?