Lucene Tutorial

Lucene is an extremely rich and powerful full-text search library written in Java. You can use Lucene to provide full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, and so on). In this tutorial, we'll go through the basics of using Lucene to add full-text search functionality to a fairly typical J2EE application: an online accommodation database. The main business object is the Hotel class. In this tutorial, a Hotel has a unique identifier, a name, a city, and a description.

Roughly, supporting full-text search using Lucene requires two steps: (1) creating a Lucene index on the documents and/or database objects and (2) parsing the user query and looking it up in the prebuilt index to answer the query. In the first part of this tutorial, we learn how to create a Lucene index. In the second part, we learn how to use the prebuilt index to answer user queries.

For your convenience, all of the code for this article's Lucene demo is included in the lucene-tutorial.zip file. In this demo, the class Indexer in src/lucene/demo/search/Indexer.java is responsible for creating the index. The class SearchEngine in src/lucene/demo/search/SearchEngine.java is responsible for supporting user queries. The class Main in src/lucene/demo/Main.java contains test code that builds a Lucene index using a small dataset (the actual data is provided by the HotelDatabase class in src/lucene/demo/business/HotelDatabase.java) and performs a simple keyword query on the data using the index. Briefly go over the two Java source files, Indexer.java and SearchEngine.java, to familiarize yourself with the overall structure of the code.

1. Creating an Index

The first step in implementing full-text searching with Lucene is to build an index. Here's a simple attempt to diagram how the Lucene classes go together when you create an index:

Index
    Document 1
        Field A (name/value)
        Field B (name/value)
    Document 2
        Field A (name/value)
        Field B (name/value)

At the heart of Lucene is an Index. You pump your data into the Index, then do searches on the Index to get results out. Document objects are stored in the Index, and it is your job to "convert" your data into Document objects and store them in the Index. That is, you read in each data item (a file, Web document, database tuple, or whatever), instantiate a Document for it, break the data down into chunks, and store the chunks in the Document as Field objects (name/value pairs). When you're done building a Document, you write it to the Index using the IndexWriter. Now let us get into the details of how this is done.
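To make the Index / Document / Field relationship concrete, here is a minimal plain-Java sketch of the idea. It is a toy model, not Lucene's API or its internal storage format: the class name ToyIndex, its methods, and the sample hotel values are all made up for illustration, and the only "analysis" it performs is lowercasing and splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: hypothetical names, not Lucene classes.
class ToyIndex {
    // Each "document" is a set of fields: field name -> field value.
    private final List<Map<String, String>> documents = new ArrayList<>();
    // Inverted index: "field:term" -> ids of the documents containing that term.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Adds a document and returns its id (its position in the index). */
    int addDocument(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(fields);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String term : field.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>())
                        .add(docId);
            }
        }
        return docId;
    }

    /** Returns the ids of the documents whose given field contains the term. */
    Set<Integer> search(String field, String term) {
        Set<Integer> hits = postings.get(field + ":" + term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    /** Returns the stored fields of a document. */
    Map<String, String> getDocument(int docId) {
        return documents.get(docId);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();

        Map<String, String> doc1 = new LinkedHashMap<>();
        doc1.put("name", "Hotel Rivoli");
        doc1.put("description", "A beautiful hotel near the river");
        index.addDocument(doc1);

        Map<String, String> doc2 = new LinkedHashMap<>();
        doc2.put("name", "Grand Plaza");
        doc2.put("description", "A quiet hotel in the old town");
        index.addDocument(doc2);

        System.out.println(index.search("description", "beautiful")); // prints [0]
        System.out.println(index.search("description", "hotel"));     // prints [0, 1]
        System.out.println(index.search("name", "hotel"));            // prints [0]
    }
}
```

The point of the sketch is the shape of the data: documents are bags of named fields, and a search is a lookup from a (field, term) pair to document ids. Lucene's real index adds analysis, scoring, and an efficient on-disk format on top of this idea.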

1.1 IndexWriter Class: Creating Index

To create an index, the first thing you need to do is create an IndexWriter object. The IndexWriter object is used to create the index and to add new index entries (i.e., Documents) to this index. You can create an IndexWriter as follows:

Directory indexDir = FSDirectory.open(new File("index-directory"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_2, new StandardAnalyzer());
IndexWriter indexWriter = new IndexWriter(indexDir, config);

Note that the IndexWriter constructor takes two parameters, indexDir and config, which are Directory and IndexWriterConfig objects, respectively. The first parameter, indexDir, specifies the directory in which the Lucene index will be created, which is index-directory in this case. The second parameter specifies the "configuration" of our index: the version of our Lucene library (4.10.2) and the "document analyzer" to be used when Lucene indexes your data. Here, we are using the StandardAnalyzer for this purpose. More details on Lucene analyzers follow shortly.

1.2 Analyzer Class: Parsing the Documents

Most likely, the data that you want Lucene to index is plain English text. The job of the Analyzer is to "parse" each field of your data into indexable "tokens" or keywords. Several types of analyzers are provided out of the box. Table 1 shows some of the more interesting ones.

Table 1 Lucene analyzers.

Analyzer	Description
`StandardAnalyzer`	A sophisticated general-purpose analyzer.
`WhitespaceAnalyzer`	A very simple analyzer that just separates tokens using white space.
`StopAnalyzer`	Removes common English words that are not usually useful for indexing.
`SnowballAnalyzer`	An interesting experimental analyzer that works on word roots (a search on rain should also return entries with raining, rained, and so on).

There are even a number of language-specific analyzers, including analyzers for German, Russian, French, Dutch, and others.

It isn't difficult to implement your own analyzer, though the standard ones often do the job well enough. When you create an IndexWriter, you have to specify which Analyzer you will use for the index as we did before. In our previous example, we used the StandardAnalyzer as the document analyzer.
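To give a feel for what "parsing a field into tokens" means, here is a small plain-Java sketch of two analysis strategies. It only imitates the general behavior described in Table 1 and does not use Lucene's Analyzer API: the class ToyAnalyzers, its method names, and the six-word stop list are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy imitation of analyzer behavior; hypothetical names, not Lucene classes.
class ToyAnalyzers {
    // A tiny illustrative stop list; real analyzers ship their own lists.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the"));

    /** Splits on whitespace only; case and punctuation are kept as-is. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Lowercases, splits on anything that is not a letter or digit, and drops stop words. */
    static List<String> lowercaseWithStopWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "A beautiful Hotel in the heart of Paris.";
        // prints [A, beautiful, Hotel, in, the, heart, of, Paris.]
        System.out.println(whitespace(text));
        // prints [beautiful, hotel, heart, paris]
        System.out.println(lowercaseWithStopWords(text));
    }
}
```

Note how the two strategies produce different terms for the same text ("Paris." versus "paris"). The choice of analyzer therefore determines which query terms can later match.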

1.3 Adding a Document/object to Index

Now you need to index your documents or business objects. To index an object, you use the Lucene Document class, to which you add the fields that you want indexed. As we briefly mentioned before, a Lucene Document is basically a container for a set of indexed fields. This is best illustrated by an example:

Document doc = new Document();
doc.add(new StringField("id", "Hotel-1345", Field.Store.YES));
doc.add(new TextField("description", "A beautiful hotel", Field.Store.YES));

In the above example, we add two fields, "id" and "description", with the respective values "Hotel-1345" and "A beautiful hotel" to the document.

More precisely, to add a field to a document, you create a new instance of the Field class, which can be either a StringField or a TextField (the difference between the two will be explained shortly). A field object takes the following three parameters:

Field name: This is the name of the field. In the above example, the field names are "id" and "description".
Field value: This is the value of the field. In the above example, the values are "Hotel-1345" and "A beautiful hotel". A value can be a String, as in our example, or, for a TextField, a Reader if the object to be indexed is a file. (The TextField constructor that takes a Reader has no storage flag; a value supplied this way is indexed but not stored.)
Storage flag: The third parameter specifies whether the actual value of the field must be stored in the Lucene index or can be discarded after it is indexed. Storing the value is useful if you need it later, for example to display it in the search result list or to look up a tuple in a database table. If the value must be stored, use Field.Store.YES. If you don't need to store the value, use Field.Store.NO. (Older Lucene releases also offered Field.Store.COMPRESS; it is not available in the 4.x API used here.)

StringField vs TextField: In the above example, the "id" field contains the ID of the hotel, which is a single atomic value. In contrast, the "description" field contains an English text, which should be parsed (or "tokenized") into a set of words for indexing. Use StringField for a field with an atomic value that should not be tokenized. Use TextField for a field that needs to be tokenized into a set of words.
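The practical difference between an atomic field and a tokenized field can be sketched in plain Java as follows. This is a toy model under stated assumptions, not Lucene code: FieldKinds and its two methods are hypothetical names, and the tokenization rule (lowercase, then split on non-alphanumeric characters) is a simplification chosen for the example.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Toy model of the two field kinds; hypothetical names, not Lucene classes.
class FieldKinds {

    /** Models an untokenized field: the whole value becomes a single term. */
    static Set<String> atomicTerms(String value) {
        return Collections.singleton(value);
    }

    /** Models a tokenized field: lowercase, then split on non-alphanumeric characters. */
    static Set<String> tokenizedTerms(String value) {
        Set<String> terms = new LinkedHashSet<>();
        for (String term : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(atomicTerms("Hotel-1345"));           // prints [Hotel-1345]
        System.out.println(tokenizedTerms("Hotel-1345"));        // prints [hotel, 1345]
        System.out.println(tokenizedTerms("A beautiful hotel")); // prints [a, beautiful, hotel]
    }
}
```

An identifier indexed atomically can only be found by its exact value, which is what you want for a key. Conversely, a field indexed atomically cannot be found by one of the words inside it, so any field that should support keyword search needs to be tokenized.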

For our hotel example, we just want some fairly simple full-text searching. So we add the following fields:

The hotel identifier (or the key to the hotel tuple), so we can retrieve the corresponding hotel object from the database later once we obtain the query result list from the Lucene index.
The hotel name, which we need to display in the query result lists.
The hotel city, if we need to display this information in the query result lists.
Composite text containing the important fields of the Hotel object:
- Hotel name
- Hotel city
- Hotel description
We want full-text indexing on this field. We don't need to display the indexed text in the query results, so we use Field.Store.NO to save index space.

Here's the method in the Indexer class in our demo that indexes a given hotel:

public void indexHotel(Hotel hotel) throws IOException {
    IndexWriter writer = getIndexWriter(false);
    Document doc = new Document();
    doc.add(new StringField("id", hotel.getId(), Field.Store.YES));
    doc.add(new StringField("name", hotel.getName(), Field.Store.YES));
    doc.add(new StringField("city", hotel.getCity(), Field.Store.YES));
    String fullSearchableText = hotel.getName() + " " + hotel.getCity() + " " + hotel.getDescription();
    doc.add(new TextField("content", fullSearchableText, Field.Store.NO));
    writer.addDocument(doc);
}

Once the indexing is finished, you have to close the index writer, which updates and closes the associated files on the disk. Opening and closing the index writer is time-consuming, so it's not a good idea to do it systematically for each operation in the case of batch updates. For example, here's a method in the Indexer class in our demo that rebuilds the whole index:

public void rebuildIndexes() throws IOException {
    //
    // Erase existing index
    //
    getIndexWriter(true);
    //
    // Index all hotel entries
    //
    Hotel[] hotels = HotelDatabase.getHotels();
    for (Hotel hotel : hotels) {
        indexHotel(hotel);
    }
    //
    // Don't forget to close the index writer when done
    //
    closeIndexWriter();
}

For your reference, here is the complete source code of src/lucene/demo/search/Indexer.java.

package lucene.demo.search;

import java.io.IOException;
import java.io.StringReader;
import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import lucene.demo.business.Hotel;
import lucene.demo.business.HotelDatabase;

public class Indexer {

    /** Creates a new instance of Indexer */
    public Indexer() {
    }

    private IndexWriter indexWriter = null;

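    public IndexWriter getIndexWriter(boolean create) throws IOException {
        if (indexWriter == null) {
            Directory indexDir = FSDirectory.open(new File("index-directory"));
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_2, new StandardAnalyzer());
            // create == true discards any existing index; otherwise append to it (creating it if absent)
            config.setOpenMode(create ? IndexWriterConfig.OpenMode.CREATE
                                      : IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            indexWriter = new IndexWriter(indexDir, config);
        }
        return indexWriter;
    }

    public void closeIndexWriter() throws IOException {
        if (indexWriter != null) {
            indexWriter.close();
            // drop the reference so that a later call opens a fresh writer instead of reusing a closed one
            indexWriter = null;
        }
    }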

    public void indexHotel(Hotel hotel) throws IOException {

        System.out.println("Indexing hotel: " + hotel);
        IndexWriter writer = getIndexWriter(false);
        Document doc = new Document();
        doc.add(new StringField("id", hotel.getId(), Field.Store.YES));
        doc.add(new StringField("name", hotel.getName(), Field.Store.YES));
        doc.add(new StringField("city", hotel.getCity(), Field.Store.YES));
        String fullSearchableText = hotel.getName() + " " + hotel.getCity() + " " + hotel.getDescription();
        doc.add(new TextField("content", fullSearchableText, Field.Store.NO));
        writer.addDocument(doc);
    }

    public void rebuildIndexes() throws IOException {
        //
        // Erase existing index
        //
        getIndexWriter(true);
        //
        // Index all hotel entries
        //
        Hotel[] hotels = HotelDatabase.getHotels();
        for (Hotel hotel : hotels) {
            indexHotel(hotel);
        }
        //
        // Don't forget to close the index writer when done
        //
        closeIndexWriter();
    }
}

2. Text Search Using Lucene Index

Now that we've indexed our data, we can do some searching. In our demo, this part is implemented by the SearchEngine class in src/lucene/demo/search/SearchEngine.java.

In most cases, you need to use two classes to support full-text searching: QueryParser and IndexSearcher. QueryParser parses the user query string and constructs a Lucene Query object, which is passed to IndexSearcher.search() as the input. Based on this Query object and the prebuilt Lucene index, IndexSearcher.search() identifies the matching documents and returns them as a TopDocs object. To get started, look at the following example code.

package lucene.demo.search;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import lucene.demo.business.Hotel;
import lucene.demo.business.HotelDatabase;

public class SearchEngine {
    private IndexSearcher searcher = null;
    private QueryParser parser = null;

    /** Creates a new instance of SearchEngine */
    public SearchEngine() throws IOException {
        searcher = new IndexSearcher(DirectoryReader.open(FSDirectory.open(new File("index-directory"))));
        parser = new QueryParser("content", new StandardAnalyzer());
    }

    public TopDocs performSearch(String queryString, int n)
    throws IOException, ParseException {
        Query query = parser.parse(queryString);
        return searcher.search(query, n);
    }

    public Document getDocument(int docId)
    throws IOException {
        return searcher.doc(docId);
    }
}

Inside the constructor of SearchEngine, we first create an IndexSearcher object using the index in index-directory that we created before. We then create a QueryParser. The first parameter to the QueryParser constructor specifies the default search field, which is the content field in this case. This default field is used if the query string does not specify the search field. The second parameter specifies the Analyzer to be used when the QueryParser parses the user query string.

The class SearchEngine provides a method called performSearch which takes a query string and the maximum number of matching documents that should be returned as the input parameters and returns the list of matching documents as a Lucene TopDocs object. The method takes the query string, parses it using QueryParser and performs search() using IndexSearcher.

Important Note: There is a very common mistake that people make, so it is worth mentioning here. When you use Lucene, you have to specify the Analyzer twice: once when you create an IndexWriter object (for index construction) and once more when you create a QueryParser (for query parsing). It is extremely important that you use the same analyzer for both. In our example, since we created the IndexWriter using StandardAnalyzer, we also pass StandardAnalyzer to QueryParser. If the two analyzers differ, the terms produced from the query may not match the terms stored in the index, and searches can miss documents that should match.
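The effect of mismatched analyzers can be reproduced with a few lines of plain Java. The sketch below is a toy model and does not use Lucene: the class AnalyzerMismatch, its methods, and the sample text are hypothetical, and the two "analyzers" differ only in whether they lowercase the text.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Toy demonstration; hypothetical names, not Lucene classes.
class AnalyzerMismatch {

    /** "Analyzer" 1: lowercases the text, then splits on whitespace. */
    static List<String> lowercasing(String text) {
        return split(text.toLowerCase(Locale.ROOT));
    }

    /** "Analyzer" 2: splits on whitespace and keeps the original case. */
    static List<String> caseSensitive(String text) {
        return split(text);
    }

    private static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Builds the set of indexed terms for one piece of text. */
    static Set<String> index(String text, Function<String, List<String>> analyzer) {
        return new HashSet<>(analyzer.apply(text));
    }

    /** True if at least one analyzed query term is among the indexed terms. */
    static boolean matches(Set<String> indexedTerms, String query,
                           Function<String, List<String>> analyzer) {
        for (String term : analyzer.apply(query)) {
            if (indexedTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The index is built with the lowercasing analyzer, so it holds "grand", "hotel", "paris".
        Set<String> indexed = index("Grand Hotel Paris", AnalyzerMismatch::lowercasing);

        // Same analyzer at query time: "Paris" becomes "paris" and matches.
        System.out.println(matches(indexed, "Paris", AnalyzerMismatch::lowercasing));   // prints true
        // Different analyzer at query time: "Paris" stays "Paris" and finds nothing.
        System.out.println(matches(indexed, "Paris", AnalyzerMismatch::caseSensitive)); // prints false
    }
}
```

Nothing fails loudly in the mismatched case; the query simply returns no hit, which is why this mistake is easy to overlook.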

The last method of the SearchEngine class, getDocument, takes the document ID that Lucene assigned to a document (the doc value of a search hit, not our own "id" field) and returns the corresponding Document object from the index. This method is used to retrieve a particular matching document from the index.

Now we briefly explain the syntax of the user's query string.

2.1 Query Syntax

In the simplest form, the query string can be a simple list of keywords like Mariott Hotel. This query will return the documents that contain either Mariott or Hotel in the default field (i.e., the content field in our example). If you want to search for documents that contain both keywords, the query should be Mariott AND Hotel. Note that the boolean operator AND must be written in ALL CAPS.

The general syntax for a query string is as follows: A query is a series of clauses. A clause may be prefixed by:

a plus (+) or a minus (-) sign, indicating that the clause is required or prohibited respectively; or
a field name followed by a colon, indicating the search field. This enables one to construct a query on multiple search fields.

A clause may be either:

a keyword, indicating all the documents that contain this keyword; or
a nested query, enclosed in parentheses.

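For example, the following query string will search for "Mariott" in the name field or "Comfortable" in the description field (this assumes an index that has both fields; the demo index built above has no separate description field):

name:Mariott OR description:Comfortable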

The following query will search for a hotel that contains both the words "Mariott" and "Resort" in the name field:

name:(+Mariott +Resort)

More examples of query strings can be found in the query syntax documentation.
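The meaning of the plus and minus prefixes can be modeled with a short plain-Java sketch. This is a simplified illustration, not Lucene's QueryParser: the class ToyBooleanQuery is hypothetical, it handles only space-separated single keywords (no field names, parentheses, or AND/OR operators), and it assumes the document's terms are already lowercased.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy evaluator for +required / -prohibited / optional keywords; not Lucene code.
class ToyBooleanQuery {

    /**
     * A document matches if it contains every +term and no -term.
     * If the query has no +term, at least one plain (optional) term must be present.
     */
    static boolean matches(String query, Set<String> docTerms) {
        boolean hasRequired = false;
        boolean optionalHit = false;
        for (String clause : query.trim().split("\\s+")) {
            if (clause.isEmpty()) {
                continue;
            }
            if (clause.startsWith("+")) {
                hasRequired = true;
                if (!docTerms.contains(normalize(clause.substring(1)))) {
                    return false; // a required term is missing
                }
            } else if (clause.startsWith("-")) {
                if (docTerms.contains(normalize(clause.substring(1)))) {
                    return false; // a prohibited term is present
                }
            } else if (docTerms.contains(normalize(clause))) {
                optionalHit = true;
            }
        }
        return hasRequired || optionalHit;
    }

    private static String normalize(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("mariott", "resort", "beach"));
        System.out.println(matches("+Mariott +Resort", doc)); // prints true
        System.out.println(matches("Mariott Spa", doc));      // prints true
        System.out.println(matches("+Mariott -Beach", doc));  // prints false
        System.out.println(matches("-Beach", doc));           // prints false
    }
}
```

The last case shows a design choice of this sketch: a query made only of prohibited terms selects nothing, since no clause positively identifies a document.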

2.2 Retrieving Matching Documents

The search() method of the Lucene IndexSearcher object returns the matching document information as a Lucene TopDocs object. This object contains an array of ScoreDoc objects in the scoreDocs field, each of which has a doc field (the ID that Lucene assigned to the matching document) and a score field (the document's relevance score). More precisely, from the TopDocs object you can obtain the matching Document objects as follows:

// instantiate the search engine
SearchEngine se = new SearchEngine();

// retrieve top 100 matching document list for the query "Notre Dame museum"
TopDocs topDocs = se.performSearch("Notre Dame museum", 100); 

// obtain the ScoreDoc (= documentID, relevanceScore) array from topDocs
ScoreDoc[] hits = topDocs.scoreDocs;

// retrieve each matching document from the ScoreDoc array
for (int i = 0; i < hits.length; i++) {
    Document doc = se.getDocument(hits[i].doc);
    String hotelName = doc.get("name");
   ...
}

As in this example, once you obtain the Document object from the index, you can use the get() method to fetch field values that have been stored during indexing.
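The shape of a top-n result list, a set of (document ID, score) pairs ordered by relevance and cut off at n, can be sketched in plain Java. This is only a model of the idea: ToyTopDocs and Hit are hypothetical names, the scores are made up, and Lucene's actual scoring and collection logic is not reproduced.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a ranked result list; hypothetical names, not Lucene classes.
class ToyTopDocs {

    /** One search hit: a document id and its relevance score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Returns at most n hits, highest score first (ties broken by lower document id). */
    static List<Hit> top(Map<Integer, Float> scores, int n) {
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<Integer, Float> entry : scores.entrySet()) {
            hits.add(new Hit(entry.getKey(), entry.getValue()));
        }
        hits.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
        });
        int limit = Math.min(Math.max(n, 0), hits.size());
        return new ArrayList<>(hits.subList(0, limit));
    }

    public static void main(String[] args) {
        Map<Integer, Float> scores = new HashMap<>();
        scores.put(0, 0.2f);
        scores.put(1, 0.9f);
        scores.put(2, 0.5f);

        // Keep the two best hits: document 1 first, then document 2.
        for (Hit hit : top(scores, 2)) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

As in the loop over hits above, a caller would use each hit's document id to fetch the stored fields and use the score only for ordering or display.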

Now read the src/lucene/demo/Main.java file to see how it builds, searches, and retrieves documents from a Lucene index.

Notes on CLASSPATH

In order to use Lucene, you need the lucene-*.jar library files, which are available in the /usr/share/java directory of our VM. Since these are third-party jar files that are not part of the standard Java Runtime Environment, the Java compiler and runtime engine are NOT aware of them and may generate a "class not found" error when you try to compile and run your code. To avoid this error, do one of the following:

Your ant script must pass the jar file as the classpath parameter during compilation and runtime. The included build.xml file in lucene-tutorial.zip does this automatically for the two targets "compile" and "run".

If you run the javac and java commands directly from a shell, pass the locations of the libraries (separated by :) using the -classpath option. A quoted entry ending in /* is expanded by Java itself to all jar files in that directory (a pattern such as *.jar is not recognized), like
```
javac -classpath ".:/usr/share/java/*" YourClass.java
```
and
```
java -classpath ".:/usr/share/java/*" YourClass
```

You can also set your CLASSPATH environment variable to include the library files. This method is strongly discouraged, but it still works.

Summary and References

There is much more to Lucene than is described here. In fact, we barely scratched the surface. However, this example does show how easy it is to implement full-text search functions in a Java database application. Try it out, and add some powerful full-text search functions to your web site today!

Lucene web site
Lucene in Action (Manning, 2004), by Erik Hatcher and Otis Gospodnetic

A Short Introduction to Lucene