Lucene is an extremely rich and powerful full-text search library written in Java. You can use Lucene to provide full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, and so on). In this tutorial, we'll go through the basics of using Lucene to add full-text search functionality to a fairly typical J2EE application: an online accommodation database. The main business object is the Hotel class. In this tutorial, a Hotel has a unique identifier, a name, a city, and a description.
Roughly, supporting full-text search using Lucene requires two steps: (1) creating a Lucene index on the documents and/or database objects and (2) parsing the user query and looking it up in the prebuilt index to answer the query. In the first part of this tutorial, we learn how to create a Lucene index. In the second part, we learn how to use the prebuilt index to answer user queries.
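To make the two steps concrete before turning to the Lucene API, here is a minimal plain-Java sketch of the idea. It is not how Lucene is implemented; the class ToyInvertedIndex and its methods are hypothetical names invented for this illustration, and the "index" is just a map from each word to the IDs of the documents that contain it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy inverted index: NOT Lucene's implementation, only the two-step idea. */
class ToyInvertedIndex {
    // term -> sorted set of IDs of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Step 1: index a document by recording every lowercased word in it. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Step 2: answer a one-word query with a single lookup in the prebuilt map. */
    List<Integer> search(String word) {
        TreeSet<Integer> ids = postings.get(word.toLowerCase(Locale.ROOT));
        List<Integer> result = new ArrayList<>();
        if (ids != null) {
            result.addAll(ids);
        }
        return result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "A beautiful hotel near the river");
        index.add(2, "Budget hostel in the city centre");
        index.add(3, "Quiet hotel with a garden");
        System.out.println(index.search("hotel"));   // prints [1, 3]
        System.out.println(index.search("castle"));  // prints []
    }
}
```

The point of the sketch is only that the expensive work (reading and splitting every document) happens once, at indexing time, so that each query becomes a cheap lookup.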
For your convenience, all of the code for this article's Lucene demo is included in the lucene-tutorial.zip file. In this demo, the class Indexer in src/lucene/demo/search/Indexer.java is responsible for creating the index. The class SearchEngine in src/lucene/demo/search/SearchEngine.java is responsible for supporting user queries. The class Main in src/lucene/demo/Main.java contains test code that builds a Lucene index using a small dataset (the actual data is provided by the HotelDatabase class stored in src/lucene/demo/business/HotelDatabase.java) and performs a simple keyword query on the data using the index. Briefly go over the two Java source files, Indexer.java and SearchEngine.java, to familiarize yourself with the overall structure of the code.
The first step in implementing full-text searching with Lucene is to build an index. Here's a simple attempt to diagram how the Lucene classes go together when you create an index:
[Diagram: an Index holds Documents; each Document holds Fields, and each Field is a name/value pair.]
At the heart of Lucene is an Index. You pump your data into the Index, then do searches on the Index to get results out. Document objects are stored in the Index, and it is your job to "convert" your data into Document objects and store them to the Index. That is, you read in each data file (or Web document, database tuple or whatever), instantiate a Document for it, break down the data into chunks and store the chunks in the Document as Field objects (a name/value pair). When you're done building a Document, you write it to the Index using the IndexWriter. Now let us get into details on how this is done.
To create an index, the first thing you need to do is create an IndexWriter object. The IndexWriter object is used to create the index and to add new index entries (i.e., Documents) to this index. You can create an IndexWriter as follows:
    Directory indexDir = FSDirectory.open(new File("index-directory"));
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_2, new StandardAnalyzer());
    IndexWriter indexWriter = new IndexWriter(indexDir, config);

Note that IndexWriter takes two parameters, indexDir and config, which are Directory and IndexWriterConfig objects, respectively. The first parameter, indexDir, specifies the directory in which the Lucene index will be created, which is index-directory in this case. The second parameter specifies the "configuration" of our index, which consists of the version of our Lucene library (4.10.2) and the "document analyzer" to be used when Lucene indexes your data. Here, we are using the StandardAnalyzer for this purpose. More details on Lucene analyzers follow shortly.
| Analyzer | Description |
|---|---|
| StandardAnalyzer | A sophisticated general-purpose analyzer. |
| WhitespaceAnalyzer | A very simple analyzer that just separates tokens using white space. |
| StopAnalyzer | Removes common English words that are not usually useful for indexing. |
| SnowballAnalyzer | An interesting experimental analyzer that works on word roots (a search on rain should also return entries with raining, rained, and so on). |
There are even a number of language-specific analyzers, including analyzers for German, Russian, French, Dutch, and others.
It isn't difficult to implement your own analyzer, though the standard ones often do the job well enough. When you create an IndexWriter, you have to specify which Analyzer you will use for the index as we did before. In our previous example, we used the StandardAnalyzer as the document analyzer.
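To get a feel for why the choice of analyzer matters, the following plain-Java sketch compares two toy tokenizers on the same text. These are hypothetical stand-ins written for this illustration (the class ToyAnalyzers, its methods, and its tiny stop-word list are all made up), not the Lucene analyzers themselves, whose actual rules are more elaborate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy stand-ins for analyzers; these are NOT the Lucene classes. */
class ToyAnalyzers {
    // A tiny, made-up stop-word list for the demo.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "of", "and"));

    /** Splits on whitespace only; case and punctuation are kept. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Splits on non-letters, lowercases, and drops stop words. */
    static List<String> lowercaseStop(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "A beautiful Hotel in the city.";
        System.out.println(whitespace(text));     // prints [A, beautiful, Hotel, in, the, city.]
        System.out.println(lowercaseStop(text));  // prints [beautiful, hotel, city]
    }
}
```

The two methods produce different terms from the same input, and it is these terms, not the original text, that end up in the index.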
Now you need to index your documents or business objects. To index an object, you use the Lucene Document class, to which you add the fields that you want indexed. As we briefly mentioned before, a Lucene Document is basically a container for a set of indexed fields. This is best illustrated by an example:
    Document doc = new Document();
    doc.add(new StringField("id", "Hotel-1345", Field.Store.YES));
    doc.add(new TextField("description", "A beautiful hotel", Field.Store.YES));
In the above example, we add two fields, "id" and "description", with the respective values "Hotel-1345" and "A beautiful hotel" to the document.
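More precisely, to add a field to a document, you create a new instance of the Field class, which can be either a StringField or a TextField (the difference between the two will be explained shortly). A field object takes the following three parameters: the field name, the field value, and a Field.Store flag (Field.Store.YES or Field.Store.NO) that says whether the original value is stored in the index so that it can be retrieved with the search results.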
StringField vs TextField: In the above example, the "id" field contains the ID of the hotel, which is a single atomic value. In contrast, the "description" field contains English text, which should be parsed (or "tokenized") into a set of words for indexing. Use StringField for a field with an atomic value that should not be tokenized. Use TextField for a field that needs to be tokenized into a set of words.
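For our hotel example, we just want some fairly simple full-text searching. So we add the following fields: id, name, and city, each as a StringField stored with Field.Store.YES so that its value can be shown in the query results, and content, a TextField that combines the hotel's name, city, and description.
We want full-text indexing on the content field. We don't need to display the indexed text in the query results, so we use Field.Store.NO to save index space.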
Here's the method in the Indexer class in our demo that indexes a given hotel:
    public void indexHotel(Hotel hotel) throws IOException {
        IndexWriter writer = getIndexWriter(false);
        Document doc = new Document();
        doc.add(new StringField("id", hotel.getId(), Field.Store.YES));
        doc.add(new StringField("name", hotel.getName(), Field.Store.YES));
        doc.add(new StringField("city", hotel.getCity(), Field.Store.YES));
        String fullSearchableText = hotel.getName() + " " + hotel.getCity() + " " + hotel.getDescription();
        doc.add(new TextField("content", fullSearchableText, Field.Store.NO));
        writer.addDocument(doc);
    }
Once the indexing is finished, you have to close the index writer, which updates and closes the associated files on the disk. Opening and closing the index writer is time-consuming, so it's not a good idea to do it systematically for each operation in the case of batch updates. For example, here's a method in the Indexer class in our demo that rebuilds the whole index:
    public void rebuildIndexes() throws IOException {
        //
        // Erase existing index
        //
        getIndexWriter(true);
        //
        // Index all hotel entries
        //
        Hotel[] hotels = HotelDatabase.getHotels();
        for (Hotel hotel : hotels) {
            indexHotel(hotel);
        }
        //
        // Don't forget to close the index writer when done
        //
        closeIndexWriter();
    }

For your reference, here is the complete source code of src/lucene/demo/search/Indexer.java.
    package lucene.demo.search;

    import java.io.IOException;
    import java.io.StringReader;
    import java.io.File;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    import lucene.demo.business.Hotel;
    import lucene.demo.business.HotelDatabase;

    public class Indexer {

        /** Creates a new instance of Indexer */
        public Indexer() {
        }

        private IndexWriter indexWriter = null;

        public IndexWriter getIndexWriter(boolean create) throws IOException {
            if (indexWriter == null) {
                Directory indexDir = FSDirectory.open(new File("index-directory"));
                IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_2, new StandardAnalyzer());
                // Honor the create flag: CREATE erases any existing index,
                // CREATE_OR_APPEND adds to it. The flag only takes effect
                // when the writer is first opened.
                config.setOpenMode(create ? OpenMode.CREATE : OpenMode.CREATE_OR_APPEND);
                indexWriter = new IndexWriter(indexDir, config);
            }
            return indexWriter;
        }

        public void closeIndexWriter() throws IOException {
            if (indexWriter != null) {
                indexWriter.close();
                // Reset so that a later call opens a fresh writer
                indexWriter = null;
            }
        }

        public void indexHotel(Hotel hotel) throws IOException {
            System.out.println("Indexing hotel: " + hotel);
            IndexWriter writer = getIndexWriter(false);
            Document doc = new Document();
            doc.add(new StringField("id", hotel.getId(), Field.Store.YES));
            doc.add(new StringField("name", hotel.getName(), Field.Store.YES));
            doc.add(new StringField("city", hotel.getCity(), Field.Store.YES));
            String fullSearchableText = hotel.getName() + " " + hotel.getCity() + " " + hotel.getDescription();
            doc.add(new TextField("content", fullSearchableText, Field.Store.NO));
            writer.addDocument(doc);
        }

        public void rebuildIndexes() throws IOException {
            //
            // Erase existing index
            //
            getIndexWriter(true);
            //
            // Index all Accommodation entries
            //
            Hotel[] hotels = HotelDatabase.getHotels();
            for (Hotel hotel : hotels) {
                indexHotel(hotel);
            }
            //
            // Don't forget to close the index writer when done
            //
            closeIndexWriter();
        }
    }
Now that we've indexed our data, we can do some searching. In our demo, this part is implemented by the SearchEngine class in src/lucene/demo/search/SearchEngine.java.
In most cases, you need to use two classes to support full-text searching: QueryParser and IndexSearcher. QueryParser parses the user query string and constructs a Lucene Query object, which is passed on to IndexSearcher.search() as the input. Based on this Query object and the prebuilt Lucene index, IndexSearcher.search() identifies the matching documents and returns them as a TopDocs object. To get started, look at the following example code.
    package lucene.demo.search;

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    import lucene.demo.business.Hotel;
    import lucene.demo.business.HotelDatabase;

    public class SearchEngine {
        private IndexSearcher searcher = null;
        private QueryParser parser = null;

        /** Creates a new instance of SearchEngine */
        public SearchEngine() throws IOException {
            searcher = new IndexSearcher(DirectoryReader.open(FSDirectory.open(new File("index-directory"))));
            parser = new QueryParser("content", new StandardAnalyzer());
        }

        public TopDocs performSearch(String queryString, int n) throws IOException, ParseException {
            Query query = parser.parse(queryString);
            return searcher.search(query, n);
        }

        public Document getDocument(int docId) throws IOException {
            return searcher.doc(docId);
        }
    }

Inside the constructor of SearchEngine, we first create an IndexSearcher object using the index in index-directory that we created before. We then create a QueryParser. The first parameter to the QueryParser constructor specifies the default search field, which is the content field in this case. This default field is used if the query string does not specify the search field. The second parameter specifies the Analyzer to be used when the QueryParser parses the user query string.
The class SearchEngine provides a method called performSearch which takes a query string and the maximum number of matching documents that should be returned as the input parameters and returns the list of matching documents as a Lucene TopDocs object. The method takes the query string, parses it using QueryParser and performs search() using IndexSearcher.
Important Note: There's a very common mistake that people often make, so I have to mention it here. When you use Lucene, you have to specify the Analyzer twice, once when you create an IndexWriter object (for index construction) and once more when you create a QueryParser (for query parsing). Please note that it is extremely important that you use the same analyzer for both. In our example, since we created IndexWriter using StandardAnalyzer before, we are also passing StandardAnalyzer to QueryParser. Otherwise, the terms produced from the query may not match the terms stored in the index, and searches can silently return no results or unexpected ones.
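The following plain-Java sketch shows the kind of failure an analyzer mismatch causes. It does not use Lucene; the class AnalyzerMismatchDemo and the two "analyzers" LOWERCASING and CASE_PRESERVING are hypothetical, simplified stand-ins invented for this illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

/** Toy demo of an analyzer mismatch; these are NOT Lucene classes. */
class AnalyzerMismatchDemo {
    // Two hypothetical "analyzers": both split on whitespace, one also lowercases.
    static final Function<String, String[]> LOWERCASING =
            text -> text.toLowerCase(Locale.ROOT).trim().split("\\s+");
    static final Function<String, String[]> CASE_PRESERVING =
            text -> text.trim().split("\\s+");

    private final Set<String> indexedTerms = new HashSet<>();
    private final Function<String, String[]> indexAnalyzer;

    AnalyzerMismatchDemo(Function<String, String[]> indexAnalyzer) {
        this.indexAnalyzer = indexAnalyzer;
    }

    /** Indexing: store the terms that the index-time analyzer produces. */
    void add(String text) {
        for (String term : indexAnalyzer.apply(text)) {
            indexedTerms.add(term);
        }
    }

    /** Searching: true if any term produced by the query-time analyzer is indexed. */
    boolean matches(String query, Function<String, String[]> queryAnalyzer) {
        for (String term : queryAnalyzer.apply(query)) {
            if (indexedTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AnalyzerMismatchDemo index = new AnalyzerMismatchDemo(LOWERCASING);
        index.add("A beautiful Hotel");
        System.out.println(index.matches("Hotel", LOWERCASING));      // prints true
        System.out.println(index.matches("Hotel", CASE_PRESERVING));  // prints false
    }
}
```

With the mismatched analyzer, the query term "Hotel" is compared against the indexed term "hotel" and nothing matches, even though no error is reported.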
The last method getDocument of the SearchEngine class takes the unique ID of a document and returns the corresponding Document object from the index. This method is used to retrieve a particular matching document from the index.
Now we briefly explain the syntax of the user's query string.
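The general syntax for a query string is as follows: A query is a series of clauses. A clause may be prefixed by a plus (+) or minus (-) sign, indicating that the clause is required or prohibited, respectively, or by a field name followed by a colon, indicating the field to be searched. For example, the following query searches for hotels that contain "Mariott" in the name field or "Comfortable" in the description field:

    name:Mariott OR description:Comfortable

The following query will search for a hotel that contains both the words "Mariott" and "Resort" in the name field:

    name:(+Mariott +Resort)

More examples of query strings can be found in the query syntax documentation.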
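In Lucene's query syntax, a + prefix marks a clause as required and a - prefix marks it as prohibited; clauses without a prefix are optional. The plain-Java sketch below imitates just that rule on a set of already-tokenized terms. It is a simplified illustration under stated assumptions: the class ToyQueryClauses is a hypothetical name, and field prefixes, grouping, the OR keyword, and scoring are all left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy evaluator for +required / -prohibited / optional clauses; NOT Lucene's parser. */
class ToyQueryClauses {

    /**
     * Returns true if the document's terms satisfy the query:
     * every +term is present, no -term is present, and, when the query
     * has no +term at all, at least one unprefixed term is present.
     */
    static boolean matches(String query, Set<String> docTerms) {
        boolean sawRequired = false;
        boolean optionalHit = false;
        for (String clause : query.trim().split("\\s+")) {
            if (clause.startsWith("+")) {
                sawRequired = true;
                if (!docTerms.contains(clause.substring(1))) {
                    return false;  // a required term is missing
                }
            } else if (clause.startsWith("-")) {
                if (docTerms.contains(clause.substring(1))) {
                    return false;  // a prohibited term is present
                }
            } else if (docTerms.contains(clause)) {
                optionalHit = true;
            }
        }
        return sawRequired || optionalHit;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("mariott", "resort", "paris"));
        System.out.println(matches("+mariott +resort", doc));  // prints true
        System.out.println(matches("+mariott -paris", doc));   // prints false
        System.out.println(matches("hilton paris", doc));      // prints true
    }
}
```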
The search() method of the Lucene IndexSearcher object returns the matching documents as a Lucene TopDocs object. This object contains an array of ScoreDoc objects in its scoreDocs field. Each ScoreDoc, in turn, has a doc field (the unique document ID of the matching document) and a score field (the document's relevance score). More precisely, from the TopDocs object you can obtain the matching Document objects as follows:
    // instantiate the search engine
    SearchEngine se = new SearchEngine();

    // retrieve the top 100 matching documents for the query "Notre Dame museum"
    TopDocs topDocs = se.performSearch("Notre Dame museum", 100);

    // obtain the ScoreDoc (= documentID, relevanceScore) array from topDocs
    ScoreDoc[] hits = topDocs.scoreDocs;

    // retrieve each matching document from the ScoreDoc array
    for (int i = 0; i < hits.length; i++) {
        Document doc = se.getDocument(hits[i].doc);
        String hotelName = doc.get("name");
        ...
    }
As in this example, once you obtain the Document object from the index, you can use the get() method to fetch field values that have been stored during indexing.
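The shape of the result (a list of document IDs paired with scores, best first, cut off at n) can be imitated in plain Java as follows. This is only a sketch: the class ToyTopHits and its Hit type are hypothetical, and the score here is a simple count of how often the word occurs, which is an assumption made for the demo and not Lucene's relevance formula.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Toy top-n search; NOT Lucene's scoring, only the shape of the result. */
class ToyTopHits {

    /** Mirrors the two ScoreDoc fields described above: document ID and score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /**
     * Scores each document by how many times it contains the word and
     * returns at most n matches, highest score first. A document's ID is
     * its position in the docs array.
     */
    static List<Hit> search(String[] docs, String word, int n) {
        String needle = word.toLowerCase(Locale.ROOT);
        List<Hit> hits = new ArrayList<>();
        for (int id = 0; id < docs.length; id++) {
            int count = 0;
            for (String term : docs[id].toLowerCase(Locale.ROOT).split("\\W+")) {
                if (term.equals(needle)) {
                    count++;
                }
            }
            if (count > 0) {
                hits.add(new Hit(id, count));
            }
        }
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(hits.subList(0, Math.min(n, hits.size())));
    }

    public static void main(String[] args) {
        String[] docs = {
            "Hotel near museum",
            "Museum hotel, the museum district",
            "Beach resort"
        };
        for (Hit hit : search(docs, "museum", 2)) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
        // prints:
        // doc=1 score=2.0
        // doc=0 score=1.0
    }
}
```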
Now read the src/lucene/demo/Main.java file to see how it builds, searches, and retrieves documents from a Lucene index.
Notes on CLASSPATH
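In order to use Lucene, you need the lucene-*.jar library files, which are available in the /usr/share/java directory of our VM. Since these are third-party jar files that are not part of the standard Java runtime environment, the Java compiler and runtime engine are NOT aware of them and may report a "class not found" error when you try to compile and run your code. To avoid this error, make sure the jar files are on the classpath, either by listing them in the CLASSPATH environment variable or by passing the -classpath option to both commands:

    javac -classpath ".:/usr/share/java/*" YourClass.java

and

    java -classpath ".:/usr/share/java/*" YourClass

Note that the Java classpath wildcard is a bare *, which matches every jar file in the directory; a pattern such as *.jar is not expanded by Java. The quotes keep the shell from expanding the wildcard itself.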
There is much more to Lucene than is described here. In fact, we barely scratched the surface. However, this example does show how easy it is to implement full-text search functions in a Java database application. Try it out, and add some powerful full-text search functions to your web site today!