Apache Lucene is a high-performance, full-featured text search engine library.
Here's a simple example of how to use Lucene for indexing and searching (using JUnit
to check that the results are what we expect):
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    //Directory directory = FSDirectory.open(new File("/tmp/testindex"));
    IndexWriter iwriter = new IndexWriter(directory, analyzer, true,
        new IndexWriter.MaxFieldLength(25000));
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, Field.Store.YES,
        Field.Index.ANALYZED));
    iwriter.addDocument(doc);
    iwriter.close();

    // Now search the index:
    IndexSearcher isearcher = new IndexSearcher(directory, true); // read-only=true
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
    Query query = parser.parse("text");
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
    assertEquals(1, hits.length);
    // Iterate through the results:
    for (int i = 0; i < hits.length; i++) {
      Document hitDoc = isearcher.doc(hits[i].doc);
      assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    isearcher.close();
    directory.close();
The Lucene API is divided into several packages:

- org.apache.lucene.analysis defines an abstract Analyzer API for converting text
  from a java.io.Reader into a TokenStream, an enumeration of token Attributes. A
  TokenStream can be composed by applying TokenFilters to the output of a Tokenizer.
  Tokenizers and TokenFilters are strung together and applied with an Analyzer. A
  handful of Analyzer implementations are provided, including StopAnalyzer and the
  grammar-based StandardAnalyzer.
- org.apache.lucene.document provides a simple Document class. A Document is simply
  a set of named Fields whose values may be strings or instances of java.io.Reader.
- org.apache.lucene.index provides two primary classes: IndexWriter, which creates
  and adds documents to indices; and IndexReader, which accesses the data in the
  index.
- org.apache.lucene.search provides data structures to represent queries (e.g.,
  TermQuery for individual words, PhraseQuery for phrases, and BooleanQuery for
  boolean combinations of queries) and the abstract Searcher, which turns queries
  into TopDocs. IndexSearcher implements search over a single IndexReader.
- org.apache.lucene.queryParser uses JavaCC to implement a QueryParser.
- org.apache.lucene.store defines an abstract class for storing persistent data,
  the Directory, which is a collection of named files written by an IndexOutput and
  read by an IndexInput. Multiple implementations are provided, including
  FSDirectory, which uses a file system directory to store files, and RAMDirectory,
  which implements files as memory-resident data structures.
- org.apache.lucene.util contains a few handy data structures and util classes,
  e.g., BitVector and PriorityQueue.
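As a quick illustration of the analysis package, the TokenStream an Analyzer produces can be inspected directly. The following is a minimal sketch assuming the Lucene 2.9-era attribute API (TokenStream.addAttribute() with TermAttribute); StandardAnalyzer lower-cases its input and removes common English stop words:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class AnalysisDemo {
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
    // Tokenize a string; the field name only matters to per-field analyzers.
    TokenStream stream = analyzer.tokenStream("fieldname",
        new StringReader("This is the text to be indexed."));
    // Attributes are reused across tokens; fetch the term attribute once.
    TermAttribute term = stream.addAttribute(TermAttribute.class);
    List<String> tokens = new ArrayList<String>();
    while (stream.incrementToken()) {
      tokens.add(term.term());
    }
    stream.close();
    // Stop words such as "this", "is", "the" have been removed.
    System.out.println(tokens);
  }
}
```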
To use Lucene, an application should:

- Create Documents by adding Fields;
- Create an IndexWriter and add documents to it with addDocument();
- Call QueryParser.parse() to build a query from a string; and
- Create an IndexSearcher and pass the query to its search() method.
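Note that queries need not come from QueryParser: the classes in org.apache.lucene.search can also be combined programmatically. A minimal sketch assuming the same 2.9-era API as the example above (the "contents" field name is arbitrary, chosen to match the demo below):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class QueryDemo {
  public static void main(String[] args) {
    // A phrase: terms must appear adjacently, in order.
    PhraseQuery phrase = new PhraseQuery();
    phrase.add(new Term("contents", "clam"));
    phrase.add(new Term("contents", "chowder"));
    // A single word.
    TermQuery word = new TermQuery(new Term("contents", "manhattan"));
    // Both clauses are required (MUST), i.e. a boolean AND.
    BooleanQuery query = new BooleanQuery();
    query.add(phrase, BooleanClause.Occur.MUST);
    query.add(word, BooleanClause.Occur.MUST);
    System.out.println(query.toString("contents"));
  }
}
```

This builds the equivalent of the query string +"clam chowder" +manhattan, which could then be passed to IndexSearcher.search() like any parsed query.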
The IndexFiles and SearchFiles demo classes are simple examples of code which does
this. To demonstrate them, try something like:
> java -cp lucene.jar:lucene-demo.jar org.apache.lucene.demo.IndexFiles rec.food.recipes/soups
adding rec.food.recipes/soups/abalone-chowder
[ ... ]
> java -cp lucene.jar:lucene-demo.jar org.apache.lucene.demo.SearchFiles
Query: chowder
Searching for: chowder
34 total matching documents
1. rec.food.recipes/soups/spam-chowder
[ ... thirty-four documents contain the word "chowder" ... ]
Query: "clam chowder" AND Manhattan
Searching for: +"clam chowder" +manhattan
2 total matching documents
1. rec.food.recipes/soups/clam-chowder
[ ... two documents contain the phrase "clam chowder"
and the word "manhattan" ... ]
[ Note: "+" and "-" are canonical, but "AND", "OR"
and "NOT" may be used. ]
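To see that the "AND" form and the canonical "+" form are interchangeable, both can be run through QueryParser and compared. A small sketch, assuming the 2.9-era QueryParser constructor that takes a Version:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class SyntaxDemo {
  public static void main(String[] args) throws Exception {
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
        new StandardAnalyzer(Version.LUCENE_CURRENT));
    // The "AND" keyword form, as typed in the demo above:
    Query q1 = parser.parse("\"clam chowder\" AND Manhattan");
    // The canonical "+" form:
    Query q2 = parser.parse("+\"clam chowder\" +Manhattan");
    // Both mark every clause as required; the analyzer lower-cases terms.
    System.out.println(q1.toString("contents"));
    System.out.println(q2.toString("contents"));
  }
}
```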
The IndexHTML demo is more sophisticated.
It incrementally maintains an index of HTML files, adding new files as
they appear, deleting old files as they disappear and re-indexing files
as they change.
> java -cp lucene.jar:lucene-demo.jar org.apache.lucene.demo.IndexHTML -create java/jdk1.1.6/docs/relnotes
adding java/jdk1.1.6/docs/relnotes/SMICopyright.html
[ ... create an index containing all the relnotes ]
> rm java/jdk1.1.6/docs/relnotes/smicopyright.html
> java -cp lucene.jar:lucene-demo.jar org.apache.lucene.demo.IndexHTML java/jdk1.1.6/docs/relnotes
deleting java/jdk1.1.6/docs/relnotes/SMICopyright.html