Hit Highlighting with Lucene

by Hugo Burm

Introduction

Once you have your search engine up and running, one of the first things your customer will ask for is "hit highlighting" of the keywords that started the search in the search results.

This document explains how I created a simple Cocoon 2.x Transformer that adds highlighting of these keywords in the search result. The task of the Transformer is to find the keywords in the character stream and surround them with highlight tags. The final XSLT transformation should transform these tags into appropriate highlighting for the serialized document (e.g. transform the tags to <b> tags if you are serializing to HTML).

People reading this document should have some basic knowledge about Cocoon, XML, and search engines.

This document is based on the work of Maik Schreiber (http://www.iq-computing.de).

History of the Project

  1. I used Cocoon 1.1.8 to publish 100000+ xml documents on a website. I used sf_freeWais as a search engine.
  2. I upgraded Cocoon to 2.0.2 and replaced the search engine with Lucene. I created my own Lucene document that matches my XML documents and I am bypassing Cocoon when creating the index. This environment is explained in section "Searching with Lucene" below. You must have a working system before you try to add hit highlighting.
  3. I added hit highlighting. See section "Hit Highlighting" below.

You can find the examples here: lucene.zip

Searching with Lucene

This section describes how to build a system with Cocoon and Lucene. It is brief because the Lucene and Cocoon documentation already give you a lot of information about how to build a system like this.

  1. Build a Lucene index
    This step has nothing to do with Cocoon. Only Lucene is involved. You have to build a Lucene index. First, you must create a Lucene document class that matches the XML documents that you want to index. See the Lucene documentation and tutorials. In my case, I used an XML parser and stored the title plus part of the document (the first 100 words of the body) in the Lucene index, so that the hitlist can show the title and the first lines of each document. I have included the Java classes I am using and an example XML document in the directory index_example.
  2. Perform the search and show the hits
    Build a form in Cocoon asking the user to enter one or more keywords. I created a logicsheet (lucene_logicsheet.xsl) and a very simple helper class in Java (LuceneHelper.java) that passes the query to Lucene and receives the hits in a Lucene hitlist vector. From this vector you build a hitlist page. The hitlist in the example shows ten titles, each with a URL pointing to the XML document.
  3. Show the target
    When a user clicks on one of the URLs (titles), open the URL and display the XML document. The examples search.xml and search.xsl perform the search and display the hitlist. You can enter the search term as a GET parameter: search.xml?query=words_to_look_for.
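
The summary stored in step 1 (the first 100 words of the body) can be produced with a few lines of plain Java. The following is a minimal sketch, not the code from index_example: the class name HitSummary and the method firstWords are hypothetical, and it assumes that words are simply separated by whitespace.

```java
import java.util.Arrays;

class HitSummary {
    /**
     * Returns at most maxWords whitespace-separated words of body,
     * joined by single spaces. Hypothetical helper for illustration.
     */
    public static String firstWords(String body, int maxWords) {
        String trimmed = body.trim();
        if (trimmed.isEmpty() || maxWords <= 0) {
            return "";
        }
        String[] words = trimmed.split("\\s+");
        int n = Math.min(maxWords, words.length);
        return String.join(" ", Arrays.copyOfRange(words, 0, n));
    }

    public static void main(String[] args) {
        String body = "Lucene  is a search\nengine library written in Java";
        // Keep only the first four words for the hitlist summary.
        System.out.println(firstWords(body, 4));
    }
}
```

The resulting string is what you would store as a separate field in the Lucene document, next to the title.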

Hit Highlighting

This section describes the modifications to the steps listed above that are needed to add hit highlighting. Before you enter this section you should have a working system that can do a search and display the results.

  1. No modifications to the step "Build a Lucene index"
  2. To the step "Perform the search and show the hits": you have to add
    ?query=list_of_keywords
    to the URLs in the hitlist. List_of_keywords is the phrase the user entered when starting the search.
  3. Show the target. This is the real work.

    Download and compile Lucene
    • Download the Lucene source. You need at least RC4.
    • Download JavaCC. Visit the Webgain site (http://www.webgain.com/download/javacc/details.html), start your personal random generator to supply all your personal details, and download JavaCC. The jar file you are looking for is probably hidden in the jar file you just downloaded. It is a jar in a jar. This one took me half a day to find out....
    • Download Ant (http://jakarta.apache.org). You also need the optional jar. Install it.
    • Compile Lucene. Now you should have the Lucene jar without any modifications. If you encounter problems, have a look at the Lucene FAQ and mailing list.

    Download the document about Lucene hit highlighting by Maik Schreiber from http://www.iq-computing.de.
    • Install the Java classes (Lucene-tools) that come with the document in /cocoon/WEB-INF/classes
    • Apply patches described in the document. These are very simple modifications to the existing Lucene source code. You have to change a few class methods from private to public and add a few "get" methods.
    • Compile Lucene again and move the jar you just created to /cocoon/WEB-INF/lib. I renamed the new Lucene jar to the name of the jar that is already present in this lib directory (lucene-1.2-rc2.jar at the time of writing), so that it replaces the old one.
  4. Some more work to do.
    • Compile the Highlight Transformer java class

      Two points of interest:

      • In the function setup, I retrieve the query (keywords):

            final Request request
                = ObjectModelHelper.getRequest(objectModel);
            query = QueryParser.parse(request.getParameter("query"),
                "text", myAnalyzer);
        

        The variables query and myAnalyzer are Lucene classes defined as class variables:

            Query query;
            StopAnalyzer myAnalyzer = new StopAnalyzer();
        

      • In the function characters, I insert the highlight tags. First look at the comment near the top of this function:

            /* 
                The code below is taken from LuceneTools.highlightTerms
                from the "de.iqcomputing.lucene" package by Maik Schreiber.
                The HTMLTermHighlighter which implements TermHighlighter
                is NOT used, because a <b> added to the string would be
                escaped to &lt;b&gt; in the output. So we have to insert
                our own SAX events...
            */
        

        Then, examine the piece of code that is modified:

                // does query contain current token?
                if (terms.contains(token.termText())) {          
                  super.contentHandler.characters(
                      newText.toString().toCharArray(), 0, newText.length());
                  newText.setLength(0);
                  
                  AttributesImpl attr = new AttributesImpl();
                  super.contentHandler.startElement(
                      "", emtag, emtag, attr);
                  super.contentHandler.characters(
                      tokenText.toCharArray(), 0, tokenText.length());
                  super.contentHandler.endElement("", emtag, emtag);
        
                }
                else {
                  newText.append(tokenText);
                }
        

        The String emtag is initialized with the highlight tag, e.g. "em".

    • Copy the compiled class to /cocoon/WEB-INF/classes
  5. Almost finished.

    Modify the sitemap:

    • Add the following line in the <map:transformers> section:

         <map:transformer
            logger="sitemap.transformer.highlight"
            name="highlight" 
         pool-grow="2" pool-max="16" pool-min="2" 
         src="nl.datagram.cocoon.transformation.HighlightTransformer"/>
      

    • Create the pipeline:

        <map:match pattern="datagram/kha/archief/txt/**.xml"> 
          <map:generate type="file"
              src="datagram/kha/archief/txt/{1}.xml"/> 
          <map:transform type="highlight" />
          <map:transform type="xslt"
              src="datagram/kha/archief/archief.xsl"/>
          <map:serialize/> 
        </map:match>
      

    • Make sure your final XSLT transformation (in my case archief.xsl) catches the "em" tag and produces some real HTML highlighting.
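
What "catching the em tag" can look like is sketched below. This is not archief.xsl: the stylesheet is a hypothetical minimal one that copies everything unchanged and only rewrites em to b. The Java wrapper (class EmToBold, method render) is likewise hypothetical and is only there so that the template can be tried outside Cocoon with the JDK's built-in XSLT processor.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class EmToBold {
    // Hypothetical stylesheet: identity copy plus one template for <em>.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // The highlight tag inserted by the transformer becomes <b>.
      + "<xsl:template match='em'>"
      + "<b><xsl:apply-templates/></b>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet above to an XML string and returns the result. */
    public static String render(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            render("<p>Searching with <em>lucene</em> is fast</p>"));
    }
}
```

In a real stylesheet you would add only the em template to your existing templates, and you may prefer a span with a CSS class over b.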

Finished! Your keywords should be highlighted now.
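
If you want to experiment with the idea from step 4 outside Cocoon, the token walk can be sketched with the JDK alone. The sketch below is not the transformer from lucene.zip. It assumes a regular expression that splits on non-letters and a lowercase comparison as a stand-in for the Lucene analyzer, and the names HighlightSketch, highlight and render are hypothetical. What it keeps from the original is the pattern: buffer plain text, flush it as a characters event, then emit startElement, characters and endElement for each matching token.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class HighlightSketch {
    // Stand-in tokenizer (assumption): a token is a run of letters.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    /** Sends text to handler, wrapping every token found in terms in <tag>. */
    public static void highlight(String text, Set<String> terms, String tag,
                                 ContentHandler handler) throws SAXException {
        StringBuilder plain = new StringBuilder();
        Matcher m = WORD.matcher(text);
        int last = 0;
        while (m.find()) {
            plain.append(text, last, m.start()); // text between tokens
            String token = m.group();
            if (terms.contains(token.toLowerCase(Locale.ROOT))) {
                flush(plain, handler);
                handler.startElement("", tag, tag, new AttributesImpl());
                handler.characters(token.toCharArray(), 0, token.length());
                handler.endElement("", tag, tag);
            } else {
                plain.append(token);
            }
            last = m.end();
        }
        plain.append(text, last, text.length()); // trailing text
        flush(plain, handler);
    }

    // Emits the buffered plain text as one characters event, if any.
    private static void flush(StringBuilder plain, ContentHandler handler)
            throws SAXException {
        if (plain.length() > 0) {
            handler.characters(plain.toString().toCharArray(), 0, plain.length());
            plain.setLength(0);
        }
    }

    /** Demo helper: records the SAX events as a string (no escaping). */
    public static String render(String text, String... terms) {
        final StringBuilder out = new StringBuilder();
        DefaultHandler recorder = new DefaultHandler() {
            @Override public void startElement(String uri, String local,
                                               String qName, Attributes atts) {
                out.append('<').append(qName).append('>');
            }
            @Override public void endElement(String uri, String local, String qName) {
                out.append("</").append(qName).append('>');
            }
            @Override public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            highlight(text, new HashSet<>(Arrays.asList(terms)), "em", recorder);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <em>Bill</em> <em>Clinton</em> paid the <em>bill</em>.
        System.out.println(render("Bill Clinton paid the bill.", "bill", "clinton"));
    }
}
```

Note that every occurrence of a term is wrapped, whether or not it is part of the phrase the user searched for.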

Comments, Bugs, and Improvements

  • I did not use the Lucene system that is integrated in Cocoon because:
    1) I started this before this was included in Cocoon.
    2) I don't need a crawler. I have to index 100000+ XML files, so I wanted to bypass Cocoon and index the files directly. The penalty of using a Cocoon view may not be that dramatic, though, because I am also parsing each XML document with a parser, and this is the most time-consuming part.
  • I am not doing hit highlighting in the hitlist, only in the target documents. Doing hit highlighting in the hitlist seems easy. You can use the same transformer. But since I am displaying the first few lines of the document which may not contain the words you are looking for, this may appear confusing. As an alternative you could dynamically show the part of the document that contains the words you are looking for. But for this to work you must store the complete document in the Lucene index.
  • You have to maintain the package structure in /cocoon/WEB-INF/classes. If you want to change the nl.datagram package name into something else, you have to change it in a number of places (XML, XSL, helper Java classes, logicsheets, sitemap, cocoon.xconf). Don't try this unless you know what you are doing. It is easy to miss an occurrence, and your project will turn into a debugging nightmare.
  • There is a little problem with phrases. If you are looking for 'Bill Clinton', Lucene correctly finds all documents with this phrase and the highlighter highlights them correctly. But all other occurrences of the words 'bill' and 'clinton' in the same document are also highlighted (some of you may call this a feature instead of a bug).
  • The tag string and especially the Lucene Analyzer are hardcoded into the Java Transformer class. If you want to switch analyzer for indexing, you also have to recompile your transformer.
  • It would be nice if you could limit highlighting to certain elements in the target document.
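
One possible way to approach the last point is to let the transformer keep track of whether it is currently inside an element where highlighting is wanted. The sketch below is only an idea, not part of the transformer described above: the class HighlightScope and its methods are hypothetical, and it assumes that elements are identified by their plain name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class HighlightScope {
    private final Set<String> allowed;
    // Number of currently open elements whose name is in 'allowed'.
    private int depth = 0;

    public HighlightScope(String... elementNames) {
        allowed = new HashSet<>(Arrays.asList(elementNames));
    }

    /** Call this from the transformer's startElement. */
    public void startElement(String name) {
        if (allowed.contains(name)) {
            depth++;
        }
    }

    /** Call this from the transformer's endElement. */
    public void endElement(String name) {
        if (allowed.contains(name) && depth > 0) {
            depth--;
        }
    }

    /** True while the current position is inside an allowed element. */
    public boolean isActive() {
        return depth > 0;
    }

    public static void main(String[] args) {
        HighlightScope scope = new HighlightScope("body");
        scope.startElement("article");
        System.out.println(scope.isActive()); // false: not inside <body> yet
        scope.startElement("body");
        scope.startElement("p");
        System.out.println(scope.isActive()); // true: <p> is nested in <body>
        scope.endElement("p");
        scope.endElement("body");
        System.out.println(scope.isActive()); // false: <body> is closed again
    }
}
```

In the characters function, the transformer would then insert highlight tags only while isActive() returns true and pass the text through unchanged otherwise.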