DotCMS: Externalizing Apache Nutch Indexing

This guide is for those trying to generate a search index for a DotCMS 1.9 installation using an external copy of Apache Nutch. While DotCMS 1.9 comes bundled with Nutch 1.1, you may still want to run Nutch outside of the DotCMS process for a couple of reasons:

  1. Running Nutch from within DotCMS consumes a lot of memory. A gigabyte or more is not unusual even for a small site.
     (Image: Nutch memory usage)
  2. It is not possible to modify the behaviour of Nutch from within DotCMS, since the configuration is created on the fly in memory.

The solution is to run Nutch in a separate process. If we run out of memory, the indexing process fails, but the DotCMS instance stays functional. This solves problem #1. And because we control how Nutch is run, we can configure it as we please to solve problem #2.

Getting Started

The first thing to do is to download Nutch 1.1 from the official site. Grab the “bin” version unless you wish to compile the Nutch executable yourself. Stick with Nutch 1.1 instead of going for the latest version, to remain as compatible as possible with DotCMS 1.9.

Borrowing DotCMS’s Configuration

Since querying the index will still be performed by DotCMS, which includes its own configuration of Nutch, we need to make sure that we stay compatible.

DotCMS includes the Snowball Analyzer, which is missing from the Nutch 1.1 bin package. So, copy <DOTCMS>/dotCMS/WEB-INF/crawler_plugins/plugins/analysis-snowball to <NUTCH>/plugins/. Also copy the two JARs from that plugin directory into <NUTCH>/lib. If you skip this step, DotCMS will search using the “stemmed” version of your search terms (e.g. “integratio” instead of “integration”), but the index will not be stemmed the same way, and searches will return 0 results.
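If you prefer to script this step, the copy boils down to the following sketch (the DOTCMS and NUTCH install roots are placeholders for your own paths):

```shell
#!/bin/sh
# Placeholder install roots -- substitute your own paths.
DOTCMS=${DOTCMS:-/opt/dotcms}
NUTCH=${NUTCH:-/opt/nutch-1.1}

copy_snowball() {
    snowball="$DOTCMS/dotCMS/WEB-INF/crawler_plugins/plugins/analysis-snowball"
    # The plugin directory itself goes under plugins/...
    cp -r "$snowball" "$NUTCH/plugins/"
    # ...and its JARs go into lib/ so Nutch can load them.
    cp "$snowball"/*.jar "$NUTCH/lib/"
}

# Only run this against a real install.
[ -d "$NUTCH/plugins" ] && copy_snowball || :
```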

Creating Our Own Config Files

To get a basic configuration of Nutch:

  1. Update <NUTCH>/conf/crawl-urlfilter.txt to include your domain name instead of the default “MY.DOMAIN.NAME”.
  2. Update <NUTCH>/conf/nutch-default.xml:
    1. Set the “http.agent.name” property to the name of your crawler (e.g. “The most glorious crawler of the intertubes”).
    2. Set the “http.robots.agents” property to include the name you specified for “http.agent.name”.
    3. Set the “plugin.includes” property to be nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|swf|text|zip|js|tika)|language-identifier|index-(basic|anchor|more)|query-(basic|site|url|more|custom|content)|lib-lucene-analyzers|response-(json|xml)|summary-(basic|lucene)|scoring-opic|urlnormalizer-(pass|regex|basic)|analysis-snowball. These are the same plugins that DotCMS configures for its built-in Nutch instance.
  3. Create url_folder/urls.txt and paste (one per line) the full URLs of the sites you want to crawl.
  4. Create a crawl directory, where Nutch will save the index.
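Steps 3 and 4 can be scripted like so; the seed URL is a placeholder for your own site:

```shell
#!/bin/sh
# Seed list: one full URL per line (placeholder domain -- use your own).
mkdir -p url_folder
cat > url_folder/urls.txt <<'EOF'
http://www.yourdomain.com/
EOF

# Nutch will write the link database and index segments here.
mkdir -p crawl
```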

This is just the minimum required to get a usable index generated. Nutch provides many other configuration options, which you should by all means experiment with!

Java 1.5 Compatibility

Apache Nutch 1.1 is compiled for JVM 1.6. However, to keep DotCMS compatible with JVM 1.5, its team appears to have compiled Nutch’s libraries with JDK 1.5 and bundled them with DotCMS 1.9. If you can’t use JVM 1.6 to run Nutch, you will need to copy the following JARs from <DOTCMS>/dotCMS/WEB-INF/lib/ into <NUTCH>/lib:

jakarta-oro.jar
hadoop-0.20.3-dev-core.jar
oro.jar
hadoop-0.20.3-dev-tools.jar
hsql.jar
commons-collections-3.2.jar
commons-httpclient.jar
wstx-asl-3.2.8.jar
commons-lang-2.4.jar
commons-logging-1.1.1.jar
junit-3.8.2.jar
commons-beanutils.jar

You will also need to delete the incompatible JARs from <NUTCH>/lib/:

jakarta-oro-2.0.8.jar
hadoop-0.20.2-core.jar
commons-httpclient-3.1.jar
commons-collections-3.2.1.jar
hsqldb-1.8.0.10.jar
oro-2.0.8.jar
hadoop-0.20.2-tools.jar
wstx-asl-3.2.7.jar
commons-lang-2.1.jar
commons-logging-1.0.4.jar
junit-3.8.1.jar
commons-beanutils-1.8.0.jar
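Rather than shuffling two dozen JARs around by hand, you can script the swap. A sketch, assuming DOTCMS and NUTCH point at your install roots (placeholder paths):

```shell
#!/bin/sh
# Placeholder install roots -- substitute your own paths.
DOTCMS=${DOTCMS:-/opt/dotcms}
NUTCH=${NUTCH:-/opt/nutch-1.1}

# JDK 1.5 builds bundled with DotCMS 1.9
JDK15_JARS="jakarta-oro.jar hadoop-0.20.3-dev-core.jar oro.jar \
hadoop-0.20.3-dev-tools.jar hsql.jar commons-collections-3.2.jar \
commons-httpclient.jar wstx-asl-3.2.8.jar commons-lang-2.4.jar \
commons-logging-1.1.1.jar junit-3.8.2.jar commons-beanutils.jar"

# JDK 1.6 builds shipped with the Nutch 1.1 bin package
JDK16_JARS="jakarta-oro-2.0.8.jar hadoop-0.20.2-core.jar \
commons-httpclient-3.1.jar commons-collections-3.2.1.jar \
hsqldb-1.8.0.10.jar oro-2.0.8.jar hadoop-0.20.2-tools.jar \
wstx-asl-3.2.7.jar commons-lang-2.1.jar commons-logging-1.0.4.jar \
junit-3.8.1.jar commons-beanutils-1.8.0.jar"

swap_jars() {
    for jar in $JDK15_JARS; do
        cp "$DOTCMS/dotCMS/WEB-INF/lib/$jar" "$NUTCH/lib/"
    done
    for jar in $JDK16_JARS; do
        rm -f "$NUTCH/lib/$jar"
    done
}

# Only run this against a real install.
[ -d "$NUTCH/lib" ] && swap_jars || :
```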

Running the Nutch Indexer

The following command runs the Nutch crawler on the URLs listed in the text files inside the url_folder/ directory, follows links up to 4 levels deep, and writes the resulting index into the crawl/ directory:

> bin/nutch crawl url_folder -dir crawl -depth 4

Updating DotCMS’s Index

Once the search index is generated, you’ll need to tell DotCMS that it should use it. Counterintuitively, this is not done by overwriting the existing index, but by moving the generated index next to the existing one with a suffix of “_temp”:

  1. Find out the host ID of the DotCMS host you’re indexing. This can be done by looking under CMS Admin > Hosts (click on the host name and check the “Identity” field). Let’s use 71e723d1-450e-41c0-ba48-385912103c9c as an example.
  2. Copy the contents of the crawl directory where your new index resides to <DOTCMS>/dotCMS/assets/search_index/yourdomain.com/71e723d1-450e-41c0-ba48-385912103c9c_temp/crawl-index. Replace “yourdomain.com” with the host name of your site in DotCMS. You should have the crawldb, indexes, etc. directories under the crawl-index directory.
  3. The next time you run a search from your site, DotCMS will adopt the new index and use it for all future searches. The “_temp” suffix is dropped from the index path, allowing you to update the index again the same way in the future. To see how this works, check CrawlerUtil.refreshIndexForHost() in the DotCMS source code.
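Step 2 can be scripted as well. A sketch, using the host ID and host name from the example above and a placeholder DotCMS path:

```shell
#!/bin/sh
# Placeholder values from the example above -- substitute your own.
DOTCMS=${DOTCMS:-/opt/dotcms}
SITE=yourdomain.com
HOST_ID=71e723d1-450e-41c0-ba48-385912103c9c

deploy_index() {
    dest="$DOTCMS/dotCMS/assets/search_index/$SITE/${HOST_ID}_temp/crawl-index"
    mkdir -p "$dest"
    # crawldb, indexes, etc. must end up directly under crawl-index/
    cp -r crawl/. "$dest/"
}

# Only run this against a real install.
[ -d "$DOTCMS/dotCMS/assets" ] && deploy_index || :
```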

At this point you should have an updated search index for your DotCMS site. You can script the crawling using your language of choice, schedule regular index updates with cron, and still leave enough RAM for DotCMS to serve your content. Did this technique work for you? Let me know in the comments!
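For instance, a nightly reindex job could look like this sketch (all paths and the host ID are placeholders to adjust for your setup):

```shell
#!/bin/sh
# reindex.sh -- sketch of a nightly reindex job (placeholder values throughout).
# Schedule it with cron, e.g.:
#   0 2 * * *  /opt/scripts/reindex.sh >> /var/log/reindex.log 2>&1

NUTCH=${NUTCH:-/opt/nutch-1.1}
DOTCMS=${DOTCMS:-/opt/dotcms}
SITE=yourdomain.com
HOST_ID=71e723d1-450e-41c0-ba48-385912103c9c

reindex() {
    cd "$NUTCH" || return 1
    rm -rf crawl                       # start from a clean slate
    bin/nutch crawl url_folder -dir crawl -depth 4 || return 1
    dest="$DOTCMS/dotCMS/assets/search_index/$SITE/${HOST_ID}_temp/crawl-index"
    mkdir -p "$dest"
    cp -r crawl/. "$dest/"
}

# Only run this against a real install.
[ -x "$NUTCH/bin/nutch" ] && reindex || :
```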
