Platform: Debian / Nutch 0.7
When starting to work with Nutch, I had some problems getting crawls of the local filesystem to work.
In the following, I document the steps I took to get Nutch to do what I expected.
First, create a new working directory; in the rest of this document I refer to that directory as CRAWL_HOME.
It is not strictly necessary to create a new directory, but it helps to keep things separated.
mkdir crawl-localfs
cd crawl-localfs
Now copy the conf directory from the Nutch distribution to the new directory. This allows you to experiment with configuration settings without destroying other working settings.
cp -r NUTCH_HOME/conf .
In the new conf directory, I changed nutch-site.xml to include the protocol-file plugin and to set the limit for file content to -1. The latter property limits how much content Nutch downloads from a single file (the value is in bytes); if it is set to -1, Nutch fetches and parses every file in full, regardless of its size.
This is my nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
</nutch-conf>
The next file to change is crawl-urlfilter.txt. I removed file from the skip-protocols line and added http instead.
The next two lines tell Nutch to skip unparseable file types as well as query-like URLs; leave them unchanged. Then disable the line that accepts hosts in MY.DOMAIN.NAME, and change the last line from "skip anything else" to "accept anything else" by writing "+.*".
This is my crawl-urlfilter.txt
#skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):
#skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
#skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
#accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
#accept anything else
+.*
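The regex filter applies these rules from top to bottom, and the first rule whose pattern matches decides: a leading "-" rejects the URL, a leading "+" accepts it. As an illustration only, here is a hypothetical shell sketch of that decision logic for the file above; it is not Nutch code, and the suffix rule is omitted for brevity:

```shell
# Hypothetical sketch (not Nutch code): emulate the rule order of the
# crawl-urlfilter.txt above. The first matching rule wins.
check_url() {
  url=$1
  # -^(http|ftp|mailto):  -> reject non-file protocols
  if printf '%s\n' "$url" | grep -qE '^(http|ftp|mailto):'; then
    echo reject; return
  fi
  # -[?*!@=]  -> reject query-like URLs
  if printf '%s\n' "$url" | grep -qE '[?*!@=]'; then
    echo reject; return
  fi
  # +.*  -> accept anything else
  echo accept
}

check_url 'file:///home/cf/tests/'   # accept
check_url 'http://example.com/'      # reject
check_url 'file:///tmp/a?b=1'        # reject
```

The rule order matters: if "+.*" came first, every URL would be accepted before the reject rules were ever consulted.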
In order to control Nutch's logging, create a file logging.properties and place it in CRAWL_HOME/conf.
My logging.properties is really basic. It logs on a very verbose level to the console and looks like this:
handlers = java.util.logging.ConsoleHandler
.level = FINE
# the console handler filters at INFO by default, so raise its level too
java.util.logging.ConsoleHandler.level = FINE
Nutch uses JDK 1.4 logging, so please refer to Sun's documentation to learn more about the available options.
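If a global FINE level turns out to be too chatty, JDK 1.4 logging also lets you raise verbosity per logger-name prefix. A sketch of a more selective logging.properties (the package prefix shown is only an example):

```properties
handlers = java.util.logging.ConsoleHandler
# the ConsoleHandler itself defaults to INFO, so raise its level as well
java.util.logging.ConsoleHandler.level = FINE
# keep everything else quiet, but log Nutch classes verbosely
.level = INFO
org.apache.nutch.level = FINE
```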
Next, create a file named urls in CRAWL_HOME. Its entries point to the files and directories that Nutch shall crawl.
Specifying directories with “file://” or with “file:” made no difference in my case, but I think the correct syntax is “file://”.
I found it necessary to add a trailing slash to each directory-entry:
My urls file looks like this:

file:///home/cf/tests/
file:///data/topicmaps/

Nutch recognises lines that do not contain a valid file URL and skips them.
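Since invalid lines are skipped silently, a quick sanity check before starting a long crawl can save time. The following is a hypothetical one-liner, not part of Nutch: it prints every entry in the urls file that does not look like a file URL with a trailing slash (shown here against an example file):

```shell
# Hypothetical check (not part of Nutch): flag urls-file entries that are
# not file URLs ending in a slash, so typos surface before the crawl runs.
printf '%s\n' 'file:///home/cf/tests/' 'file:///data/topicmaps/' '/data/bad' > urls.example
grep -vE '^file://.*/$' urls.example   # prints: /data/bad
```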
The following issue took me some time. The crawler was not restricted to the directories that I specified in the urls file; it was jumping into the parent directories as well. The code responsible for this is in org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f).
While it is obvious, when looking at the code, that this behavior was intended by the author, I failed to understand the motivation behind it. For my own crawls I changed the code so that only directories beneath the ones I specify get crawled.
I changed the following line:

this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);

to:

this.content = list2html(f.listFiles(), path, false);
To speed up recompilation of this single plugin, add the following target to build.xml in NUTCH_HOME. To build a new jar, call ant with the targets compile-file-protocol-plugin and jar.
<target name="compile-file-protocol-plugin">
  <ant dir="src/plugin/protocol-file" target="deploy" inheritAll="false"/>
</target>
To get Nutch to use the new conf directory, you have to set the NUTCH_CONF_DIR environment variable.
The path to logging.properties must be set via the NUTCH_OPTS environment variable.
My shell script to start the crawl looks like this:
#!/bin/sh
export NUTCH_CONF_DIR=/data/software/java/nutch/crawl-localfs/conf/
export NUTCH_OPTS=-Djava.util.logging.config.file='/data/software/java/nutch/crawl-localfs/conf/logging.properties'

#remove last crawl results
rm -r crawlresult_localfs

#start a new crawl
../nutch-0.7/bin/nutch crawl urls -dir crawlresult_localfs
The script must be executed from CRAWL_HOME.