Enterprise Search with FAST and Solr

Tuesday, 14 December 2010

MaxFieldLength in Solr - When indexing very large documents

I recently discovered that I could not find all of the text in a rather large document.
It turned out that Solr truncates a field after 10000 tokens (the default setting), so anything past that point is never indexed.
I increased the limit to 1000000, re-indexed, and saw that my document was now fully searchable.

The setting is found in solrconfig.xml:

<maxFieldLength>1000000</maxFieldLength>
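
To see why the tail of a long document goes missing, here is a minimal Python sketch of the truncation effect. It is not Solr code: whitespace splitting stands in for the real analyzer, and index_field and max_field_length are hypothetical names used only for illustration.

```python
def index_field(text, max_field_length=10000):
    """Return the set of terms that would end up searchable.

    Simplification: whitespace splitting stands in for a real Solr analyzer.
    Tokens past max_field_length are silently dropped.
    """
    tokens = text.lower().split()
    return set(tokens[:max_field_length])


# A document with 10000 filler words followed by one unique word at the end.
document = " ".join(["filler"] * 10000 + ["needle"])

default_index = index_field(document)          # default limit of 10000
raised_index = index_field(document, 1000000)  # raised limit

print("needle" in default_index)  # False: the last token was cut off
print("needle" in raised_index)   # True
```

Nothing warns you about the dropped tokens in this sketch, which matches what I saw: the document indexed fine, but its tail was simply not there.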

Thursday, 14 October 2010

Splitting values into multivalued fields in Solr

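When you need to split a value into a multivalued field (e.g. when crawling websites, where the meta keywords should end up as separate values and not just one long comma-separated string), you can use a field type like the one below. Set the pattern to whatever delimiter your data uses; despite its name, this example splits on the pipe character: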


<fieldType name="semicolonDelimited" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.PatternTokenizerFactory" pattern="\|" />
    </analyzer>
</fieldType>

One thing that I discovered was that in the search result XML, the field output is formatted exactly like the input, so it looks like the tokenization has failed. That is expected: the result shows the stored value, while tokenization only affects the indexed terms. In the schema browser you can see that the values are separated nicely.
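
The difference between the stored value and the indexed terms can be sketched in a few lines of Python. This is only a rough stand-in for the pattern tokenizer, not Solr's implementation; pattern_tokenize and the sample value are made up for illustration.

```python
import re


def pattern_tokenize(value, pattern=r"\|"):
    """Split a raw field value on a regex delimiter, dropping empty tokens."""
    return [token for token in re.split(pattern, value) if token]


stored_value = "solr|fast esp|enterprise search"  # what the result XML shows
indexed_terms = pattern_tokenize(stored_value)    # what the schema browser shows

print(stored_value)
print(indexed_terms)  # ['solr', 'fast esp', 'enterprise search']
```

Passing a different pattern, for example r";" or r",\s*", is the equivalent of changing the pattern attribute on the tokenizer.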

Danish letters in Solr on Tomcat

In order to make Danish/Swedish/Norwegian letters (or other special characters) searchable in Solr, I figured out that this was the way to go:

Make sure that your Solr conf folder contains this file: mapping-ISOLatin1Accent.txt
This file maps special characters to plain ones (å => aa, æ => ae, ø => o (oe), etc.).
In schema.xml, in your field type definition (e.g. text), the first element in the analyzer chain should be the mapper, i.e. a char filter such as solr.MappingCharFilterFactory that points at this file.

This goes for both the index analyzer and the query analyzer.
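
Why the mapper has to sit in both chains is easiest to see in a small Python sketch. It is a simplification, not Solr code: CHAR_MAP is a hypothetical excerpt using the mappings mentioned above, and lowercasing plus whitespace splitting stand in for the rest of the analyzer.

```python
# Hypothetical excerpt of a mapping table, using the mappings named above.
CHAR_MAP = {"å": "aa", "æ": "ae", "ø": "o"}


def map_chars(text):
    """Replace each mapped character; stands in for the mapping step."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def analyze(text):
    """Map characters first, then lowercase and split on whitespace."""
    return map_chars(text).lower().split()


indexed = set(analyze("Rødgrød med fløde"))

# With the same chain at query time, both spellings match.
print(analyze("fløde")[0] in indexed)  # True
print(analyze("flode")[0] in indexed)  # True

# A query that skips the mapping misses, because the index only holds "flode".
print("fløde" in indexed)  # False
```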
At this point I thought that all was fine and dandy... but no! When searching for Danish letters in the Solr admin interface, the letters were not encoded correctly.
The fix was a setting in a Tomcat config file called server.xml (Tomcat x.x/conf/server.xml).
This file has a Service node that contains a Connector node, and that node was missing the URIEncoding attribute. I added the attribute, and my node now looked like this:

<Connector port="8080" protocol="HTTP/1.1"
URIEncoding="UTF-8"
connectionTimeout="20000"
redirectPort="8443" />

A quick restart of the server and some IE cache clearing later, the problem was solved!
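
The garbling itself is easy to reproduce outside Tomcat. The Python sketch below percent-encodes a query as UTF-8, as a browser typically would, and then decodes it with two different charsets. The sample query is made up; the Latin-1 case mirrors a server that decodes the URI as ISO-8859-1.

```python
from urllib.parse import quote, unquote

query = "rødgrød"

# Sent as UTF-8, each non-ASCII letter becomes two percent-encoded bytes.
encoded = quote(query, encoding="utf-8")
print(encoded)  # r%C3%B8dgr%C3%B8d

# Decoding with the wrong charset (Latin-1) garbles the letters.
print(unquote(encoded, encoding="latin-1"))  # rÃ¸dgrÃ¸d

# Decoding as UTF-8, which is what URIEncoding="UTF-8" asks for, restores them.
print(unquote(encoded, encoding="utf-8"))  # rødgrød
```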

Friday, 18 June 2010

FAST ESP Commands

Starting and stopping the FAST server

net start fastespservice : starts the FAST server
net stop fastespservice : stops the FAST server

Push a document into the index (collection: webdocuments)

D:\ESP\adminserver\webapps\help\pdfs\assets\pdf>docpush -c webdocuments Enterprise_Crawler_Guide.pdf

Adding -d in front of the file name removes the document from the index again.

D:\ESP\adminserver\webapps\help\pdfs\assets\pdf>docpush -c webdocuments -d Enterprise_Crawler_Guide.pdf

Delete/remove a collection

collection-admin -m delcollection -n webdocuments

Clear the contents of a collection

collection-admin -m clearcollection -n webdocuments

Wednesday, 28 April 2010

Custom pipeline stages with Python

Here's a little guide on how to start making your own pipeline stages in Python.

1. Create a configuration XML file for the stage in esp\etc\processors
2. Create a Python code file for the processor in esp\lib\python2.3\processors
3. Restart the config server so that the GUI includes the new stage (nctrl restart configserver)
4. Reset psctrl to compile the Python file to a .pyc (psctrl reset)
5. Sometimes step 4 is not enough to build the .pyc, so I also use (nctrl restart procserver_1 procserver_2 ... procserver_x) to make another build attempt.
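
Since I run the last three steps every time I change a stage, they can be collected in a small helper. This is only a sketch: refresh_commands is a hypothetical name, the number of procservers is an assumption you must adjust to your installation, and the function only builds the command lines without running anything.

```python
def refresh_commands(procserver_count):
    """Build the command lines for the restart/reset steps listed above."""
    procservers = ["procserver_%d" % n for n in range(1, procserver_count + 1)]
    return [
        ["nctrl", "restart", "configserver"],  # GUI picks up the new stage
        ["psctrl", "reset"],                   # recompile the Python file
        ["nctrl", "restart"] + procservers,    # fallback rebuild attempt
    ]


for command in refresh_commands(2):
    print(" ".join(command))
```

To actually execute them, each list could be handed to subprocess.run(command, check=True) on a node where the ESP tools are on the path.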

Monday, 26 April 2010

Setting max hits (maxoffset) in FAST ESP

Been working on a little app that I use to compare two collections. I use it after updates to see if I've broken anything...

When a collection contains more than 10000 docs, FAST will only return the first 10020 docs. Here is how to configure that limit:

Files:

$FASTSEARCH/etc/config_data/RTSearch/webcluster/fdispatch.addon
$FASTSEARCH/etc/config_data/QRServer/webcluster/etc/qrserver/qrserverrc
$FASTSEARCH/etc/topfdispatchrc (if applicable)

These files must be edited on the search and QRServer nodes: set the maxoffset value, then restart the qrserver, topfdispatch and search-1 processes.

Friday, 9 April 2010

no doc procs registered to process a batch with priority 0

I came across this error message recently and I was able to figure out the problem.

WARNING    Could not send batch to ESP content distributor, will retry automatically. Reason given: process() failed: exception (no_resources) no doc procs registered to process a batch with priority 0

The problem was found in the document processor log files. (var\log\procserver\)

It turned out that a file was missing in esp\var\procserver\\\xxx.

What we did in order to solve the problem was:

1. stop the indexingdispatcher

2. stop all proc servers

3. delete all files and folders under esp\var\procserver\

4. start the indexingdispatcher

5. start proc servers