I recently discovered that I could not find all of the text in a rather large document.
It turned out that Solr truncates any field that contains more than 10000 tokens (the default setting); everything beyond that limit is simply not indexed.
I increased the limit to 1000000 tokens, re-indexed, and saw that my document was now fully searchable.
The setting is found in solrconfig.xml:
<maxFieldLength>1000000</maxFieldLength>
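To illustrate what the limit does, here is a minimal Python sketch that mimics the truncation. It is a simplified model and not Solr's actual code: the function name `index_terms` is hypothetical, and plain whitespace splitting stands in for a real analyzer.

```python
def index_terms(text, max_field_length=10000):
    """Return the tokens that would be indexed under a maxFieldLength-style cap.

    Simplified model: whitespace splitting stands in for a real analyzer.
    """
    tokens = text.split()
    return tokens[:max_field_length]


# A document with 10005 distinct "words"
doc = " ".join("word%d" % i for i in range(10005))

indexed = index_terms(doc)
print(len(indexed))            # number of tokens kept with the default cap
print("word10004" in indexed)  # the last word falls past the cap
print("word10004" in index_terms(doc, max_field_length=1000000))
```

With the default cap the tail of the document never reaches the index, which is why a search for text near the end finds nothing until the limit is raised and the document is re-indexed.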
Tuesday, 14 December 2010
Thursday, 14 October 2010
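Splitting values into multivalued fields in Solr
When you need to split a value into a multivalued field (e.g. when crawling websites, where the meta keywords should end up as separate values and not as one long comma-separated string), you can use a field type like this one. The pattern attribute is a regular expression that matches the delimiter, so set it to whatever separates your values; the example splits on the pipe character: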
<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\|" />
  </analyzer>
</fieldType>
One thing that I discovered was that, in the search result XML, the field output is formatted exactly like the input, so it looks as if the tokenization has failed. It has not: the search result shows the stored value, which is always returned verbatim, while tokenization only affects the indexed terms. In the schema browser you can see that the values are separated nicely.
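The difference between the stored value and the indexed terms can be sketched in Python. This is only an illustration of the idea, assuming the pipe pattern from the field type above; `stored_and_indexed` is a hypothetical helper and not part of Solr, and dropping empty pieces is a choice made for this sketch.

```python
import re


def stored_and_indexed(value, pattern=r"\|"):
    """Model a field that is both stored and indexed with a pattern tokenizer.

    The stored value is kept verbatim (this is what the search result shows),
    while the indexed terms are the pieces between delimiter matches.
    Empty pieces are dropped in this sketch.
    """
    terms = [t for t in re.split(pattern, value) if t]
    return value, terms


stored, terms = stored_and_indexed("solr|tokenizer|regex")
print(stored)  # the unchanged input string
print(terms)   # the separate terms, as seen in the schema browser
```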
Labels: commadelimited, commaseparated, patterntokenizerfactory, regex, separator, solr, tokenizer
Danish letters in Solr on Tomcat
In order to make Danish/Swedish/Norwegian letters (or other special characters) searchable in Solr, I figured out that this was the way to go:
- Make sure that your Solr conf folder contains this file: mapping-ISOLatin1Accent.txt
This file has mappings from special chars to non-special chars (å => aa, æ => ae, ø => o (oe) etc.).
- In your schema.xml, in your field type definition (e.g. text), the mapper should be the first step in the analyzer chain.
This goes for both the index analyzer and the query analyzer.
At this point I thought that all was fine and dandy... But no! When searching for Danish letters in the Solr admin interface, the letters were not encoded correctly.
- The fix (a setting) was found in a Tomcat config file called server.xml (Tomcat x.x/conf/server.xml).
This file has a Service node that contains a Connector node. This node was missing the URIEncoding attribute, so I added it, and my node then looked like this:
<Connector port="8080" protocol="HTTP/1.1"
URIEncoding="UTF-8"
connectionTimeout="20000"
redirectPort="8443" />
- A quick restart of the server and some IE cache clearing later, the problem was solved!
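Why the missing attribute garbles the letters can be shown with a small Python sketch. It assumes that the browser sends the query percent-encoded as UTF-8 and that the connector, without URIEncoding, falls back to ISO-8859-1 (the default in Tomcat versions of that era); the sample word is made up for the illustration.

```python
from urllib.parse import quote, unquote

query = "rødgrød"

# Browsers typically percent-encode the query string as UTF-8.
encoded = quote(query, encoding="utf-8")

# Decoding with the wrong charset garbles the non-ASCII letters ...
wrong = unquote(encoded, encoding="iso-8859-1")
# ... while decoding as UTF-8 restores the original text.
right = unquote(encoded, encoding="utf-8")

print(encoded)
print(wrong)
print(right)
```

Each Danish letter becomes two bytes in UTF-8, and a decoder that reads them as ISO-8859-1 turns every such letter into two unrelated characters, so the search term no longer matches anything in the index.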
Friday, 18 June 2010
FAST ESP Commands
Starting and stopping the FAST server
- net stop fastespservice / net start fastespservice : stops/starts the FAST server
Push a document into the index (collection: webdocuments)
- D:\ESP\adminserver\webapps\help\pdfs\assets\pdf>docpush -c webdocuments Enterprise_Crawler_Guide.pdf
Adding -d in front of the document name removes the document from the index again.
- D:\ESP\adminserver\webapps\help\pdfs\assets\pdf>docpush -c webdocuments -d Enterprise_Crawler_Guide.pdf
Delete/remove a collection
- collection-admin -m delcollection -n webdocuments
Clear the contents of a collection
- collection-admin -m clearcollection -n webdocuments
Wednesday, 28 April 2010
Custom pipeline stages with Python
Here's a little guide on how to start making your own pipeline stages in Python.
1. Create a configuration XML file for the stage in esp\etc\processors
2. Create a Python code file for the processor in esp\lib\python2.3\processors
3. Restart the config server to have the GUI include the new stage (nctrl restart configserver)
4. Reset psctrl to compile the Python file to a .pyc (psctrl reset)
5. Sometimes step 4 is not enough to build the .pyc, so I also use (nctrl restart procserver_1 procserver_2 ... procserver_x) to make another build attempt.
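If you repeat this often, the restart and reset commands listed above can be collected in a small helper. The following is a sketch only: `restart_commands` is a hypothetical function, the number of procservers depends on your installation, and the commands are printed rather than executed.

```python
def restart_commands(num_procservers):
    """Build the command lines used after adding or changing a stage.

    Returns one argument list per command, in the order they should be run.
    """
    procservers = ["procserver_%d" % i for i in range(1, num_procservers + 1)]
    return [
        ["nctrl", "restart", "configserver"],  # let the GUI pick up the new stage
        ["psctrl", "reset"],                   # recompile the Python code
        ["nctrl", "restart"] + procservers,    # fallback if the .pyc was not built
    ]


for cmd in restart_commands(2):
    print(" ".join(cmd))
```

On an ESP node each argument list could be passed to `subprocess.check_call` instead of being printed.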
Monday, 26 April 2010
Setting max hits (maxoffset) in FAST ESP
I've been working on a little app that I use to compare two collections. I use it after updates to see if I've broken anything...
When a collection contains more than 10000 docs, FAST will only return the first 10020 docs. Here is how to configure that limit.
Set the maxoffset value in the following files, on both the search nodes and the QRServer nodes:
- $FASTSEARCH/etc/config_data/RTSearch/webcluster/fdispatch.addon
- $FASTSEARCH/etc/config_data/QRServer/webcluster/etc/qrserver/qrserverrc
- $FASTSEARCH/etc/topfdispatchrc (if applicable)
Friday, 9 April 2010
no doc procs registered to process a batch with priority 0
I came across this error message recently and I was able to figure out the problem.
WARNING Could not send batch to ESP content distributor, will retry automatically. Reason given: process() failed: exception (no_resources) no doc procs registered to process a batch with priority 0
The problem was found in the document processor log files. (var\log\procserver\)
It turned out that a file was missing in esp\var\procserver\\\xxx.
What we did in order to solve the problem was:
1. stop the indexingdispatcher
2. stop all proc servers
3. delete all files and folders under esp\var\procserver\
4. start the indexingdispatcher
5. start the proc servers
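The delete step can be done by hand, but a small Python helper avoids removing the procserver folder itself by accident. This is a generic sketch and not a FAST tool: `clear_directory` is a hypothetical name, and the path you pass in depends on where ESP is installed. Only run it while the indexingdispatcher and the proc servers are stopped.

```python
import os
import shutil


def clear_directory(path):
    """Delete every file and folder under `path`, keeping `path` itself."""
    for name in os.listdir(path):
        entry = os.path.join(path, name)
        if os.path.isdir(entry) and not os.path.islink(entry):
            shutil.rmtree(entry)  # remove a subfolder with all its contents
        else:
            os.remove(entry)      # remove a single file (or a symlink)
```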