Thursday, 14 October 2010

Splitting values into multivalued fields in Solr

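When you need to split a value into separate values in a field (e.g. when crawling websites, where the meta keywords should end up as separate values and not as one long delimiter-separated string), you can use the field type below. Set the pattern to whatever delimiter your data uses; in this example it is a pipe character (|), despite the field type's name: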



<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\|" />
  </analyzer>
</fieldType>


One thing I discovered was that in the search result XML, the field output is formatted exactly like the input, so it looks as if the tokenization has failed. That is expected: the search result returns the stored value, which is kept verbatim, while tokenization only affects the indexed terms. If you look in the schema browser, you can see that the values are separated nicely.
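
To put the field type to use, a field has to reference it in schema.xml. The following is only a minimal sketch: the field name `keywords` is a hypothetical example and not taken from my actual schema.

```xml
<!-- Hypothetical field name; the type refers to the fieldType defined above. -->
<!-- stored="true" returns the original string in results; indexed="true" makes the split tokens searchable. -->
<field name="keywords" type="semicolonDelimited" indexed="true" stored="true" />
```

Assuming a document was indexed with the value solr|lucene|search in this field, a query such as keywords:lucene should match it, even though the result XML still shows the full original string. Note that this analyzer does not lowercase anything, so the match is case-sensitive.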

Danish letters in Solr on Tomcat

To make Danish, Swedish, and Norwegian letters (or other special characters) searchable in Solr, I found that this was the way to go:

  1. Make sure that your Solr conf folder contains this file: mapping-ISOLatin1Accent.txt
    This file holds mappings from special characters to plain ones: (å => aa, æ => ae, ø => o (oe) etc.)
  2. In your schema.xml, in your field type definition (e.g. text), the first element in the analyzer chain should be the mapper that applies this file.

    This goes for both the index analyzer and the query analyzer.
    At this point I thought that all was fine and dandy... But no! When searching for Danish letters in the Solr admin interface, the letters were not encoded correctly.
  3. The fix turned out to be a setting in the Tomcat config file server.xml (Tomcat x.x/conf/server.xml).
    This file has a Service node that contains a Connector node, and that node was missing the URIEncoding attribute. Without it, Tomcat does not decode the query string in the URL as UTF-8. So I added the attribute, and my node now looks like this:

    <Connector port="8080" protocol="HTTP/1.1"
               URIEncoding="UTF-8"
               connectionTimeout="20000"
               redirectPort="8443" />

  4. A quick restart of the server and some IE cache clearing later, and the problem was solved!
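
For step 2 above, the analyzer chain could look roughly like the sketch below. This is an assumption about a typical setup rather than my exact schema: the mapper is taken to be solr.MappingCharFilterFactory, and the tokenizer and lowercase filter are placeholders for whatever your text field type already uses.

```xml
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <!-- Char filters run before the tokenizer, so the mapping is applied first. -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
    <!-- Placeholder tokenizer and filter; keep the ones from your existing definition. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <!-- Same mapping at query time, so search terms are normalised the same way as indexed terms. -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
```

Because the same mapping is applied on both sides, a search for a word containing æ, ø, or å and a search for its mapped spelling should hit the same documents. Documents indexed before the change need to be reindexed for the mapping to take effect.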