Monday 25 July 2011

Solr Schema.xml notes

I haven't blogged about Solr for a while. Busy with other projects and all that jazz.
Since I last blogged about Solr I have become a lot more knowledgeable about how it
all fits together.

This post is mostly intended to help me remember what everything in Solr's "schema.xml" file is for.

The Solr schema document is split up into two main sections - with some additional parameters that can be specified afterwards.

These sections are as follows.

<schema>
    <types>
        ......
    </types>

    <fields>
        ......
    </fields>
    <uniqueKey ... />
    <defaultSearchField ... />
    <solrQueryParser ... />
</schema>
 
The schema.xml file is not scary at all - all it really does is describe how your database fields map to fields in the Solr index. Each field you specify in the fields section should be given a name, a type and some options that define what should be done with the field when it is processed by Solr.
For example, if you have a text field in your database called "Title" you might decide to create a mapping like this ...

<field name="Title" type="text" indexed="true" stored="true"/>

To explain what this does ...

The "name" attribute is the name of the field in the Solr index; here it matches the database field it is populated from. The "type" attribute must be the name of one of the "fieldType" entries defined in the "types" section of schema.xml. In this case we are using the "text" field type because we want a flexible, case-insensitive, stemmed match on search queries. You may want to use a different field type for your use case.

The other two parameters are important as well.

indexed="true" -- the field can be queried / searched against
stored="true" -- the original value of the field is stored and can be retrieved for output in the result set

It's entirely possible you might want to retrieve only a list of keys / IDs from Solr and pull the records directly from your database when generating your results. If this is what you want, you could set stored="false" for all your fields except the ID. In most cases, though, letting Solr return the stored values is a great way to reduce load on your database, and if you are dedicating a server (or several) to your search, why the hell not pull all the data from Solr's index?
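
As a rough sketch of the "IDs only" approach: the field names "id" and "Body" below are made up for illustration, and I am assuming the "string" and "text" field types from the example schema.xml are still defined in your "types" section.

```xml
<fields>
    <!-- the only stored field: Solr returns just the key, the rest comes from the database -->
    <field name="id" type="string" indexed="true" stored="true" required="true"/>

    <!-- searchable, but the original values are not kept in the index -->
    <field name="Title" type="text" indexed="true" stored="false"/>
    <field name="Body" type="text" indexed="true" stored="false"/>
</fields>
```

Switching a field back to stored="true" later means re-indexing, since Solr never kept the original value.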

Another consideration is the size of your dataset. If it is large and time-consuming to build your index, you will want to be careful about how you analyse fields and whether they really need to be stored. Failure to consider this may leave you with a search index that is difficult to manage and keep up to date.

So ... what about all those "fieldTypes" at the top of schema.xml ?

If you are just starting out and want to get your data indexed, you probably won't need to change anything here! I would advise reading through some of the comments in the schema.xml file, as the important fieldTypes are explained there. Remember the "type" attribute in the "fields" section? Its value must be the name of a fieldType defined in this section.

Each fieldType defines how the fields that use it are handled in the index - take care to pick the most appropriate type for each of your fields. Some of the fieldTypes are just basic mappings (defined at the top: string, integer etc). The more sophisticated ones define what analysis is applied, on a per-field basis, at both index time and query time. A useful tip: if you are using a field type for one of your field mappings but it's not quite doing what you want, try copying the block of XML associated with that fieldType, give the copy a new name, and tweak its settings - using a different FilterFactory, for example. Then point your field at the new type.
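
For example, a cut-down copy of a text type without the stemming filter might look something like the sketch below. The name "text_nostem" is my own invention, and the analyzer chain is simplified compared to the "text" type that ships in the example schema, so treat it as a starting point rather than a drop-in replacement.

```xml
<!-- hypothetical copy of a text fieldType, with the stemming filter left out -->
<fieldType name="text_nostem" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```

A field would then use it with type="text_nostem". Changing the analysis of an existing field only takes effect for documents indexed after the change, so plan on a re-index.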

Finally there are some additional parameters at the bottom of the XML file that define defaults or allow you to copy fields.

They are as follows.
<uniqueKey>id</uniqueKey> - the field that uniquely identifies each document; you will almost always need to specify this (self-explanatory)
<solrQueryParser defaultOperator="OR"/>
In the solrQueryParser element, specify what you want the default operator to be: should a search for "cat dog" find only documents containing both terms ("cat AND dog"), or documents containing either term ("cat OR dog")? This can be changed at query time if needed.
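
A small sketch of the stricter setting, assuming you want every term to be required by default:

```xml
<!-- a search for "cat dog" now behaves like "cat AND dog" -->
<solrQueryParser defaultOperator="AND"/>
```

To override it for a single request, Solr's standard query parser accepts a q.op parameter on the query (for example q.op=OR).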

<copyField source="srcfield" dest="destfield"/>

You can also configure "copyFields". This is useful if you want the same source field to be indexed / queried
in two different ways. Define these after the fields section.
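
As an illustration, here is the "Title" field from earlier copied into a second field for exact matching. The name "Title_exact" is hypothetical, and I am assuming the "string" type from the example schema is available.

```xml
<!-- analysed version: case-insensitive, stemmed matching -->
<field name="Title" type="text" indexed="true" stored="true"/>

<!-- untouched copy: exact matches only -->
<field name="Title_exact" type="string" indexed="true" stored="false"/>

<!-- Solr copies the raw incoming value of Title into Title_exact before any analysis runs -->
<copyField source="Title" dest="Title_exact"/>
```

The destination field does not need to be stored, as the original value can still be retrieved from "Title".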

I hope this helps lift some of the mystery of this configuration file.

Nick ...