Thursday, 17 February 2011

Solr Configuration Options : Slow updates ( mergeFactor )

If you are experiencing slow imports or updates, there is a setting in solrconfig.xml that can dramatically affect indexing time.


As described in the Solr documentation, mergeFactor is an important consideration. The catch is that the higher you set mergeFactor, the slower search queries are likely to be. The question then becomes how much that matters to you. If your index only needs updating once a week, you can set mergeFactor low and get fast searches. If, on the other hand, you need search results updated more frequently, you will have to set it higher (depending on the amount of content in your domain), possibly at the cost of slower search responses.

 

mergeFactor Tradeoffs


High value merge factor (e.g., 25):
  • Pro: Generally improves indexing speed
  • Con: Less frequent merges, resulting in a collection with more index files which may slow searching
Low value merge factor (e.g., 2):
  • Pro: Smaller number of index files, which speeds up searching.
  • Con: More segment merges slow down indexing. 
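
As a rough illustration of changing the setting from the shell, here is a minimal sketch. It assumes your solrconfig.xml holds a <mergeFactor> element on a single line, as the stock example config does. The sample file it writes is a made-up stand-in so the commands can be tried safely; point the commands at your own config once you are happy with them.

```shell
# Work on a throwaway copy so nothing real is touched (hypothetical sample file).
workdir=$(mktemp -d)
cat > "$workdir/solrconfig.xml" <<'EOF'
<config>
  <indexDefaults>
    <mergeFactor>10</mergeFactor>
  </indexDefaults>
</config>
EOF

# Show the current setting and the line it lives on.
grep -n '<mergeFactor>' "$workdir/solrconfig.xml"

# Raise it to 25 to favour indexing speed; sed keeps the original as solrconfig.xml.bak.
sed -i.bak 's|<mergeFactor>[0-9]*</mergeFactor>|<mergeFactor>25</mergeFactor>|' "$workdir/solrconfig.xml"

# Read the value back to confirm the change.
merge_factor=$(sed -n 's|.*<mergeFactor>\([0-9]*\)</mergeFactor>.*|\1|p' "$workdir/solrconfig.xml")
echo "mergeFactor is now $merge_factor"
```

Solr reads solrconfig.xml at startup, so restart it after editing for the new value to take effect.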


Tuesday, 8 February 2011

Configuring Solr 1.4.1

Those of you following my posts will recall that I inherited an already configured version of Solr. The trouble is, it was basically just the "example" version of Solr with a few tweaks.

According to "The 7 Deadly Sins of Solr" it's incredibly common to find Solr installs which are (like mine) just modifications of the example app - mine even stretches as far as still containing "Solr Rocks" in the solrconfig.xml file.

So without further ado I'm going to fix those two issues.

To start with I'll be renaming the "example" folder to something proper.
You can call it whatever you feel is appropriate; just remember that from
now on I will be referring to the search app as "ProductIndex".

%> mv example ProductIndex


The next thing I am going to do is get rid of the "Solr Rocks" and see if we can tidy up the config.

The main config file for Solr is located in "ProductIndex/solr/conf".
Let's see what else is there.

%> cd ProductIndex/solr/conf
%> ls -1

admin-extra.html
elevate.xml
mapping-ISOLatin1Accent.txt
protwords.txt
schema.xml
scripts.conf
solrconfig.xml
spellings.txt
stopwords.txt
synonyms.txt
xslt





The file we are interested in is the "solrconfig.xml"

Let's attack it.

%> vim solrconfig.xml


OK, so let's find the "solr rocks" section and see what's going on there.

In vim,

esc :417

will take you directly to the line in question.

In context you will see something like this:

<!-- a firstSearcher event is fired whenever a new searcher is being
         prepared but there is no current registered searcher to handle
         requests or to gain autowarming data from. -->
<listener class="solr.QuerySenderListener" event="firstSearcher">
<arr name="queries">
<lst> <str name="q">solr rocks</str><str name="start">0</str><str name="rows">10</str></lst>
<lst><str name="q">static firstSearcher warming query from solrconfig.xml</str></lst>
</arr>
</listener>

So what does all that mean?

Well, basically it sets up a simple warming query that runs when a new searcher is being prepared and there is no existing searcher to take autowarming data from (much as it says in the comment). In practice that is usually when Solr is first started, and its job is to warm up the caches. Right now this is not going to be of any use unless, by some stroke of coincidence, you have documents containing the search term "solr rocks".

So let's put something sensible in there, like a really common query. You can specify several if you wish.

I'm going to change mine to say the following.

 <!-- a firstSearcher event is fired whenever a new searcher is being
         prepared but there is no current registered searcher to handle
         requests or to gain autowarming data from. -->
    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst> <str name="q">Tiesto</str><str name="start">0</str><str name="rows">10</str></lst>
      </arr>
    </listener>


Reading the comment again, it says this query gets run when there is no currently registered searcher, which in practice means at startup.
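
After an edit like this it is worth checking that no placeholder query is left behind and seeing which warming queries remain. A minimal sketch follows. The sample file it writes is a stand-in for your own solrconfig.xml, and the sed pattern assumes each query sits on one line in a <str name="q"> element, as in the snippet above.

```shell
# Throwaway sample holding the edited listener (stand-in for the real solrconfig.xml).
workdir=$(mktemp -d)
cat > "$workdir/solrconfig.xml" <<'EOF'
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst> <str name="q">Tiesto</str><str name="start">0</str><str name="rows">10</str></lst>
  </arr>
</listener>
EOF

# Count lines still mentioning the placeholder, ignoring case.
# grep exits non-zero when nothing matches, hence the "|| true".
leftover=$(grep -ci 'solr rocks' "$workdir/solrconfig.xml" || true)
echo "placeholder queries left: $leftover"

# List the query strings that are configured for warming.
queries=$(sed -n 's|.*<str name="q">\([^<]*\)</str>.*|\1|p' "$workdir/solrconfig.xml")
echo "$queries"
```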


Upgrading to Solr 1.4

The version of Solr I have inherited is now looking a bit stale. Solr 1.4 provides a number of improvements, namely:

  • Performance enhancements in indexing, searching, and faceting
  • Improved Java index replication
  • Easier configuration
  • Additional document types can be indexed

Plus a bunch of other stuff.

Read the release notes for further information.

Anyway down to the nitty gritty.

There are a few things I am going to do before beginning this process.
Namely, I'm going to need a database ready and waiting for Solr to index.

Like many of us I use a MySQL database. Our dataset is quite large, so that's
going to need to be downloaded and set up before we start. If you need to do this, I suggest you kick off a dump, copy your live database to a development sandbox and get it up and running.

I don't have the luxury of being able to run an index on a live, active database.

Something else you should note is that Solr index files are not easily interchangeable between different versions of Solr, or after significant schema changes in Solr's configuration. So you can't really skip this step - make provisions for the time it's going to take to set up your database and indexes.

So, assuming your database is ready to roll, let's begin by downloading one of the Apache Solr release packages.

I chose these as they contain everything you need to run Solr, including Jetty, the servlet container which runs it.

I will be going for the following file ...
apache-solr-1.4.1.tgz

which you can grab here.

(You can download it directly using the following command.)
 wget http://www.mirrorservice.org/sites/ftp.apache.org//lucene/solr/1.4.1/apache-solr-1.4.1.tgz 


Once you have downloaded the package you will need to decompress it.

tar -xvf apache-solr-1.4.1.tgz


That is basically it. You should be able to start Solr and check it's working using the example application.

Go into the example folder and start Solr:

%> cd apache-solr-1.4.1/example
%> java -Xms1024M -Xmx2512M -server -jar start.jar


Solr should spit out a bunch of information, culminating in

INFO: [] Registered new searcher Searcher


Good stuff
We now have a working instance of Solr 1.4.1
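
If you would rather not watch the log, you can also ask the running instance whether it is alive. The sketch below assumes the defaults of the stock example app, which are not stated above: Jetty listening on port 8983 and a ping handler at /solr/admin/ping. Adjust the URL if your setup differs.

```shell
# Assumed defaults from the stock example app: port 8983 and the /solr/admin/ping handler.
solr_ping_url="http://localhost:8983/solr/admin/ping"

# -s silences progress output, -f turns HTTP errors into a non-zero exit,
# --max-time stops the check from hanging if nothing answers.
if curl -sf --max-time 5 "$solr_ping_url" > /dev/null 2>&1; then
  solr_status="up"
else
  solr_status="down"
fi
echo "Solr is $solr_status at $solr_ping_url"
```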

See my next post for turning this stock install into something meaningful.

Inheriting a bag of solr worms


Solr
 Installation & Configuration Guide



Having inherited a badly set-up and configured Solr installation, and tried to tame it, it's painfully obvious that what is needed is to bin it and start again.

Before I go any further: if you are new to this area and looking for a search engine to use for your site, do not choose Solr without taking a really good look at Sphinx (http://sphinxsearch.com/). In my experience it's much more lightweight and easier to get to grips with, and it doesn't have the high overheads of Java. It's a neat, fast C++-based application with great interfaces for PHP, Ruby and friends. If I had had a choice I probably wouldn't have chosen Solr for the project which I have inherited - it adds too many layers of complexity to an already complex web application running on a completely different set of technologies. Sphinx does most of what can be done with Solr, so make sure you know which features you will actually need and which you won't.

If you are experiencing performance issues and Solr is running like a dog, maybe this will
be a useful point of reference for you.

Having read the "7 deadly sins of Solr", I'm going to start off by upgrading our aging Solr search server to a newer version, then take a real, hard look at all those configuration options and see what's going wrong.

Follow these posts; they might be of assistance to you.