Monday 25 July 2011

Solr Schema.xml notes

I haven't blogged about Solr for a while. Busy with other projects and all that jazz.
Since I last blogged about Solr I have gotten a lot more knowledgeable about how it
all fits together.

This post is mostly intended as a reminder to myself of what everything in Solr's "schema.xml" file is for.

The Solr schema document is split into two main sections, with some additional parameters that can be specified afterwards.

These sections are as follows:

<schema>
    <types>
        ......
    </types>

    <fields>
        ......
    </fields>

    <uniqueKey ... />
    <defaultSearchField ... />
    <solrQueryParser ... />
</schema>
 
The schema.xml file is not scary at all: all it really does is map fields from your database to fields in the Solr index. Each field you specify in the fields section should be given a name, a type, and some options that define what should be done with the field once it is processed by Solr.
For example, if you have a text field in your database called "Title" you might decide to create a mapping like this...

<field name="Title" type="text" indexed="true" stored="true"/>

To explain what this does:

The "name" parameter tells Solr which field (from the database) we are dealing with. The "type" parameter must be the name of one of the fieldTypes defined in the "types" section of schema.xml. In this case we are using the "text" field type because, for our use case, we want a flexible, case-insensitive, stemmed match on search queries. You may want to use a different field type for yours.

The other two parameters are important as well.

indexed="true" -- means the field can be queried / searched against.
stored="true" -- means the value is stored in the index and can be retrieved for output in the result set.

It's entirely possible you might want to retrieve only a list of keys / IDs and pull the records directly from your database when generating your results. If that is what you want, you could set stored="false" for all your fields except the ID. In most cases, though, Solr is a great way to reduce load on your database - and if you are dedicating a server (or servers) to your search, why not pull all the data straight from Solr's index?
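As a sketch of that keys-only pattern (the field names here are hypothetical), the fields section might look like:

```xml
<fields>
   <!-- only the primary key is stored; everything else is index-only -->
   <field name="id"    type="string" indexed="true" stored="true"/>
   <field name="Title" type="text"   indexed="true" stored="false"/>
   <field name="Body"  type="text"   indexed="true" stored="false"/>
</fields>
```

A search would then return only id values, which you look up in your database to build the result page.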

Another consideration is the size of your dataset. If it is large and time-consuming to build your index, you will want to be careful about how you analyse fields and whether they really need to be stored. Failure to consider this may leave you with a search index that is difficult to manage and keep up to date.

So ... what about all those "fieldTypes" at the top of schema.xml ?

If you are just starting out and want to get your data indexed, you probably won't need to change anything here. I would advise reading through the comments in the schema.xml file, as the important fieldTypes are explained there. Remember the "type" attribute in the "fields" section? It must be a valid fieldType defined here.

Each fieldType defines how a field is handled in the index, so take care to pick the most appropriate type for each of your fields. Some of the fieldTypes are just basic mappings (string, integer and so on, defined at the top). The more sophisticated ones define which analysis is used, per field, at both index time and query time. A useful tip: if a fieldType you are using for one of your mappings is not quite doing what you want, copy the block of XML associated with it, give the copy a new name, and tweak its settings - using a different FilterFactory, for example.
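For instance, suppose you want the stock "text" behaviour but without stemming. A sketch of a copied-and-tweaked fieldType (the name "text_nostem" is made up; the tokenizer and filter factories are the standard ones shipped with Solr) might be:

```xml
<fieldType name="text_nostem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lowercase only: the stemming filter from the stock "text" type is left out -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Fields mapped with type="text_nostem" would then match case-insensitively, but only on whole words.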

Finally, there are some additional parameters at the bottom of the XML file that define defaults or allow you to copy fields.

As follows:

<uniqueKey>id</uniqueKey> - you must specify this; it names the field that uniquely identifies each document in the index.

<solrQueryParser defaultOperator="OR"/> - this specifies the default operator: do you want a search for "cat+dog" to find documents containing both ("cat AND dog"), or documents containing either ("cat OR dog")? This can be overridden at query time (with the q.op parameter) if needed.

<copyField source="srcfield" dest="destfield"/>

You can also configure "copyFields". These are useful if you want a single source field to be indexed / queried
in two different ways. Define them after the fields section.
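As a sketch (field names hypothetical): you might index a title once stemmed for searching and once as a raw string for sorting.

```xml
<field name="Title"      type="text"   indexed="true" stored="true"/>
<field name="Title_sort" type="string" indexed="true" stored="false"/>
<!-- at index time Solr copies the Title value into Title_sort as well -->
<copyField source="Title" dest="Title_sort"/>
```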

I hope this helps lift some of the mystery surrounding this configuration file.

Nick ...

Wednesday 23 March 2011

Disabling Adverts in Spotify on Ubuntu Linux

Ok .. this is completely unrelated to my Solr stuff.
But - I wanted to share this useful snippet.

(NB)

* This does not actually disable the adverts - it simply mutes the volume for approximately the length of a Spotify ad.
* It's not an automatic solution - it's only really useful if you are listening to Spotify while sitting at the keyboard.
* It works for me!

The reason I wrote this is because some of the Spotify adverts were

* Not relevant to me ( I am most definitely not interested in Tiny Tempah )
* Really irritating (Go Compare / Compare the Meerkat)

So I could have ponied up and gone for the subscription. However, there are a few reasons I haven't:

* My spotify usage is not really very high. (I have my entire CD collection already ripped and sitting locally)
* The Linux version is not up to scratch, so I use the Windows version through Wine.

I might re-evaluate this at some point in the future - this is how it is now.

So here's how it works.

I created a small shell script to mute the volume for 45 seconds and then bring it back up again.
The script for Ubuntu (it uses PulseAudio's pactl) is as follows.


#!/bin/bash

# mute the default PulseAudio sink (index 0), wait out the advert, then unmute.
# if your sink is not index 0, "pactl list short sinks" will show the right one.
pactl set-sink-mute 0 yes
sleep 45
pactl set-sink-mute 0 no


Save this script somewhere (e.g. ~/mute.sh) and make it executable:

#> chmod u+x ~/mute.sh
Now you are going to create a custom keybinding in Ubuntu
Go to 
System -> Preferences -> Keyboard Shortcuts
Click "Add"
you will get a pop-up window asking for a name and a command.
Call it  "MuteVol1Min"
and set the command to "~/mute.sh" (or wherever you saved the script)
Click Apply
Finally you need to bind it to a key combo.
click on the "shortcut" column next to the custom shortcut you just created.
I bound mine to the windows key + pause/break key - that comes up as Mod4+Pause.
You could choose the same combination - or try something different.
To use it: as soon as you notice that an advert has come up, hit the key combo - the sound will mute and come back 45 seconds later, after the advert has finished.
It's not perfect, but it works well for me as I'm usually listening and coding at the same time.
Hope this helps someone
N...

Tuesday 22 March 2011

Installing SOLR on EC2

Detailed below are my steps for installing Solr on EC2 with Ubuntu

Before we go any further, note that I am assuming you have already signed up for Amazon Web Services and have downloaded the command line tools.

If you are not there yet, work through the following resource before going any further:


* Install the command line tools https://help.ubuntu.com/community/EC2StartersGuide

Once you are ready please continue.

As of now, Lucidworks do not provide an AMI for installing Solr outside of the US, so I am going to create my own AMI to solve this problem.
Fortunately this is a fairly simple task - especially if you use the nightly build.

You will need to start by creating a new instance.

Using the EC2 command line tool "ec2-run-instances", fire up a new EC2 instance based off Ubuntu.
The format is as follows:

ec2-run-instances AMI-IMAGE-ID --instance-type INSTANCE-TYPE --region eu-west-1 -k AMAZON-KEY

For the purposes of this demo I am using a t1.micro, just so we can get the hang of the process. In no way am I recommending that you run the whole Solr stack on a t1.micro - that's crazy talk. Personally, I wouldn't recommend anything lower than an m1.large for production unless you have a very specific use case, e.g. very low traffic and a very small document set. (Please evaluate Sphinx to make sure Solr is right for you - life might get a lot easier.) Anyhow...

To determine the AMI you wish to use, you can query AWS using "ec2-describe-images", e.g.


#> ec2-describe-images --region eu-west-1 --all | grep 'ubuntu'


There are plenty of AMIs to choose from, so pipe the output into grep to help find the one
you are looking for - or just choose the same one as me. Pick one with an "ebs" store; you will need this, especially if you are testing out a micro instance. The AMI IDs are listed in the second column.

Once you are ready to roll issue the following command to fire up the instance...





#> ec2-run-instances ami-e974439d --instance-type t1.micro --region eu-west-1 -k myAWSkey
When this has completed, the output on the command line will contain the ID of the instance you have just fired up. You can then pass that ID to "ec2-describe-instances" to get the public DNS name of the instance. (You will need this to log into the VM.)

The standard ubuntu instances can be accessed through SSH. 

#> ssh -i /etc/aws-keys/YOUR-AWS-KEY.pem ubuntu@your.public.aws.dns.eu-west.compute.amazon.com
From here on in you are in a standard Ubuntu install, so you can go ahead and configure it however you need. For our purposes, though, we are just going to set up a search user and put the latest Solr build there for further use.

#> sudo useradd -d /home/search -m search
Now lets go to that newly created folder and install the nightly

#> cd /tmp
#> wget http://mirror.lividpenguin.com/pub/apache//lucene/solr/1.4.1/apache-solr-1.4.1.tgz
#> gunzip apache-solr-1.4.1.tgz
#> tar -xvf apache-solr-1.4.1.tar
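Incidentally, the gunzip and tar steps above can be combined: passing -z to tar decompresses on the fly. A quick self-contained sketch, using a throwaway archive under /tmp rather than the real Solr tarball:

```shell
# build a throwaway .tgz so we have something to extract
mkdir -p /tmp/tgz-demo/pkg
echo "hello" > /tmp/tgz-demo/pkg/README
tar -czf /tmp/tgz-demo/pkg.tgz -C /tmp/tgz-demo pkg
rm -rf /tmp/tgz-demo/pkg

# -z tells tar to gunzip on the fly, so no separate gunzip step is needed
tar -xzf /tmp/tgz-demo/pkg.tgz -C /tmp/tgz-demo
cat /tmp/tgz-demo/pkg/README
```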
Now let's move it to the search user's folder.

#> mv /tmp/apache-solr-1.4.1 /home/search

Before we can run Solr for the first time, you will need to install Java:

#> sudo apt-get update
#> sudo apt-get install openjdk-6-jre-headless
If the install fails, try again with --fix-missing added.
That's basically it. For now, you can run Solr by issuing the
following command:

#> sudo java -server -jar /home/search/apache-solr-1.4.1/example/start.jar
Of course, right now this isn't going to be of much use to you!
But it at least gives you some steps to go through in order to get a Solr server up and running on EC2. (The example app's Jetty listens on port 8983; remember you may need to open that port in your EC2 security group to reach it from outside.)

If you have been following my other posts, or you already have a Solr set-up from the nightlies running in a development sandbox, you could quite easily zip up the entire folder, copy it up to the instance using "scp", and then follow the rest of the steps.

I hope this helps get you started.

I will probably be doing a follow-up post on this tomorrow, so stay posted and let me know if you have any issues.

N.


Wednesday 2 March 2011

Solr : last_index_time -- explained

There seems to be some confusion about what Solr's "last_index_time" means.

This is a fairly important thing to understand when setting Solr up to do its Delta Updates.

Solr's last_index_time holds the timestamp of the last time an indexing operation STARTED - not, as some people believe, when it ended.

When performing updates, this value is usually used in the delta query to select records which have been modified since the last indexing operation began. It is critical to ensure that there is no window of time left uncovered between your first full-import and your scheduled updates; otherwise you could end up with documents missing from your index. Because the timestamp marks the start of a run, records modified while an import is in progress will simply be picked up again by the next delta - a little duplicate work, but no missed documents.
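As a concrete sketch, in a DataImportHandler data-config.xml the timestamp is exposed as the ${dataimporter.last_index_time} variable (the table and column names below are hypothetical):

```xml
<entity name="item" pk="id"
        query="SELECT id, title FROM item"
        deltaQuery="SELECT id FROM item
                    WHERE last_modified > '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT id, title FROM item
                          WHERE id = '${dataimporter.delta.id}'"/>
```

The deltaQuery finds the IDs touched since the last run started, and the deltaImportQuery re-fetches each of those rows for re-indexing.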

N...

Thursday 17 February 2011

Solr Configuration Options : Slow updates ( mergeFactor )

If you are experiencing slow Import / Updates there is a setting in solrconfig.xml that can dramatically affect the indexing time.


As described in the Solr documentation, mergeFactor is an important setting. The trade-off is that the higher you set the mergeFactor, the slower search queries become. The question is how much that matters to you. If your index only needs updating every week, you could set the mergeFactor low and get rapid searches. If you need your search results updated more frequently, you will have to set it higher (depending on the amount of content in your domain), possibly at the cost of slower search responses.

 

mergeFactor Tradeoffs


High value merge factor (e.g. 25):
* Pro: generally improves indexing speed
* Con: less frequent merges, resulting in a collection with more index files, which may slow searching

Low value merge factor (e.g. 2):
* Pro: a smaller number of index files, which speeds up searching
* Con: more segment merges slow down indexing
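The setting itself lives in solrconfig.xml. A sketch (10 is the stock default shipped with the example config):

```xml
<indexDefaults>
  <!-- higher = faster indexing but more segments; lower = faster searches, slower indexing -->
  <mergeFactor>10</mergeFactor>
</indexDefaults>
```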


Tuesday 8 February 2011

Configuring Solr 1.4.1

Those of you following my posts will recall that I inherited an already configured version of Solr. The trouble is, it was basically just the "example" version of Solr with a few tweaks.

According to "The 7 Deadly Sins of Solr", it's incredibly common to find Solr installs which are (like mine) just modifications of the example app - mine even stretches as far as still containing "Solr Rocks" in the solrconfig.xml file.

So, without further ado, I'm going to fix those two issues.

To start with, I'll be renaming the "example" folder to something proper.
You can call it whatever you feel is appropriate; just remember that from
now on I will be referring to the search app as "ProductIndex".

%> mv example ProductIndex


The next thing I am going to do is get rid of the "Solr Rocks" query and see if we can tidy up the config.

The main config files for Solr are located in "ProductIndex/solr/conf".
Let's see what else is there.

%> cd ProductIndex/solr/conf
%> ls -1

admin-extra.html
elevate.xml
mapping-ISOLatin1Accent.txt
protwords.txt
schema.xml
scripts.conf
solrconfig.xml
spellings.txt
stopwords.txt
synonyms.txt
xslt





The file we are interested in is "solrconfig.xml".

Let's attack it:

%> vim solrconfig.xml


OK, so let's find the "solr rocks" section and see what's going on there.

In vim:

:417

will take you directly to the line in question (the exact line number may vary between versions).

In context you will see something like this:

<!-- a firstSearcher event is fired whenever a new searcher is being
         prepared but there is no current registered searcher to handle
         requests or to gain autowarming data from. -->
<listener class="solr.QuerySenderListener" event="firstSearcher">
<arr name="queries">
<lst> <str name="q">solr rocks</str><str name="start">0</str><str name="rows">10</str></lst>
<lst><str name="q">static firstSearcher warming query from solrconfig.xml</str></lst>
</arr>
</listener>

So what does all that mean?

Well, basically it sets up simple warming queries that get run when a new searcher is being prepared and there is no existing searcher to autowarm from (much as the comment says). It is usually triggered when Solr is first started, to warm up the caches. Right now this is not going to be of any use unless, by some stroke of coincidence, you have documents containing the search term "Solr Rocks".

So let's put something sensible in there, like a really common query. You can specify several if you wish.

I'm going to change mine to the following.

 <!-- a firstSearcher event is fired whenever a new searcher is being
         prepared but there is no current registered searcher to handle
         requests or to gain autowarming data from. -->
    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst> <str name="q">Tiesto</str><str name="start">0</str><str name="rows">10</str></lst>
      </arr>
    </listener>


As the comment says, this will only get run when there is no current registered searcher - i.e. typically at start-up.


Upgrading to Solr 1.4

The version of Solr I have inherited is now looking a bit stale. Solr 1.4 provides a number of improvements:

* Performance enhancements in indexing, searching, and faceting
* Improved Java index replication
* Easier configuration
* Additional document types can be indexed

Plus a bunch of other stuff .

Read the release notes for further information.

Anyway down to the nitty gritty.

There are a few things I am going to do before beginning this process.
Namely, I'm going to need a database ready and waiting for Solr to index.

Like many of us, I use a MySQL database. Our dataset is quite large, so that's
going to need to be downloaded and set up before we start. If you need to do this, I suggest you kick off a dump, copy your live database to a development sandbox, and get it up and running.

I don't have the luxury of being able to run an index against a live, active database.

Also note that Solr's index files are not easily interchangeable between different versions of Solr, or after significant changes to Solr's schema configuration. So you can't really skip this step - make provision for the time it is going to take to set up your database and indexes.

So... assuming your database is ready to roll, let's begin by downloading one of the Apache Solr nightlies.

I chose these as they contain everything you need to run Solr, including Jetty, the HTTP servlet container which runs the framework.

I will be going for the following file ...
apache-solr-1.4.1.tgz

which you can download directly using the following command:

 wget http://www.mirrorservice.org/sites/ftp.apache.org//lucene/solr/1.4.1/apache-solr-1.4.1.tgz


Once you have downloaded the package, you will need to decompress it:

tar -xzvf apache-solr-1.4.1.tgz


That is basically it. You should be able to start Solr to check its working using the example application.

go into the example folder and start Solr ...

%> cd apache-solr-1.4.1/example
%> java -Xms1024M -Xmx2512M -server -jar start.jar


Solr should spit out a bunch of information, culminating in:

INFO: [] Registered new searcher Searcher


Good stuff
We now have a working instance of Solr 1.4.1

See my next post for turning this stock install into something meaningful.

Inheriting a bag of solr worms


Solr Installation & Configuration Guide

Having inherited a badly set-up and configured Solr installation, and having tried to tame it, it's painfully obvious that what is needed is to bin it and start again.

Before I go any further: if you are new to this area and looking for a search engine for your site, do not choose Solr without taking a really good look at Sphinx (http://sphinxsearch.com/). In my experience it is much more lightweight and easier to get to grips with, and it doesn't carry the high overheads of Java. It's a neat, fast C-based application with great interfaces for PHP, Ruby and friends. If I had a choice, I probably wouldn't have chosen Solr for the project I have inherited - it adds too many layers of complexity to an already complex web application running on a completely different set of technologies. Sphinx does most of what can be done with Solr, so make sure you know which features you will actually need and which you won't.

If you are experiencing performance issues and Solr is running like a dog, maybe this will
be a useful point of reference for you.

Having read the "7 Deadly Sins of Solr", I'm going to start off by upgrading our ageing Solr search server to a newer version, then take a real, hard look at all those configuration options and see what's going wrong.

Follow these posts - they might be of assistance to you.