Friday 12 July 2013

How to compile a custom similarity class for SOLR / Lucene using Eclipse (for someone who hasn't touched Java for a long time....)


Ok, so you have come to the realisation that the Solr scoring algorithm is not quite doing what you need for the task at hand. You have scoured the net for possible solutions, even pestered the nerds on the #solr IRC channels. After exhausting all the possibilities you realise you are going to have to compile a new similarity class for Solr and tweak it to your needs.

Note: It's been a long while since I did anything Java related. I welcome comments and suggestions - especially if the method outlined below seems a bit weird. I am writing this because there is little documentation on how this is done, and I wish there had been something to get me started in this area.

Assumptions: I assume that you are familiar with Eclipse and have it up and running (many people use Eclipse for web development that does not involve Java via one of its many plugins, eg: PHP, Ruby).

In order to get up and running you will need some files from the distribution of Solr that you are running. These files are contained within a ".war" file that comes with your Solr distribution. I recommend using the file (outlined below) from the same version you are going to be using the compiled similarity class with.

You are looking for a file called 

apache-solr-4.0.0.war   (your version numbers may be different)

this file usually resides in the "dist" folder.

make a folder in your eclipse "workspace"

eg :

%> mkdir ~/workspace/solr_war

copy the file here 

%> cp /path/to/apache-solr-4.0.0.war ~/workspace/solr_war

unpack the "war" file

%> cd  ~/workspace/solr_war
%> unzip apache-solr-4.0.0.war

.... stuff happens!

Ok, now that part is done we can move on to the Eclipse part

fire up eclipse

when loaded click

File - > New -> Java Project

give the project a name, eg:

MyNewSimilarityClass

click "Next"

click "Libraries"

click "Add External Jars"

navigate to ~/workspace/solr_war/WEB-INF/lib

select ALL jar files in this folder and click "OK" 

then click "Finish"

At this point Eclipse is set up for you to create a new class, compile it and export it to a jar file.

--------------

Creating a new class

In Eclipse - on the left hand side where you have your new project 

Right click -> New -> Class

Name the class eg : MyNewSimilarityClass

and click finish.

At this point you will have the stub of a class in your Eclipse editor.

You will probably want to change this so that your class extends the DefaultSimilarity class;
then you can simply over-ride its functions.

In my case I wanted to remove IDF (Inverse Document Frequency) from the scoring algorithm, so
my class ended up something like this...


package org.apache.lucene.search.similarities;

public class MyDefaultSimilarity extends DefaultSimilarity {

  // Return a constant so term rarity no longer affects the score.
  @Override
  public float idf(long docFreq, long numDocs) {
    return 1.0f;
  }
}

What your code contains may well be different from mine depending on your use
case. There are other functions in DefaultSimilarity that can be over-ridden, in addition to other scoring implementations you could extend. Please refer to the Solr wiki and browse the Lucene search similarities packages to find out more.
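For context: in Lucene 4.x the stock DefaultSimilarity idf is 1 + ln(numDocs / (docFreq + 1)), so returning a constant 1.0f (as above) removes the rare-term boost entirely. Here is a minimal self-contained sketch of the difference (the IdfDemo class is mine for illustration; no Lucene jars needed):

```java
public class IdfDemo {
    // The stock Lucene DefaultSimilarity formula: rare terms score higher.
    static float defaultIdf(long docFreq, long numDocs) {
        return (float) (Math.log((double) numDocs / (docFreq + 1)) + 1.0);
    }

    // Flattened idf, as in MyDefaultSimilarity above: every term weighs the same.
    static float flatIdf(long docFreq, long numDocs) {
        return 1.0f;
    }

    public static void main(String[] args) {
        // A rare term (in 10 of 1,000,000 docs) vs a common one (in 500,000).
        System.out.println(defaultIdf(10, 1_000_000));      // ≈ 12.42
        System.out.println(defaultIdf(500_000, 1_000_000)); // ≈ 1.69
        System.out.println(flatIdf(10, 1_000_000));         // 1.0 either way
    }
}
```

With the flat version, a query term that appears in almost every document contributes just as much to the score as a very rare one.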

Building a JAR file for use with SOLR

This one is nice and easy!

right click on your Java project

go to Export -> Java -> Jar File

Name the jar file , and pick the file destination

Click "Finish"

you will now have a jar file that can be used with your SOLR distribution.

Using a JAR file with SOLR

Your new JAR file will need to be copied into the "lib" folder of your instance folder.
This is usually in the same directory as your solr.xml file, so change to the folder where this file is
located, eg:

%> mkdir /path/to/instancedir/lib

then copy JAR file here

%>  cp /path/to/myjarfile.jar /path/to/instancedir/lib/

now that your jar file is in place you just need to make sure that Solr is configured to use it

use your favorite text editor to open solr.xml

%> vi /path/to/instancedir/solr.xml

and see that the following is in place

<solr persistent="true" sharedLib="lib">

note the ' sharedLib="lib" ' - if you have a different directory structure
you should specify it here, otherwise ensure it is as above!

Finally, ensure that schema.xml is configured to use the new class

in my version of Solr, near the bottom of the schema.xml file, are the following lines




  <!--
     <similarity class="com.example.solr.CustomSimilarityFactory">
       <str name="paramkey">param value</str>
     </similarity>
    -->


Uncomment it and change it to point at your new class (using the package and class name from earlier):

     <similarity class="org.apache.lucene.search.similarities.MyDefaultSimilarity"/>


Restart Solr to start using your new class! Hope this helps.

Nick ...

Monday 25 July 2011

Solr Schema.xml notes

I haven't blogged about Solr for a while. Busy with other projects and all that jazz.
Since I last blogged about Solr I have gotten a lot more knowledgeable about how it
all fits together.

This post is mostly intended to be useful to myself in remembering what everything in Solr's "schema.xml" file is for.

The Solr schema document is split into two main sections, with some additional parameters that can be specified afterwards.

These sections are as follows:

<schema>
    <types>
        ......
    </types>

    <fields>
        ......
    </fields>

    <uniqueKey ... />
    <defaultSearchField ... />
    <solrQueryParser ... />
</schema>

The schema.xml file is not scary at all - all it really does is describe the mappings from your database fields to fields in the Solr index. Each field you specify in the fields section should be given a name, a type and some options that define what should be done with the field once it is processed by Solr.
For example, if you have a text field in your database called "Title" you might decide to create a mapping like this...

<field name="Title" type="text" indexed="true" stored="true"/>

to explain what this does....

The field "name" parameter tells Solr what field (from the database) we are dealing with. The "type" parameter must be one of the "fieldType" names defined in the "types" section of schema.xml. In this case we are using the text field type because, for our use case, we want a flexible, case-insensitive, stemmed match on search queries. You may want to use a different field type for yours.

The other two parameters are important as well.

indexed="true" --- means the field can be queried / searched against
stored="true" --- means the value is stored and can be retrieved for output in the result set
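To illustrate the common combinations (field names other than "Title" are hypothetical):

```xml
<!-- searchable and retrievable -->
<field name="Title" type="text" indexed="true" stored="true"/>

<!-- searchable but not returned: the value lives only in your database -->
<field name="Body" type="text" indexed="true" stored="false"/>

<!-- returned but not searchable: pure display data -->
<field name="ThumbnailUrl" type="string" indexed="false" stored="true"/>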

It's entirely possible you might want to retrieve only a list of keys/IDs and pull the records directly from your database when generating your results. If that is what you want, you could set stored="false" for all your fields except the ID. In most cases, though, Solr is a great way to reduce load on your database - and if you are dedicating server(s) to your search, why the hell not pull all the data from Solr's index?

Another consideration is the size of your dataset. If it is large and time consuming to build your index, you will want to be careful about how you analyse fields and whether they really need to be stored or not. Failure to consider this may leave you with a search index that is difficult to manage and keep up to date.

So ... what about all those "fieldTypes" at the top of schema.xml ?

If you are just starting out and want to get your data indexed you probably won't need to change anything here! I would advise reading through the comments in the schema.xml file, as the important fieldTypes are explained there. Remember the "type" attribute in the "fields" section? It must be a valid fieldType defined here.

Each fieldType defines how fields are handled in the index - take care to use the most appropriate types for your fields. Some of the fieldTypes are just basic mappings (string, integer etc, defined at the top). The more sophisticated ones define which analyzers are used, per field, at both index time and query time. A useful tip: if a field type you are using for one of your mappings is not quite
doing what you want, copy the block of XML associated with that fieldType, give the copy a new name, and tweak its settings - swapping in a different FilterFactory, for example.
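As a sketch of that tip, here is a hypothetical "text_nostem" variant that keeps lowercasing but drops stemming:

```xml
<fieldType name="text_nostem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- no stemming filter here, unlike the stock "text" type -->
  </analyzer>
</fieldType>
```

Fields mapped to "text_nostem" would then match on exact (lowercased) words only.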

Finally there are some additional parameters at the bottom of the XML file that define defaults or allow you to copy fields.

as follows:

<uniqueKey>id</uniqueKey> - you must specify this (self-explanatory)

<solrQueryParser defaultOperator="OR"/>

In solrQueryParser you specify the default operator: do you want searches for "cat+dog" to find documents containing "cat AND dog", or documents containing "cat OR dog"? This can be changed at query time if needed.

<copyField source="srcfield" dest="destfield"/>

You can also configure "copyFields". This is useful if you want a field to be indexed / queried
using two different methods. Define these afterwards.
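A sketch (field names hypothetical): index the same title both stemmed and as an exact string:

```xml
<field name="Title" type="text" indexed="true" stored="true"/>
<field name="TitleExact" type="string" indexed="true" stored="false"/>
<copyField source="Title" dest="TitleExact"/>
```

At index time Solr copies the value of "Title" into "TitleExact", so each field can be analysed with its own type.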

I hope this helps lift some of the mystery of this configuration file.

Nick ...

Wednesday 23 March 2011

Disabling Adverts in Spotify on Ubuntu Linux

Ok .. this is completely unrelated to my Solr stuff.
But - I wanted to share this useful snippet.

(NB)

* This does not actually disable the adverts - it simply mutes the volume for approximately the length of a Spotify ad.
* It's not an automatic solution - it's only really useful if you are listening to Spotify while sitting at the keyboard.
* It works for me!

The reason I wrote this is because some of the Spotify adverts were

* Not relevant to me ( I am most definitely not interested in Tiny Tempah )
* Really irritating (Go Compare / Compare the Meerkat)

So I could have ponied up and gone for the subscription. However, there are a few reasons I haven't:

* My spotify usage is not really very high. (I have my entire CD collection already ripped and sitting locally)
* The Linux version is not up to scratch, so I use the Windows version through Wine.

I might re-evaluate this at some point in the future, but this is how it is now.

So here's how it works.

I created a small shell script to mute the volume for 45 seconds and then bring it back up again.
The script for Ubuntu is as follows:


#!/bin/bash

# Mute the default PulseAudio sink (sink 0), wait roughly one ad break, unmute.
pactl set-sink-mute 0 yes
sleep 45
pactl set-sink-mute 0 no


Save this script somewhere ( eg ~/mute.sh )
make it executable
#> chmod u+x ~/mute.sh
Now you are going to create a custom keybinding in Ubuntu
Go to 
System -> Preferences -> Keyboard Shortcuts
Click "Add"
you will get a pop-up window asking for a name and command
Call it  "MuteVol1Min"
and set the command to "~/mute.sh" (or wherever you saved the script)
Click Apply
Finally you need to bind it to a key combo.
click on the "shortcut" column next to the custom shortcut you just created.
I bound mine to the windows key + pause/break key - that comes up as Mod4+Pause.
You could choose the same combination - or try something different.
To use it: as soon as you notice an advert come on, hit the key combo - the sound will mute and come back 45 seconds later, after the advert has finished.
It's not perfect, but it works well for me as I'm usually listening and coding at the same time.
Hope this helps someone
N...

Tuesday 22 March 2011

Installing SOLR on EC2

Detailed below are my steps for installing Solr on EC2 with Ubuntu

Before we go any further, I would like you to note that I am assuming you have
already signed up for Amazon Web Services and have also downloaded the command line tools.

If you are not there yet, then you will need to work through the following resource
before going any further.


* Install the command line tools https://help.ubuntu.com/community/EC2StartersGuide

Once you are ready please continue.

As of now Lucidworks do not provide an AMI for running Solr outside of the US,
so I am going to need to create my own AMI to solve this problem.
Fortunately this is a fairly simple task - especially if you use the nightly build.

You will need to start by creating a new instance.

Using the EC2 command line tool "ec2-run-instances", fire up a new EC2 instance based off Ubuntu.
The format is as follows:

ec2-run-instances AMI-IMAGE-ID --instance-type INSTANCE-TYPE --region eu-west-1 -k AMAZON-KEY

For the purposes of the demo I am using a t1.micro, just so we can get the hang of the process. In no way am I recommending that you run the whole Solr stack on a t1.micro - that's crazy talk. Personally I wouldn't recommend anything lower than an m1.large for production unless you have a very specific use case, eg: very low traffic with a very small document set. (Please also evaluate Sphinx to make sure Solr is right for you - life might get a lot easier.) Anyhow...

To determine the AMI you wish to use, you can query AWS using "ec2-describe-images",
eg:


#> ec2-describe-images --region eu-west-1 --all | grep 'ubuntu'


There are plenty of AMIs to choose from, so pipe the output into grep to help search for the one
you are looking for - or just choose the same one as me. Pick one with "ebs" storage; you will need this, especially if you are testing out a micro instance. The AMI IDs are listed in the second column.

Once you are ready to roll issue the following command to fire up the instance...





#> ec2-run-instances ami-e974439d --instance-type t1.micro --region eu-west-1 -k myAWSkey

When this has completed, the output on the command line will contain the id of the instance you have just fired up. You will use this to get the public DNS of the instance. (This is needed so you can log into the VM.)

The standard Ubuntu instances can be accessed through SSH.

#> ssh -i /etc/aws-keys/YOUR-AWS-KEY.pem ubuntu@your.public.aws.dns.eu-west.compute.amazon.com

From here on in you are in a standard Ubuntu install, so you can go ahead and configure it however you might need. For our purposes, though, we are just going to set up a search user and place the latest Solr release there for further usage.

#> sudo useradd -d /home/search -m search

Now let's download the tarball and unpack it:

#> cd /tmp
#> wget http://mirror.lividpenguin.com/pub/apache//lucene/solr/1.4.1/apache-solr-1.4.1.tgz
#> gunzip apache-solr-1.4.1.tgz
#> tar -xvf apache-solr-1.4.1.tar
Now let's move it to the search user's folder

#> mv /tmp/apache-solr-1.4.1 /home/search

Before we can run Solr for the first time, you will need to install Java:

#> sudo apt-get update
#> sudo apt-get install openjdk-6-jre-headless

If you have any problems with the install failing, try again adding --fix-missing.

That's basically it. You can now run Solr by issuing the
following command:

#> sudo java -server -jar /home/search/apache-solr-1.4.1/example/start.jar

Of course, right now this isn't going to be of much use to you!
But it at least gives you some steps to go through in order to get a Solr server up and running on EC2.

If you have been following my other posts - or maybe you have a Solr set-up from the nightlies running in a development sandbox - you could quite easily zip up the entire folder, copy it to the instance using "scp", and then follow the rest of the steps.

I hope this helps get you started.

I will probably be doing a follow-up post on this tomorrow, so stay posted and let me know if you have any issues.

N.


Wednesday 2 March 2011

Solr : last_index_time -- explained

There seems to be some confusion about what solr's "last_index_time" means.

This is a fairly important thing to understand when setting Solr up to do its Delta Updates.

Solr's last_index_time holds the timestamp of the last time an indexing operation STARTED, not, as some people believe, when it ended.

When performing delta updates this value is usually used in the delta query to select records which have been modified since the last indexing operation began. It is critical to ensure that there is no window of time between the "last_index_time" recorded by your first full-import and your scheduled updates; otherwise you could end up with documents missing from your index.
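For example, in a DataImportHandler data-config.xml the delta queries typically reference this timestamp via ${dataimporter.last_index_time} (the table and column names below are hypothetical):

```xml
<entity name="product"
        query="SELECT id, title FROM products"
        deltaQuery="SELECT id FROM products
                    WHERE updated_at > '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT id, title FROM products
                          WHERE id = '${dataimporter.delta.id}'"/>
```

Because the timestamp marks the start of the previous run, rows modified while that run was in progress are still picked up by the next delta import.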

N...

Thursday 17 February 2011

Solr Configuration Options : Slow updates ( mergeFactor )

If you are experiencing slow Import / Updates there is a setting in solrconfig.xml that can dramatically affect the indexing time.


As described in the Solr documentation, mergeFactor is an important consideration. The catch is that the higher you set the mergeFactor, the slower search queries will respond. The question then becomes how much that matters to you. If your index only needs updating every week, you could set the mergeFactor low and get rapid searches. On the other hand, if you need your search results updated more frequently you will have to set it higher (depending on the amount of content in your domain), possibly having a negative impact on the speed of your search responses.

 

mergeFactor Tradeoffs


High value merge factor (e.g., 25):
  • Pro: Generally improves indexing speed
  • Con: Less frequent merges, resulting in a collection with more index files which may slow searching
Low value merge factor (e.g., 2):
  • Pro: Smaller number of index files, which speeds up searching.
  • Con: More segment merges slow down indexing. 
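In solrconfig.xml this is set inside the indexing sections; a sketch (the value is illustrative - 10 is the shipped default):

```xml
<indexDefaults>
  <!-- raise for faster bulk indexing, lower for faster searches -->
  <mergeFactor>10</mergeFactor>
</indexDefaults>
```

After changing it, re-index (or at least optimize) to see the effect on segment counts.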


Tuesday 8 February 2011

Configuring Solr 1.4.1

Those of you following my posts will recall that I inherited an already-configured version of Solr. The trouble is, it was basically just the "example" version of Solr with a few tweaks.

According to "The 7 Deadly Sins of Solr" it's incredibly common to find Solr installs which are (like mine) just modifications of the example app - mine even stretches as far as still containing "Solr Rocks" in the solrconfig.xml file.

So, without further ado, I'm going to fix those two issues.

To start with I'll be renaming the "example" folder to something proper.
You can call it whatever you feel is appropriate; just remember that from
now on I will be referring to the search app as "ProductIndex".

%> mv example ProductIndex


The next thing I am going to do is get rid of the "Solr Rocks" query and see if we can tidy up the config.

The main config file for Solr is located in "ProductIndex/solr/conf".
Let's see what else is there.

%> cd ProductIndex/solr/conf
%> ls -1

admin-extra.html
elevate.xml
mapping-ISOLatin1Accent.txt
protwords.txt
schema.xml
scripts.conf
solrconfig.xml
spellings.txt
stopwords.txt
synonyms.txt
xslt





The file we are interested in is the "solrconfig.xml"

Let's attack it:

%> vim solrconfig.xml


Ok, so let's find the "solr rocks" section and see what's going on there

in vim

esc :417

will take you directly to the line in question

in context you will see something like this

<!-- a firstSearcher event is fired whenever a new searcher is being
         prepared but there is no current registered searcher to handle
         requests or to gain autowarming data from. -->
<listener class="solr.QuerySenderListener" event="firstSearcher">
<arr name="queries">
<lst> <str name="q">solr rocks</str><str name="start">0</str><str name="rows">10</str></lst>
<lst><str name="q">static firstSearcher warming query from solrconfig.xml</str></lst>
</arr>
</listener>

So what does all that mean?

Well - basically it sets up a simple warming query that gets run when a new searcher is being
prepared and there is no existing searcher to autowarm from (kind of like it says in the comment). It is usually used
when Solr is first started, to warm up the cache. Right now this is not going to be of any use unless, by some stroke of coincidence, you have documents containing the search term "Solr Rocks".

So let's put something sensible in there, like a really common query. You can specify several if you wish.

I'm going to change mine to say the following:

 <!-- a firstSearcher event is fired whenever a new searcher is being
         prepared but there is no current registered searcher to handle
         requests or to gain autowarming data from. -->
    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst> <str name="q">Tiesto</str><str name="start">0</str><str name="rows">10</str></lst>
      </arr>
    </listener>


Reading further, the comment in the code says this will get run when there is no current registered searcher.