At the software level:
- Can a search query have spelling mistakes?
- Should stop words (e.g. a, the) be filtered?
- What about a phrase search where the query is not the exact phrase?
At the operations level:
- Should the search be decoupled from the app machines?
- Should the search be distributed? If so, how many shards and replicas should there be?
A quick search will tell you that Apache Lucene is the industry standard. There are two popular abstractions on top of Lucene: Solr and ElasticSearch (ES).
There is a lot of debate about which one to use. I chose ES because:
- it's distributed by design
- it's easier to integrate with AWS EC2
This post covers how to install ElasticSearch on a Linux machine (I like to use the Ubuntu 12.04 build from EC2).
Download elasticsearch from elasticsearch.org. Extract the files and put them into a folder of your choice (e.g. /opt/tools).
cd /opt/tools
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.5.zip
unzip elasticsearch-0.90.5.zip
cd elasticsearch-0.90.5
You can start elasticsearch by:
bin/elasticsearch -f
You may want to tweak the Xmx (the maximum heap size for the JVM) and Xms (the initial heap size for the JVM) values.
bin/elasticsearch -f -Xmx2g -Xms2g -Des.index.storage.type=memory -Des.max-open-files=true
You can also run it as a service using the script located in bin/service.
After you have started the service, visit "http://localhost:9200" in a browser. You should see something like the following:
{
  "ok" : true,
  "status" : 200,
  "name" : "Solitaire",
  "version" : {
    "number" : "0.90.5",
    "build_hash" : "c8714e8e0620b62638f660f6144831792b9dedee",
    "build_timestamp" : "2013-09-17T12:50:20Z",
    "build_snapshot" : false,
    "lucene_version" : "4.4"
  },
  "tagline" : "You Know, for Search"
}
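If you are on a headless EC2 box, you can run the same check from the command line instead of a browser. Below is a minimal sketch, assuming curl is installed and the node listens on the default port 9200. The helper names es_version_from_json and es_is_up are my own for illustration and are not part of ElasticSearch; they use plain sed and grep on the response text rather than a real JSON parser, so treat them as a quick sanity check only.

```shell
#!/bin/sh
# Hypothetical helper: reads the root-endpoint JSON on stdin and prints
# the value of the first "number" field (the ElasticSearch version).
es_version_from_json() {
    sed -n 's/.*"number"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: reads the same JSON on stdin and succeeds
# (exit code 0) only if it contains "status" : 200.
es_is_up() {
    grep -q '"status"[[:space:]]*:[[:space:]]*200'
}

# Usage against a running node (assumes curl and the default port):
#   curl -s http://localhost:9200 | es_version_from_json
#   curl -s http://localhost:9200 | es_is_up && echo "node is up"
```

The functions only read stdin, so you can also try them on a saved copy of the response before pointing them at a live node.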