developer24hours: elastic search


Wednesday, October 9, 2013

Elastic Search on EC2 - Install ES cluster on Amazon Linux AMI

We will install ElasticSearch (ES) on an EC2 instance.

Here are the specs:

Amazon Linux AMI 2013.09
Medium instance
64-bit machine
Elastic Search 0.90.5
Spring MVC
Maven

Begin by launching an instance. If you launch a micro instance, ES may run out of memory; look for an out of memory error in /var/log/syslog. If you are not sure how to launch an instance, read Amazon EC2 - Launching Ubuntu Server 12.04.1 LTS step by step guide.

For the security group, you will need to open the following ports:

22 (SSH)
9300 (ElasticSearch Transport)
9200 (HTTP Testing)
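
If you prefer the command line to the AWS console, the same three rules can be scripted. The following is only a sketch: it assumes the AWS CLI is installed and configured, and the security group id and CIDR range are made-up placeholders. It prints the commands instead of running them so you can review them first.

```shell
#!/bin/sh
# Print (not run) the AWS CLI calls that would open the three ports.
# The group id and CIDR passed in are hypothetical placeholders.
ingress_commands() {
    sg_id=$1
    cidr=$2
    for port in 22 9300 9200; do
        printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port %s --cidr %s\n' \
            "$sg_id" "$port" "$cidr"
    done
}

# Review the output, then pipe it to sh if it looks right.
ingress_commands "sg-12345678" "203.0.113.10/32"
```

Ports 9200 and 9300 give full access to the cluster, so it is usually safer to restrict them to your own address range rather than opening them to everyone.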

Attach Two EBS drives

We will be using one for saving data and one for logging. Create and attach two EBS drives in the AWS console.

You will have two volumes: /dev/xvdf and /dev/xvdg. Let's format them using XFS.

sudo yum -y install xfsprogs xfsdump
sudo mkfs.xfs /dev/xvdf
sudo mkfs.xfs /dev/xvdg

Make the data drive /vol. Make the log drive /vol1.

vi /etc/fstab

Append the following:

/dev/xvdf /vol xfs noatime 0 0
/dev/xvdg /vol1 xfs noatime 0 0

Mount the drives

mkdir /vol
mkdir /vol1
mount /vol
mount /vol1
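
A mistyped mount point in /etc/fstab is easy to miss until mount complains. As a rough sanity check, the sketch below lists every fstab mount point that has no matching directory. The function name is my own, and it only looks at the second field of each non-comment line.

```shell
#!/bin/sh
# List fstab mount points whose directory does not exist.
# Usage: missing_mount_points /etc/fstab
missing_mount_points() {
    fstab=$1
    # Skip comments and blank lines; field 2 is the mount point.
    awk '$1 !~ /^#/ && NF >= 2 { print $2 }' "$fstab" |
    while read -r mount_point; do
        case $mount_point in
            # Only absolute paths are real mount points (skips "swap", "none").
            /*) [ -d "$mount_point" ] || echo "$mount_point" ;;
        esac
    done
}
```

Run it as missing_mount_points /etc/fstab after creating the directories. No output means every mount point exists.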

Read Amazon EC2 - Mounting a EBS drive for more information.

ssh into the instance

ssh -i {key} ec2-user@{ec2_public_address}

Update the machine

sudo yum -y update

Install Oracle Sun Java

To run ES efficiently, the JVM must be able to allocate a large virtual address space and garbage-collect large heaps without long pauses. There are also reports online that OpenJDK does not perform as well as Oracle Java for ES. Feel free to let me know in the comments below if this is not the case.

Download Java 7 from Oracle.

Put it in /usr/lib/jvm.

Extract and install it

tar -zxvf jdk-7u40-linux-x64.tar.gz

Rename the folder from jdk1.7.0_40 to jdk1.7.0

You should now have jdk1.7.0 inside /usr/lib/jvm

Register java and javac with alternatives.

sudo /usr/sbin/alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.7.0/bin/java" 1
sudo /usr/sbin/alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.7.0/bin/javac" 1

Correct the permissions.

sudo chmod a+x /usr/bin/java
sudo chmod a+x /usr/bin/javac
sudo chown -R root:root /usr/lib/jvm/jdk1.7.0

Select the Oracle (Sun) Java as the default:

sudo /usr/sbin/alternatives --config java

Check your java version.

java -version

Download and install ElasticSearch

Download ElasticSearch (Current version as of this writing is 0.90.5).

sudo su
mkdir /opt/tools
cd /opt/tools
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.5.zip
unzip elasticsearch-0.90.5.zip

Install ElasticSearch Cloud AWS plugin.

cd elasticsearch-0.90.5
bin/plugin -install elasticsearch/elasticsearch-cloud-aws/1.15.0

Configuring ES

AWS can shut down your instances at any time. If you are storing indexed data in ephemeral drives, you will lose all the data when all the instances are shut down.

There were two ways to persist data:

Store data in EBS via local gateway
Store data in S3 via S3 gateway

A restart of the nodes begins recovering data from the gateway. The EBS route is better for performance, while the S3 route was better for persistence.

We will be setting up an ES cluster with a local gateway. The S3 gateway is deprecated at the time of this writing. The ES team has promised a new backup mechanism in the future.

vi /opt/tools/elasticsearch-0.90.5/config/elasticsearch.yml

cluster.name: mycluster
cloud:
    aws:
        access_key:
        secret_key:
        region: us-east-1
discovery:
    type: ec2

We have specified a cluster called "mycluster" above. You will need to fill in your AWS access key and secret key. An S3 bucket is only needed for the deprecated S3 gateway, not for the local gateway used here.

We also need to ensure the JVM does not swap by doing two things:

1) Locking the memory (find this setting inside elasticsearch.yml)

bootstrap.mlockall: true

2) Setting ES_MIN_MEM and ES_MAX_MEM to the same value. It is also recommended to set them to half of the system's available RAM. We will set this in the ElasticSearch Service Wrapper later in the article.

Create the data and log paths.

mkdir -p /vol/elasticsearch/data
mkdir -p /vol1/elasticsearch/logs

Set the data and log paths in config/elasticsearch.yml

path.data: /vol/elasticsearch/data
path.logs: /vol1/elasticsearch/logs
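
The directories on disk must match path.data and path.logs exactly, and a one-letter difference (log versus logs) is easy to overlook. One way to avoid that is to create the directories from the config file itself. This is a minimal sketch with function names of my own; it only understands the flat one-line "key: value" form shown above.

```shell
#!/bin/sh
# Print the value of a top-level "key: value" setting from a YAML file.
# Only handles the flat one-line form used for path.data and path.logs.
yaml_value() {
    setting=$1
    file=$2
    sed -n "s/^$setting:[[:space:]]*//p" "$file" | head -n 1
}

# Create the data and log directories named in elasticsearch.yml.
ensure_es_paths() {
    config=$1
    for key in path.data path.logs; do
        dir=$(yaml_value "$key" "$config")
        if [ -n "$dir" ]; then
            mkdir -p "$dir"
        fi
    done
}
```

For this setup the call would be ensure_es_paths /opt/tools/elasticsearch-0.90.5/config/elasticsearch.yml, run as a user that can write to /vol and /vol1.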

Let's edit config/logging.yml

vi /opt/tools/elasticsearch-0.90.5/config/logging.yml

Edit these settings and make sure these lines are uncommented and present:

logger:
  gateway: DEBUG
  org.apache: WARN
  discovery: TRACE

Testing the cluster

bin/elasticsearch -f

Browse to the ec2 address at port 9200

http://ec2-XX-XXX-XXX-XXX.compute-1.amazonaws.com:9200/

You should see the following:

{
  "ok" : true,
  "status" : 200,
  "name" : "Storm",
  "version" : {
    "number" : "0.90.5",
    "build_hash" : "c8714e8e0620b62638f660f6144831792b9dedee",
    "build_timestamp" : "2013-09-17T12:50:20Z",
    "build_snapshot" : false,
    "lucene_version" : "4.4"
  },
  "tagline" : "You Know, for Search"
}
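
When you bring up more than one node, it can be handy to check each node's reply from a script rather than a browser. The sketch below pulls the version number out of the JSON shown above. The function name is my own, and the parsing is a plain sed match that assumes the "number" key appears once per line, as in that reply; it is not a general JSON parser.

```shell
#!/bin/sh
# Extract the "number" field from the JSON that ES returns at its root URL.
# Reads the JSON on stdin and prints the first match, or nothing if absent.
es_version() {
    sed -n 's/.*"number"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}
```

Usage would look like curl -s http://localhost:9200/ | es_version, which for the reply above prints 0.90.5.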

Installing ElasticSearch as a Service

We will be using the ElasticSearch Java Service Wrapper.

Download the service wrapper and move it to bin/service.

curl -L -k http://github.com/elasticsearch/elasticsearch-servicewrapper/tarball/master | tar -xz
mv *servicewrapper*/service /opt/tools/elasticsearch-0.90.5/bin/

Make ElasticSearch start automatically when the system reboots.

bin/service/elasticsearch install

Make ElasticSearch Service a default command (we will call this es_service)

ln -s /opt/tools/elasticsearch-0.90.5/bin/service/elasticsearch /usr/bin/es_service

Start the service

es_service start

You should see:

Starting ElasticSearch...
Waiting for ElasticSearch......
running: PID:2503

Tweaking the memory settings

There are three settings you want to care about:

ES_HEAP_SIZE
ES_MIN_MEM
ES_MAX_MEM

It is recommended to set ES_MIN_MEM to be the same as ES_MAX_MEM. However, you can just set ES_HEAP_SIZE as it will be assigned to both ES_MIN_MEM and ES_MAX_MEM.

We will be tweaking these settings in the service wrapper's elasticsearch.conf rather than in ElasticSearch's own configuration.

vi /opt/tools/elasticsearch-0.90.5/bin/service/elasticsearch.conf

set.default.ES_HEAP_SIZE=1024

There are a few things you need to be aware of.

You need to leave some memory for the OS and other non-ElasticSearch operations. Try leaving at least half of the available memory.
As a rough reference, allow 1024 MB for every 1 million documents you are saving.
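
To apply the "half of the memory" rule without doing the arithmetic by hand, something like the following can work. It is a sketch, not part of the service wrapper: the function name is my own, and it assumes a Linux machine where /proc/meminfo reports MemTotal in kB.

```shell
#!/bin/sh
# Suggest a heap size in MB: half of the given total memory (in kB).
half_ram_mb() {
    echo $(( $1 / 1024 / 2 ))
}

# On Linux, print the line to put into the wrapper's elasticsearch.conf.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    if [ -n "$total_kb" ]; then
        echo "set.default.ES_HEAP_SIZE=$(half_ram_mb "$total_kb")"
    fi
fi
```

For a machine with 3840 MB of RAM (3932160 kB) this suggests a heap of 1920 MB. Treat the result as a starting point and adjust it to your document count.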

Restart the service.

Wednesday, October 2, 2013

ElasticSearch - Indexing via Java API

There are many ways to populate the ElasticSearch data store with your data. The most basic way is to use the REST API via PUT or POST requests.

In this tutorial, we will be populating via the Java API. I have data in MySQL and my Web application is based on Spring.

Here's my setup:

Ubuntu 12.04 Amazon EC2
JDK 1.7
Spring 3.2
MySQL

Install ElasticSearch (ES) via Maven

Put the following into your pom.xml file.

Make sure you also installed the same version of ES on your server. Read How to Install ElasticSearch on EC2.

Let's create a search service called ElasticSearchService:

Interface:

Implementation:

We will be using ElasticSearch's native Java API. We will connect to the ElasticSearch cluster using the Client object. Using the XContentBuilder, we can construct a JSON representation of the category objects. The category data is stored in MySQL and retrieved by the categoryDao object. Finally, an HTTP GET request to the application will trigger the call that puts the data into the ES cluster.

Let's create the interface through which you can invoke the call.

Interface:

Implementation:

Monday, September 30, 2013

ElasticSearch Query - how to insert and retrieve search data

ElasticSearch uses HTTP Methods (ex. GET, POST, PUT, DELETE) to retrieve, save, and delete search data from its index.

For simplicity, we will use curl to demonstrate some usages. If you haven't done so already, start ElasticSearch in your terminal.

Adding a document

We will send an HTTP POST request to add the subject "sports" to an index. The request will have the following form:

curl -XPOST "http://localhost:9200/{index}/{type}/{id}" -d '{"key0": "value0", ... , "keyX": "valueX"}'

Example:

curl -XPOST "http://localhost:9200/subjects/subject/1" -d '{"name": "sports", "creator": {"first_name":"John", "last_name":"Smith"}}'

Retrieving the document

We can get back the document by sending a GET request.

curl -X GET "http://localhost:9200/subjects/_search?q=sports"

We can also use a POST request to query the above.

curl -X POST "http://localhost:9200/subjects/_search" -d '{
"query": {"term":{"name":"sports"}}
}'
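
If you send many term queries from scripts, a tiny helper keeps the quoting manageable. This is just a sketch with a function name of my own; it inserts the field and value as-is, so it is only safe for values without double quotes or backslashes.

```shell
#!/bin/sh
# Build the JSON body for a term query on one field.
# Values are inserted as-is: no escaping of quotes or backslashes is done.
term_query() {
    printf '{"query":{"term":{"%s":"%s"}}}\n' "$1" "$2"
}

term_query name sports
# prints {"query":{"term":{"name":"sports"}}}
```

It can be combined with the request above as curl -X POST "http://localhost:9200/subjects/_search" -d "$(term_query name sports)". Keep in mind that a term query is not analyzed: with the default analyzer the indexed terms are lowercased, so searching for "Sports" with a capital S would not match.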

Both of the above will give you the following:

{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.30685282,"hits":[{"_index":"subjects","_type":"subject","_id":"1","_score":0.30685282, "_source" : {"name": "sports"}}]}}

The _source field above holds the original document that matched the query.

To search based on the nested properties (Ex. first_name, last_name), we can do the following:

curl -XGET "http://localhost:9200/subjects/_search?q=subject.creator.first_name:John"
curl -XGET "http://localhost:9200/subjects/subject/_search?q=creator.first_name:John"
curl -XGET "http://localhost:9200/subjects/subject/_search?q=subject.creator.first_name:John"

All the above queries will return the same results.

Deleting the document

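Similarly, we can delete the whole subjects index, and every document in it, with a DELETE request. To delete a single document instead, send the DELETE request to the document's own URL (for example /subjects/subject/1).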

curl -X DELETE "http://localhost:9200/subjects"

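Creating an Index with settings and mappings

If you want to adjust settings like the number of shards and replicas, you may find the following useful. More shards let indexing spread across more nodes, and more replicas let more copies serve searches, at the cost of extra storage and indexing work. The full_text analyzer used in the mapping is not built in, so it is defined in the settings here as a custom analyzer (standard tokenizer plus lowercase filter); that definition is an assumption, so adjust it to your needs.

curl -X PUT "http://localhost:9200/subjects" -d '
{
"settings": {"index": {"number_of_shards": 3, "number_of_replicas": 2,
"analysis": {"analyzer": {"full_text": {"type": "custom", "tokenizer": "standard", "filter": ["lowercase"]}}}}},
"mappings": {"document": {
"properties": {
"name": {"type": "string", "analyzer": "full_text"}
}
}
}
}'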

The above created an index called subjects. Each document in the index has a property called name.
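
To get a feel for what the shard and replica numbers mean for the cluster, it helps to count the shard copies they produce: each primary shard has one copy of its own plus one per replica. A quick sketch of that arithmetic (the function name is my own):

```shell
#!/bin/sh
# Total shard copies in an index = primaries * (1 + replicas per primary).
total_shard_copies() {
    echo $(( $1 * (1 + $2) ))
}

total_shard_copies 3 2
# prints 9
```

So 3 shards with 2 replicas means 9 shard copies to store and keep in sync. ES does not place a replica on the same node as its primary, so you need at least 3 nodes before all of these copies can be allocated.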

Checking the Mapping

curl -X GET "http://localhost:9200/subjects/_mapping?pretty=true"

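You should see the mapping for the document type nested under subjects. An empty body like the following means no mapping has been applied to the index:

{
"subjects" : { }
}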

The pretty parameter above just pretty-prints the JSON result so it is easier to read.

How to Install ElasticSearch on EC2

Search is not easy. There are a lot of things you need to consider.

At the software level,

Can a search query have spelling mistakes?
Should stop words (Ex. a, the) be filtered?
How should a phrase search handle a non-exact phrase?

At the operation level,

Should the search be decoupled from the app machines?
Should the search be distributed? If so, how many shards and replicas should there be?

Doing a quick search would tell you that Apache Lucene is the industry standard. There are two popular abstractions on top of Lucene: Solr and ElasticSearch (ES).

There are a lot of debates on which one should be used. I chose ES because

it's distributed by design
it's easier to integrate with AWS EC2

This post covers how to install ElasticSearch on your Linux machine (I like to use the Ubuntu 12.04 build from EC2).

Download ElasticSearch from elasticsearch.org. Extract the files and put them into a folder of your choice (Ex. /opt/tools).

cd /opt/tools
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.5.zip
unzip elasticsearch-0.90.5.zip

You can start elasticsearch by:

bin/elasticsearch -f

You may want to tweak the Xmx (the maximum heap size for the JVM) and Xms (the initial heap size for the JVM) values.

bin/elasticsearch -f -Xmx2g -Xms2g -Des.index.storage.type=memory -Des.max-open-files=true

You can also run it as a service using the script located in bin/service.

After you started your service, visit "http://localhost:9200" in the browser. You should see the following:

{
"ok" : true,
"status" : 200,
"name" : "Solitaire",
"version" : {
"number" : "0.90.5",
"build_hash" : "c8714e8e0620b62638f660f6144831792b9dedee",
"build_timestamp" : "2013-09-17T12:50:20Z",
"build_snapshot" : false,
"lucene_version" : "4.4"
},
"tagline" : "You Know, for Search"
}