Monday, September 30, 2013

ElasticSearch Query - how to insert and retrieve search data

ElasticSearch exposes a REST API over HTTP, so you retrieve, save, and delete documents in its index using standard HTTP methods (e.g. GET, POST, PUT, DELETE).

For simplicity, we will use curl to demonstrate some usages. If you haven't done so already, start ElasticSearch in your terminal.
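For example, assuming ElasticSearch has been unpacked into the current directory and is listening on the default port 9200 (see the installation post further down for details), you can start it in the foreground and check that it responds:
bin/elasticsearch -f
curl -X GET "http://localhost:9200"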


Adding a document

We will send an HTTP POST request to add the subject "sports" to an index. The request has the following form:
curl -XPOST "http://localhost:9200/{index}/{type}/{id}" -d '{"key0":  "value0", ... , "keyX": "valueX"}'
Example:
curl -XPOST "http://localhost:9200/subjects/subject/1" -d '{"name":  "sports",  "creator": {"first_name":"John", "last_name":"Smith"}}'

Retrieving the document

We can get back the document by sending a GET request.
curl -X GET "http://localhost:9200/subjects/_search?q=sports"
The search above can also be written as a POST request with a JSON query body.
curl -X POST "http://localhost:9200/subjects/_search" -d '{
"query": {"term":{"name":"sports"}}
}'
Both of the above will give you the following:
{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.30685282,"hits":[{"_index":"subjects","_type":"subject","_id":"1","_score":0.30685282, "_source" : {"name":  "sports"}}]}}
The _source field above holds the original document that matched the query.

To search based on nested properties (e.g. first_name, last_name), we can do any of the following:
curl -XGET "http://localhost:9200/subjects/_search?q=subject.creator.first_name:John"
curl -XGET "http://localhost:9200/subjects/subject/_search?q=creator.first_name:John"
curl -XGET "http://localhost:9200/subjects/subject/_search?q=subject.creator.first_name:John" 
All the above queries will return the same results.
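The same nested-field search can also be written with a JSON query body; here is a small sketch using a match query (unlike the term query used earlier, a match query analyzes its input, so mixed-case "John" still matches the indexed, lowercased token):
curl -XPOST "http://localhost:9200/subjects/_search" -d '{
"query": {"match": {"creator.first_name": "John"}}
}'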


Deleting the document

Similarly, we can delete data with a DELETE request. The following removes the entire subjects index, including every document in it:
curl -X DELETE "http://localhost:9200/subjects"
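To delete only the single document indexed earlier instead of the whole index, address it by type and id:
curl -X DELETE "http://localhost:9200/subjects/subject/1"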

Creating an index with settings and mappings

If you want to control settings such as the number of shards and replicas, the following may be useful. Roughly speaking, more shards spread indexing across more machines, which can improve indexing throughput, while more replicas give searches more copies to read from, which can improve search throughput and resilience (at the cost of extra storage and indexing work).
curl -X PUT "http://localhost:9200/subjects" -d '
{"settings":{"index":{"number_of_shards":3, "number_of_replicas":2}}},
{"mappings":{"document": {
                             "properties": {
                                 "name" : {"type":string, "analyzer":"full_text"}
                             }
                         }
                       }
}'
The above creates an index called subjects with a mapping for the subject type, in which each document has a property called name.
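You can check the shard and replica settings with the index settings endpoint:
curl -X GET "http://localhost:9200/subjects/_settings?pretty=true"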


Checking the Mapping
curl -X GET "http://localhost:9200/subjects/_mapping?pretty=true"
If the index has no mappings defined, you will see an empty object:
{
  "subjects" : { }
}
Once the mapping above has been applied, the subject type and its name property are listed under subjects.
The pretty parameter above just formats the JSON result in a human readable format.

How to Install ElasticSearch on EC2

Search is not easy. There are a lot of things you need to consider.

At the software level:

Can a search query have spelling mistakes?
Should stop words (Ex. a, the) be filtered?
How should a phrase search behave when the phrase does not match exactly?

At the operations level:

Should the search be decoupled from the app machines?
Should the search be distributed? If so, how many shards and replicas should there be?

Doing a quick search would tell you that Apache Lucene is the industry standard. There are two popular abstractions on top of Lucene: Solr and ElasticSearch (ES).

There is plenty of debate about which one to use. I chose ES because
  • it is distributed by design
  • it is easier to integrate with AWS EC2

This post walks through how to install ElasticSearch on a Linux machine (I like to use the Ubuntu 12.04 AMI on EC2).

Download ElasticSearch from elasticsearch.org, extract the files, and put them into a folder of your choice (e.g. /opt/tools).
cd /opt/tools
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.5.zip
unzip elasticsearch-0.90.5.zip
You can start ElasticSearch in the foreground (-f) with:
bin/elasticsearch -f
You may want to tweak the Xmx (the maximum JVM heap size) and Xms (the initial JVM heap size) values.
bin/elasticsearch -f -Xmx2g -Xms2g -Des.index.storage.type=memory -Des.max-open-files=true
You can also run it as a service using the script located in bin/service.
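For example, with the service wrapper script in place, the service can typically be installed and controlled like this (a sketch; the exact commands depend on the wrapper version you have):
bin/service/elasticsearch install
bin/service/elasticsearch start
bin/service/elasticsearch stop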

After you have started the service, visit "http://localhost:9200" in a browser. You should see something like the following (the node name is assigned randomly):

{
  "ok" : true,
  "status" : 200,
  "name" : "Solitaire",
  "version" : {
    "number" : "0.90.5",
    "build_hash" : "c8714e8e0620b62638f660f6144831792b9dedee",
    "build_timestamp" : "2013-09-17T12:50:20Z",
    "build_snapshot" : false,
    "lucene_version" : "4.4"
  },
  "tagline" : "You Know, for Search"
}

Thursday, September 26, 2013

Java: reading and writing a file line by line

We will use BufferedReader to read a structured file line by line and BufferedWriter to write the output.

The example reads some structured data, creates a MySQL INSERT statement for each record, and writes the statements out to a file, as sketched below.
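Since the original code is not shown here, the following is a minimal sketch of that approach. The file names (input.txt, inserts.sql), the table name (subjects), and the assumption that each input line is comma-separated as name,first_name,last_name are mine, not the author's.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class InsertStatementGenerator {
    public static void main(String[] args) throws IOException {
        // Assumed input format: each line is "name,first_name,last_name".
        BufferedReader reader = new BufferedReader(new FileReader("input.txt"));
        BufferedWriter writer = new BufferedWriter(new FileWriter("inserts.sql"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {   // read one line at a time
                if (line.trim().isEmpty()) {
                    continue;                              // skip blank lines
                }
                String[] fields = line.split(",");
                // Build one INSERT statement per record (real code should escape quotes).
                String insert = String.format(
                        "INSERT INTO subjects (name, first_name, last_name) VALUES ('%s', '%s', '%s');",
                        fields[0].trim(), fields[1].trim(), fields[2].trim());
                writer.write(insert);
                writer.newLine();                          // one statement per output line
            }
        } finally {
            reader.close();
            writer.close();
        }
    }
}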