Thursday, October 3, 2013

ElasticSearch - Defining the Mapping Schema

The previous posts demonstrated how easy it is to index some words and retrieve them via the REST or Java API.  However, we never really talked about how to tweak the searches to fit our needs.

Consider a subject object with two properties like the following:
{
  "name":"The Old & New British English",
  "code":12345
}
Say we have a list of subjects like the above and we want to index and search subjects with the following requirement:
  1. search by exact subject name
  2. search with stop words removed, accent characters conversion
  3. search with some spelling mistakes allowed
  4. search with some words skipped
  5. search by exact code
Without specifying the mapping, ElasticSearch (ES) will use the standard analyzer.

Before we define the ES schema, let's get familiar with the following terms.

A mapping defines how properties (Ex. "name" and "code" properties above) are indexed and searched through analyzers and tokenizers.

An analyzer is a tokenizer followed by a group of token filters, executed in order.
Reference: Analyzers

A filter is a function that transforms data (lowercase, stop-word removal, phonetics).
Reference: Token Filters

When we index or search for the phrase "The Old & New British English", an analyzer first breaks the phrase into words (tokens) using a tokenizer. Each token is then passed through a chain of token filters.  For example, the lowercase token filter normalizes incoming tokens to lower case.
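As a rough sketch (plain Python here, not ElasticSearch's actual implementation), the tokenize-then-filter pipeline for that phrase looks like this:

```python
import re

def standard_tokenizer(text):
    # Split on runs of non-alphanumeric characters, dropping symbols
    # like "&" -- roughly what the standard tokenizer does.
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def lowercase_filter(tokens):
    # The lowercase token filter normalizes every token to lower case.
    return [t.lower() for t in tokens]

tokens = lowercase_filter(standard_tokenizer("The Old & New British English"))
print(tokens)  # ['the', 'old', 'new', 'british', 'english']
```

The "&" disappears at the tokenizer stage; the case is normalized by the filter stage.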

For a better understanding of analyzers, refer to this post.

The following defines a simple mapping with index=subjects, type=subject, and two properties (name, code).

curl -X PUT "http://localhost:9200/subjects" -d '
{
  "mappings":{
    "subject":{
      "properties":{
        "name":{
          "type":"string"
        },
        "code":{
          "type":"string"
        }
      }
    }
  }
}'


1) Search by exact subject name

This is very easy. We will make the "name" field indexed but not analyzed, so the whole phrase is stored as a single term.

"subject":{
  "properties":{
    "name":{
      "type":"string",
      "index":"not_analyzed"
    }
  }
}

Let's populate the index.

curl -XPUT http://localhost:9200/subjects/subject/1 -d '
{
  "name":"The Old & New British English",
  "code":12345
}'

Try to do a search on the phrase "The Old & New British English"

curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
    "query" : {
        "match" : { "name": "The Old & New British English" }
    }
}'

Now try searching with "the Old & New British English" or "Old & New British English" -- neither returns a hit. This is not very helpful, since most people won't search with exact case or the exact phrase.

Let's delete this mapping.

curl -X DELETE "http://localhost:9200/subjects"


2) Search with stop words removed, accent characters conversion

Let's use a new custom analyzer called "full_name".

curl -X PUT "http://localhost:9200/subjects" -d '
{
  "mappings":{
      "subject":{
          "properties":{
            "name":{
              "type":"string",
              "analyzer":"full_name"
            }
          }
      }
  }
}'

To customize how searches work, we need to tweak the analyzer settings.  The general form of the settings is as follows:

"settings":{
    "analysis":{
        "filter":{
        },
        "analyzer":{
            "full_name":{
                "filter":[
                ],
                "type":"custom",
                "tokenizer":"standard"
            }
        }
    }
}

We want "name" to be searchable with stop words removed and accent characters normalized (so that an accented 'é' can be matched by a plain 'e').

"settings":{
    "analysis":{
        "filter":{
        },
        "analyzer":{
            "full_name":{
                "filter":[
                    "standard",
                    "lowercase",
                    "asciifolding"
                ],
                "type":"custom",
                "tokenizer":"standard"
            }
        }
    }
}

The lowercase filter normalizes token text to lower case. Since an analyzer is used both at index time and at search time, the lowercase filter gives us case-insensitive searches.
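The effect of the lowercase and asciifolding filters can be approximated in a few lines of Python (a sketch only; ES implements asciifolding with an explicit character table, while this uses Unicode decomposition):

```python
import unicodedata

def lowercase(token):
    # Equivalent in spirit to the lowercase token filter.
    return token.lower()

def ascii_fold(token):
    # Decompose accented characters (é -> e + combining accent) and
    # drop the combining marks, similar to the asciifolding filter.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(ascii_fold(lowercase("Café")))  # cafe
```

After both filters, "Café" and "cafe" index to the same term, which is why accented and unaccented searches find each other.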

Let's apply the schema to the ES cluster:

curl -X PUT "http://localhost:9200/subjects" -d '
{
  "mappings":{
      "subject":{
          "properties":{
            "name":{
              "type":"string",
              "analyzer":"full_name"
            }
          }
      }
  },
  "settings":{
    "analysis":{
      "analyzer":{
        "full_name":{
          "filter":[
            "standard",
            "lowercase",
            "asciifolding"
          ],
          "type":"custom",
          "tokenizer":"standard"
        }
      }
    }
  }
}'

Populate ES with "The Old & New British English".

Search for the following:
  • "The Old & New British English"
  • "old & new british english"
  • "british english"
  • "british hello english"
  • "engliah"

All of the above, except the last one, should return the result.


3) Search with some spelling mistakes allowed

To make a search for "engliah" work, we need the edgeNGram token filter.  edgeNGram takes two parameters: "min_gram" and "max_gram".

For the term "apple" with min_gram=3, max_gram=5, ES will index it with:
  • app
  • appl
  • apple
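The prefix expansion is simple enough to sketch in Python (hypothetical helper, not ES code):

```python
def edge_ngrams(token, min_gram, max_gram):
    # Emit prefixes of the token from min_gram up to max_gram characters,
    # mirroring what the edgeNGram token filter produces.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print(edge_ngrams("apple", 3, 5))  # ['app', 'appl', 'apple']
```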
Let's try this.

curl -X PUT "http://localhost:9200/subjects" -d '
{
  "mappings":{
      "subject":{
          "properties":{
            "name":{
              "type":"string",
              "analyzer":"partial_name"
            }
          }
      }
  },
  "settings":{
    "analysis":{
      "filter":{
        "name_ngrams": {
          "max_gram":10,
          "min_gram":2,
          "type": "edgeNGram"
        }
      },
      "analyzer":{
        "partial_name":{
          "filter":[
            "standard",
            "lowercase",
            "asciifolding",
            "name_ngrams"
          ],
          "type":"custom",
          "tokenizer":"standard"
        }
      }
    }
  }
}'

Use _analyze to check how the phrase will be indexed.

curl -X GET "http://localhost:9200/subjects/_analyze?analyzer=partial_name&pretty=true" -d 'The Old & New British English'

Try searching for the term "engliah".  You should see the result show up.
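Why does the misspelling match? With this mapping the same "partial_name" analyzer runs at both index time and search time, so both "english" (indexed) and "engliah" (queried) are expanded into edge n-grams, and the shared prefixes line up. A quick sketch (plain Python, min_gram=2, max_gram=10 as above):

```python
def edge_ngrams(token, min_gram=2, max_gram=10):
    # Prefixes of the token, as the edgeNGram filter would emit them.
    return {token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)}

indexed = edge_ngrams("english")  # grams stored at index time
queried = edge_ngrams("engliah")  # grams produced at search time
print(sorted(indexed & queried))  # ['en', 'eng', 'engl', 'engli']
```

The non-empty overlap is what lets the misspelled query find the document.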


4) Search with some words skipped

This already works with the setup in 3) above: by default the query only needs some of its tokens to match, so skipped words are not a problem.


5) Search by exact code

"subject":{
  "properties":{
    "code":{
      "type":"string",
      "index":"not_analyzed"
    }
  }
}

You can accomplish this with 1) or 2) above.  If case sensitivity matters to you, use 1); otherwise use 2).  I am opting for 1).


Putting all these together

To accommodate the different search behaviors, we need to define "name" as a multi-field.

"subject":{
    "properties":{
        "name":{
            "fields":{
                "name":{
                    "type":"string",
                    "index":"not_analyzed"
                },
                "partial":{
                    "type":"string",
                    "search_analyzer":"full_name",
                    "index_analyzer":"partial_name"
                }
            },
            "type":"multi_field"
        }
    }
}

You can access "name" via "name.name", or simply "name".  This is the default sub-field for "name", and it is used for the exact search.

You can access "partial" via "name.partial".  This is the NGram search (spelling mistakes allowed): we index the words with their NGram variations, but search with the exact term.

For example, consider a search for the term "app" within a data store containing the following:
  • apples
  • appetizer
  • apes

If both the search_analyzer and index_analyzer use "partial_name", all three terms above will be returned.

If the search_analyzer is "full_name" and the index_analyzer is "partial_name", only "apples" and "appetizer" will be returned.  This is the desired behavior.
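A toy inverted index makes the difference concrete (a sketch with a made-up `search` helper, not the ES implementation):

```python
def edge_ngrams(token, min_gram=2, max_gram=10):
    return {token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)}

docs = ["apples", "appetizer", "apes"]

# index_analyzer = partial_name: store every edge n-gram of each term
index = {doc: edge_ngrams(doc) for doc in docs}

def search(term, expand_query):
    # expand_query=True mimics searching with partial_name;
    # expand_query=False mimics searching with full_name (exact token).
    query_tokens = edge_ngrams(term) if expand_query else {term}
    return [doc for doc, grams in index.items() if query_tokens & grams]

print(search("app", expand_query=True))   # all three docs match via "ap"
print(search("app", expand_query=False))  # ['apples', 'appetizer']
```

With the query expanded, even "apes" matches through the shared gram "ap"; with the exact token "app", only the two desired documents match.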

Now putting the mapping all together:

curl -X PUT "http://localhost:9200/subjects" -d '
{
  "mappings":{
      "subject":{
          "properties":{
            "name":{
              "type":"multi_field",
              "fields":{
                  "name":{
                      "type":"string",
                      "analyzer":"full_name"
                  },
                  "partial":{
                      "type":"string",
                      "search_analyzer":"full_name",
                      "index_analyzer":"partial_name"
                  }
              }
            },
            "code":{
                "type":"string",
                "index":"not_analyzed"
            }
          }
      }
  },
  "settings":{
    "analysis":{
      "filter":{
        "name_ngrams": {
          "max_gram":10,
          "min_gram":2,
          "type": "edgeNGram"
        }
      },
      "analyzer":{
        "full_name":{
          "filter":[
            "standard",
            "lowercase",
            "asciifolding"
          ],
          "type":"custom",
          "tokenizer":"standard"
        },
        "partial_name":{
          "filter":[
            "standard",
            "lowercase",
            "asciifolding",
            "name_ngrams"
          ],
          "type":"custom",
          "tokenizer":"standard"
        }
      }
    }
  }
}'

Wednesday, October 2, 2013

ElasticSearch - Indexing via Java API

There are many ways to populate your data into the ElasticSearch data store. The most primitive way is to send PUT or POST requests to the REST API.

In this tutorial, we will be populating via the Java API. I have data in MySQL and my Web application is based on Spring.

Here's my setup:

  • Ubuntu 12.04 Amazon EC2
  • JDK 1.7
  • Spring 3.2
  • MySQL


Install ElasticSearch (ES) via Maven

Put the following into your pom.xml file.

Make sure you have also installed the same version of ES on your server. Read How to Install ElasticSearch on EC2.

Let's create a search service called ElasticSearchService:

Interface:



Implementation:

We will be using ElasticSearch's native Java API. We will connect to the ElasticSearch cluster using the Client object. Using the XContentBuilder, we can construct a JSON representation of the category objects. The category data is stored in MySQL and retrieved by the categoryDao object. Finally, an index request will put the data into the ES cluster.



Let's create the interface through which you can invoke the call.

Interface:



Implementation:

Monday, September 30, 2013

ElasticSearch Query - how to insert and retrieve search data

ElasticSearch uses HTTP Methods (ex. GET, POST, PUT, DELETE) to retrieve, save, and delete search data from its index.

For simplicity, we will use curl to demonstrate some usages. If you haven't done so already, start ElasticSearch in your terminal.


Adding a document

We will send a HTTP POST request to add the subject "sports" to an index. The request will have the following form:
curl -XPOST "http://localhost:9200/{index}/{type}/{id}" -d '{"key0":  "value0", ... , "keyX": "valueX"}'
Example:
curl -XPOST "http://localhost:9200/subjects/subject/1" -d '{"name":  "sports",  "creator": {"first_name":"John", "last_name":"Smith"}}'

Retrieving the document

We can get back the document by sending a GET request.
curl -X GET "http://localhost:9200/subjects/_search?q=sports"
We can also use a POST request to query the above.
curl -X POST "http://localhost:9200/subjects/_search" -d '{
"query": {"term":{"name":"sports"}}
}'
Both of the above will give you the following:
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "subjects",
        "_type": "subject",
        "_id": "1",
        "_score": 0.30685282,
        "_source": { "name": "sports" }
      }
    ]
  }
}
The _source field above holds the results for the query.

To search based on the nested properties (Ex. first_name, last_name), we can do the following:
curl -XGET "http://localhost:9200/subjects/_search?q=subject.creator.first_name:John"
curl -XGET "http://localhost:9200/subjects/subject/_search?q=creator.first_name:John"
curl -XGET "http://localhost:9200/subjects/subject/_search?q=subject.creator.first_name:John" 
All the above queries will return the same results.


Deleting the document

Similarly, we can delete the subjects index with a DELETE request.
curl -X DELETE "http://localhost:9200/subjects"

Creating an index with settings and mappings

If you want to adjust settings like the number of shards and replicas, you may find the following useful. Roughly speaking, more shards improve indexing performance, while more replicas improve search performance.
curl -X PUT "http://localhost:9200/subjects" -d '
{
  "settings":{
    "index":{
      "number_of_shards":3,
      "number_of_replicas":2
    }
  },
  "mappings":{
    "document":{
      "properties":{
        "name":{ "type":"string", "analyzer":"full_text" }
      }
    }
  }
}'
The above creates an index called subjects. Each document in the index has a property called name.


Checking the Mapping
curl -X GET "http://localhost:9200/subjects/_mapping?pretty=true"
You should see
{
  "subjects" : { }
}
The pretty parameter above just formats the JSON result in a human readable format.

How to Install ElasticSearch on EC2

Search is not easy. There are a lot of things you need to consider.

At the software level:

Can a search query have spelling mistakes?
Should stop words (Ex. a, the) be filtered?
What about phrase searches when the phrase is not exact?

At the operations level:

Should search be decoupled from the app machines?
Should search be distributed? If so, how many shards and replicas should there be?

Doing a quick search would tell you that Apache Lucene is the industry standard. There are two popular abstractions on top of Lucene: Solr and ElasticSearch (ES).

There are a lot of debates on which one should be used. I chose ES because
  • it's distributed by design
  • it's easier to integrate with AWS EC2

The following post will talk about how you can install ElasticSearch on your Linux machine (I like to use the Ubuntu 12.04 build on EC2).

Download elasticsearch from elasticsearch.org. Extract the files and put them into a folder of your choice (Ex. /opt/tools).
cd /opt/tools
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.5.zip
unzip elasticsearch-0.90.5.zip
You can start elasticsearch by:
bin/elasticsearch -f
You may want to tweak the Xmx (maximum heap size for the JVM) and Xms (initial heap size for the JVM) values.
bin/elasticsearch -f -Xmx2g -Xms2g -Des.index.storage.type=memory -Des.max-open-files=true
You can also run it as a service using the script located in bin/service.

After you start the service, visit "http://localhost:9200" in the browser. You should see the following:

{
  "ok" : true,
  "status" : 200,
  "name" : "Solitaire",
  "version" : {
    "number" : "0.90.5",
    "build_hash" : "c8714e8e0620b62638f660f6144831792b9dedee",
    "build_timestamp" : "2013-09-17T12:50:20Z",
    "build_snapshot" : false,
    "lucene_version" : "4.4"
  },
  "tagline" : "You Know, for Search"
}

Thursday, September 26, 2013

Java reading and writing file line by line

We will be using BufferedReader to read a structured file line by line and then BufferedWriter to write it back out.

The example takes some structured data, creates MySQL insert statements for each dataset, and then outputs them as a file.

Sunday, August 25, 2013

Uninstall NodeJS from MacOSX

Open your terminal.

Find where nodejs is installed by:
which node
In my case, it's in /usr/local/bin/node

Go to the folder that contains /bin/node
cd /usr/local
Remove all node-related files
sudo rm -rf bin/node bin/node-waf include/node lib/node lib/pkgconfig/nodejs.pc share/man/man1/node.1

Sunday, August 18, 2013

FireFox OS Tutorial - Creating a Percent Calculator App

In this post, I will demonstrate how to build a Firefox OS app. From start to finish, it took around half a day, but most of that time was spent on non-coding tasks like taking screenshots of the app and making the icons.

For the purpose of this post, we will build something very simple - the Percent Calculator.

Here are the tools and frameworks I used:


Percent Calculator

Here are some screenshots of the app:





Install the Firefox OS Simulator

Before we begin, be sure you have the latest version of the Firefox browser.

Download the Firefox OS Simulator as an add-on.

In the Firefox browser, click on the Firefox menu -> Web Developer -> Firefox OS Simulator.

This is your Firefox dashboard.


Toggle the Simulator button to "Running" as shown above.


Congratulations! You now have the simulator running. Play around with it to get a feel of how it works.

Creating the App Source Structure

Create the folder structure like the following:

root
->css
  ->app.css
->images
->js
  ->app.js
index.html
manifest.webapp

Download the minified versions of jquery and jquery mobile and put them in the js folder above. You may also want to roll out your own jquery mobile theme.

After you are done, add the links to the head section of the index.html


    <link rel="stylesheet" href="css/app.css">
    <link rel="stylesheet" href="css/mytheme.min.css" />
    <link rel="stylesheet" href="css/jquery.mobile.structure-1.3.2.min.css" />
    <script src="js/jquery-1.9.1.min.js"></script>
    <script src="js/jquery.mobile-1.3.2.min.js"></script>
    <script src="js/app.js"></script>

app.css will hold all the styles while app.js will hold all the logic. Note that it is very important to place all JavaScript code in files outside of index.html due to the Content Security Policy (CSP).

Here's the code for index.html so far:



css/borderless.min.css is the css file I created using the jQuery ThemeRoller.

The Manifest File

manifest.webapp defines the app's metadata. You can specify version, description, icons, developer, permissions, language, etc.

Here's the sample manifest file:



You will want to bookmark the permission page.

Coding the App

If you know HTML, CSS and JavaScript, you should have no problem with this part. If you do not know anything about it, click here.

There will be three files you will be constantly working with:

index.html - holds your page markup
css/app.css - all your stylings
js/app.js - all the app logic

Here is the source code for the app (everything should be self-explanatory).

index.html


css/app.css



js/app.js



Creating the icons

Download the PSD file (Icon circle) at the bottom of this page. Open it in Photoshop and create a logo. You will want sizes of 30x30 and 60x60.

Specify these in the manifest.

Publish to the Market

When you are ready, zip everything inside the root folder. Log in to the Firefox Marketplace.

Test your zip file by uploading it to the app validator. Select App Type as Packaged.

You will probably see a bunch of CSP warnings. It is okay as long as the app is not a privileged or certified app.

When you are ready, submit it to the market. You will need to write a privacy policy as well.