Thursday, October 3, 2013

ElasticSearch - Defining the Mapping Schema

The previous posts demonstrated how easy it is to index some words and retrieve them via the REST or Java API.  However, we never really talked about how to tweak the searches to fit our needs.

Consider a subject object with two properties like the following:
{
  "name":"The Old & New British English",
  "code":12345
}
Say we have a list of subjects like the above, and we want to index and search subjects with the following requirements:
  1. search by exact subject name
  2. search with stop words removed and accent characters normalized
  3. search with some spelling mistakes allowed
  4. search with some words skipped
  5. search by exact code
Without specifying the mapping, ElasticSearch (ES) will use the standard analyzer.

Before we define the ES schema, let's get familiar with the following terms.

A mapping defines how properties (Ex. "name" and "code" properties above) are indexed and searched through analyzers and tokenizers.

An analyzer is a tokenizer plus a group of filters, executed in order.
Reference: Analyzers

A filter is a function that transforms tokens (e.g. lowercasing, stop-word removal, phonetic encoding).
Reference: Token Filters

When we index or search the phrase "The Old & New British English", an analyzer breaks the phrase into words through its tokenizer. Each word/token is then passed through the chain of token filters.  For example, a lowercase token filter normalizes incoming tokens to lower case.

For another explanation of analyzers, refer to this post.
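
As a quick check, the _analyze API shows the tokens an analyzer produces for a given piece of text (this assumes your ES version exposes the cluster-level _analyze endpoint; otherwise run it against an existing index as shown later):

curl -X GET "http://localhost:9200/_analyze?analyzer=standard&pretty=true" -d 'The Old & New British English'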

The following defines a simple mapping with index=subjects, type=subject, and two properties (name, code).

curl -X PUT "http://localhost:9200/subjects" -d '
{
  "mappings":{
      "subject":{
          "properties":{
            "name":{
              "type":"string"
            },
            "code":{
              "type":"string"
            }
          }
      }
  }
}'
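
To double-check what ES created, you can read the mapping back with the mapping API:

curl -X GET "http://localhost:9200/subjects/_mapping?pretty=true"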


1) Search by exact subject name

This is easy. We will make the "name" field not analyzed, so the whole phrase is indexed as a single term.

"subject":{
          "properties":{
            "name":{
              "type":"string",
              "index":"not_analyzed"
            }
          }
}

Let's populate the index.

curl -XPUT http://localhost:9200/subjects/subject/1 -d '
{
  "name":"The Old & New British English",
  "code":12345
}'

Try searching for the phrase "The Old & New British English":

curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
    "query" : {
        "text" : { "name": "The Old & New British English" }
    }
}'

Now try searching for "the Old & New British English" or "Old & New British English" - neither returns a hit. This is not very helpful, since most people won't type the exact case or the complete phrase.
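
If you really do want the case-sensitive, whole-phrase behavior, a term query against the not_analyzed field matches the stored value exactly (the query string is not analyzed at all):

curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
    "query" : {
        "term" : { "name": "The Old & New British English" }
    }
}'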

Let's delete this mapping.

curl -X DELETE "http://localhost:9200/subjects"


2) Search with stop words removed and accent characters normalized

Let's use a new custom analyzer called "full_name".

curl -X PUT "http://localhost:9200/subjects" -d '
{
  "mappings":{
      "subject":{
          "properties":{
            "name":{
              "type":"string",
              "analyzer":"full_name"
            }
          }
      }
  }
}'

To customize the way searches would work, we need to tweak the analyzer settings.  The general form of defining the settings is as follows:

"settings":{
    "analysis":{
        "filter":{
        }
    },
    "analyzer":{
        "full_name":{
            "filter":[
            ],
            "type":"custom",
            "tokenizer":"standard"
        }
    }
}

We want "subject" to be searchable with stop words removed and normalized accent characters (so that the accent e can be searchable by by an 'e').

"settings":{
    "analysis":{
        "filter":{
        }
    },
    "analyzer":{
        "full_name":{
            "filter":[
                "standard",
                "lowercase",
                "asciifolding"
            ],
            "type":"custom",
            "tokenizer":"standard"
        }
    }
}

The lowercase filter normalizes token text to lower case. Since an analyzer is used at both index time and search time, the lowercase filter gives us case-insensitive searches.

Let's push the full mapping and settings to the ES cluster (delete the "subjects" index first if you already created it above):

curl -X PUT "http://localhost:9200/subjects" -d '
{
  "mappings":{
      "subject":{
          "properties":{
            "name":{
              "type":"string",
              analyzer:"full_name"
            }
          }
      }
  },
  "settings":{
    "analysis":{
      "analyzer":{
        "full_name":{
          "filter":[
            "standard",
            "lowercase",
            "asciifolding"
          ],
          "type":"custom",
          "tokenizer":"standard"
        }
      }
    }
  }
}'

Populate ES with "The Old & New British English".

Search for the following:
  • "The Old & New British English"
  • "old & new british english"
  • "british english"
  • "british hello english"
  • "engliah"

All of the above, except the last one, should return the result.
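
For instance, using the same text query as before, "british english" should find the subject even though the case differs and some words are missing:

curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
    "query" : {
        "text" : { "name": "british english" }
    }
}'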


3) Search with some spelling mistakes allowed

To make the search work for "engliah", we need to use the edgeNGram token filter.  edgeNGram takes two parameters: "min_gram" and "max_gram".

For the term "apple" with min_gram=3, max_gram=5, ES will index it with:
  • app
  • appl
  • apple
Let's try this.  Delete the existing "subjects" index first, then create it again with the new mapping:

curl -X PUT "http://localhost:9200/subjects" -d '
{
  "mappings":{
      "subject":{
          "properties":{
            "name":{
              "type":"string",
              "analyzer":"partial_name"
            }
          }
      }
  },
  "settings":{
    "analysis":{
      "filter":{
        "name_ngrams": {
          "max_gram":10,
          "min_gram":2,
          "type": "edgeNGram"
        }
      },
      "analyzer":{
        "partial_name":{
          "filter":[
            "standard",
            "lowercase",
            "asciifolding",
            "name_ngrams"
          ],
          "type":"custom",
          "tokenizer":"standard"
        }
      }
    }
  }
}'

Use _analyze to check how the phrase will be indexed.

curl -X GET "http://localhost:9200/subjects/_analyze?analyzer=partial_name&pretty=true" -d 'The Old & New British English'

Try searching for the term "engliah".  You should see the result show up.
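
Assuming the subject document has been re-indexed into this new index, the search is the same text query as before; the misspelled term still matches because the query term is also broken into NGrams and most of them overlap with the indexed ones:

curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
    "query" : {
        "text" : { "name": "engliah" }
    }
}'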


4) Search with some words skipped

This already works with the mapping from 3) above, because the text query only needs some of the query tokens to match.
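
For example, a query that skips most of the words (and drops the "&") should still return the subject:

curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
    "query" : {
        "text" : { "name": "new english" }
    }
}'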


5) Search by exact code

"subject":{
          "properties":{
            "code":{
              "type":"string"
              "index":"not_analyzed"
            }
         }

You can accomplish this with either 1) or 2) above.  If case sensitivity matters for the exact search, use 1); otherwise use 2).  I am opting for 1) here.
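
With "code" mapped as not_analyzed, a term query matches the stored value exactly (assuming the document above has been indexed):

curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
    "query" : {
        "term" : { "code": "12345" }
    }
}'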


Putting all these together

To accommodate the different search styles, we need to define "name" as a multi-field.

"subject":{
      "properties":{
        "name":{
          "fields":{
            "name":{
              "type":"string",
              "index":"not_analyzed"
            },
            "partial":{
                "type":"string",
                "search_analyzer":"full_name",
                "index_analyzer":"partial_name"
             }
          },
          "type":"multi_field"
        }
      }

You can access "name" by "name.name", or just "name".  This is the default field for "name" and it is defaulted to "full_name" - exact search.

You can access "partial" by "name.partial".  This is the NGram search (spelling mistakes allowed).  We are indexing the words with NGram variations, but using the exact term to search.

For example, consider a search for the term "app" against an index containing the following terms:
apples
appetizer
apes

If both the search_analyzer and index_analyzer are "partial_name", all three terms above will be returned (the query "app" is itself broken into "ap" and "app", and "ap" matches "apes").

If the search_analyzer is "full_name" and the index_analyzer is "partial_name", then only "apples" and "appetizer" will be returned.  This is the desired behavior.
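
To target the NGram field directly, query "name.partial" instead of "name".  For example, "engli" matches because it is one of the indexed edge NGrams of "english":

curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
    "query" : {
        "text" : { "name.partial": "engli" }
    }
}'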

Now, putting the whole mapping together (delete the existing "subjects" index first):

curl -X PUT "http://localhost:9200/subjects" -d '
{
  "mappings":{
      "subject":{
          "properties":{
            "name":{
              "fields":{
                  "name":{
                      "type":"string",
                      "analyzer":"full_name"
                  },
                  "partial":{
                      "type":"string",
                      "search_analyzer":"full_name",
                      "index_analyzer":"partial_name"
                  }
              },
              "type":"multi_field"
            },
            "code":{
                "type":"string",
                "analyzer":"full_name"
            }
          }
      }
  },
  "settings":{
    "analysis":{
      "filter":{
        "name_ngrams": {
          "max_gram":10,
          "min_gram":2,
          "type": "edgeNGram"
        }
      },
      "analyzer":{
        "full_name":{
          "filter":[
            "standard",
            "lowercase",
            "asciifolding"
          ],
          "type":"custom",
          "tokenizer":"standard"
        },
        "partial_name":{
          "filter":[
            "standard",
            "lowercase",
            "asciifolding",
            "name_ngrams"
          ],
          "type":"custom",
          "tokenizer":"standard"
        }
      }
    }
  }
}'
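
To sanity-check the combined mapping, re-index the subject and try both the partial search and the exact code search (a rough check; adjust the terms to your own data):

curl -XPUT http://localhost:9200/subjects/subject/1 -d '
{
  "name":"The Old & New British English",
  "code":12345
}'

curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
    "query" : {
        "text" : { "name.partial": "brit" }
    }
}'

curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
    "query" : {
        "term" : { "code": "12345" }
    }
}'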
