Consider a subject object with two properties like the following:
{
"name":"The Old & New British English",
"code":12345
}
Say we have a list of subjects like the above and we want to index and search subjects with the following requirements:
- search by exact subject name
- search with stop words removed and accent characters normalized
- search with some spelling mistakes allowed
- search with some words skipped
- search by exact code
Before we define the ES schema, let's get familiar with the following terms.
A mapping defines how properties (e.g. the "name" and "code" properties above) are indexed and searched through analyzers and tokenizers.
An analyzer is a tokenizer plus a group of token filters executed in order.
Reference: Analyzers
A filter is a function that transforms tokens (lowercasing, stop-word removal, phonetic encoding).
Reference: Token Filters
When we index or search the phrase "The Old & New British English", an analyzer first breaks the phrase into words through a tokenizer. Each word/token is then passed through a chain of token filters. For example, a lowercase token filter normalizes incoming tokens to lower case.
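You can watch this happen with the _analyze API (the same technique is used later in this post). A sketch with the built-in standard analyzer, assuming a local cluster on localhost:9200 and an existing "subjects" index:

```shell
# Ask ES how the standard analyzer tokenizes the phrase.
# Assumes a local cluster on localhost:9200 with a "subjects" index.
curl -X GET "http://localhost:9200/subjects/_analyze?analyzer=standard&pretty=true" -d 'The Old & New British English'
```

You should see the "&" dropped and each remaining word emitted as a lowercased token.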
The following defines a simple mapping with index=subjects, type=subject, and two properties (name, code).
curl -X PUT "http://localhost:9200/subjects" -d '
{
"mappings":{
"subject":{
"properties":{
"name":{
"type":"string"
},
"code":{
"type":"string"
}
}
}
}
}'
1) Search by exact subject name
This is straightforward. We will make the "name" field not analyzed, so the whole phrase is indexed as a single exact token.
"subject":{
"properties":{
"name":{
"type":"string"
"index":"not_analyzed"
}
}
Let's populate the index.
curl -XPUT http://localhost:9200/subjects/subject/1 -d '
{
"name":"The Old & New British English",
"code":12345
}'
Try searching for the phrase "The Old & New British English":
curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
"query" : {
"text" : { "name": "The Old & New British English" }
}
}'
Now try to search with "the Old & New British English" or "Old & New British English". Neither returns a hit. This is not very helpful, since most people won't search with exact case or the exact phrase.
Let's delete this mapping.
curl -X DELETE "http://localhost:9200/subjects"
2) Search with stop words removed and accent characters normalized
Let's use a new custom analyzer called "full_name".
curl -X PUT "http://localhost:9200/subjects" -d '
{
"mappings":{
"subject":{
"properties":{
"name":{
"type":"string",
"analyzer":"full_name"
}
}
}
}
}'
To customize the way searches would work, we need to tweak the analyzer settings. The general form of defining the settings is as follows:
"settings":{
"analysis":{
"filter":{
}
},
"analyzer":{
"full_name":{
"filter":[
],
"type":"custom",
"tokenizer":"standard"
}
}
}
We want "subject" to be searchable with stop words removed and normalized accent characters (so that the accent e can be searchable by by an 'e').
"settings":{
"analysis":{
"filter":{
}
},
"analyzer":{
"full_name":{
"filter":[
"standard",
"lowercase",
"asciifolding"
],
"type":"custom",
"tokenizer":"standard"
}
}
}
The lowercase filter normalizes token text to lower case. Since an analyzer is used both at index time and at search time, the lowercase filter gives us case-insensitive searches.
Let's push the mapping to the ES cluster:
curl -X PUT "http://localhost:9200/subjects" -d '
{
"mappings":{
"subject":{
"properties":{
"name":{
"type":"string",
analyzer:"full_name"
}
}
}
},
"settings":{
"analysis":{
"analyzer":{
"full_name":{
"filter":[
"standard",
"lowercase",
"asciifolding"
],
"type":"custom",
"tokenizer":"standard"
}
}
}
}
}'
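Before searching, you can verify the custom analyzer with the _analyze API (the same check is shown again for the NGram analyzer later in this post):

```shell
# Inspect how the full_name analyzer processes the phrase.
# Assumes the "subjects" index defined above exists on localhost:9200.
curl -X GET "http://localhost:9200/subjects/_analyze?analyzer=full_name&pretty=true" -d 'The Old & New British English'
```

You should see lowercased tokens, with the "&" dropped by the standard tokenizer.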
Populate ES with "The Old & New British English".
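The indexing command is the same as in 1):

```shell
# Index the sample subject document with id 1
curl -X PUT "http://localhost:9200/subjects/subject/1" -d '
{
"name":"The Old & New British English",
"code":12345
}'
```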
Search for the following:
- "The Old & New British English"
- "old & new british english"
- "british english"
- "british hello english"
- "engliah"
All of the above, except the last one, should return the result.
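For example, the "british english" search uses the same text query as before:

```shell
# Lowercased partial phrase - should still match thanks to the
# lowercase filter and the OR semantics of the text query
curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
"query" : {
"text" : { "name": "british english" }
}
}'
```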
3) Search with some spelling mistakes allowed
To make the search work for "engliah", we need the edgeNGram token filter. edgeNGram takes two parameters: "min_gram" and "max_gram".
For the term "apple" with min_gram=3, max_gram=5, ES will index it with:
- app
- appl
- apple
curl -X PUT "http://localhost:9200/subjects" -d '
{
"mappings":{
"subject":{
"properties":{
"name":{
"type":"string",
"analyzer":"partial_name"
}
}
}
},
"settings":{
"analysis":{
"filter":{
"name_ngrams": {
"max_gram":10,
"min_gram":2,
"type": "edgeNGram"
}
},
"analyzer":{
"partial_name":{
"filter":[
"standard",
"lowercase",
"asciifolding",
"name_ngrams"
],
"type":"custom",
"tokenizer":"standard"
}
}
}
}
}'
Use _analyze to check how the phrase will be indexed.
curl -X GET "http://localhost:9200/subjects/_analyze?analyzer=partial_name&pretty=true" -d 'The Old & New British English'
Try to search for the term "engliah". You should see the result showing up.
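The search command is the same text query shape as before, e.g.:

```shell
# Misspelled term - matches via shared edge NGrams ("en", "eng", ...)
curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
"query" : {
"text" : { "name": "engliah" }
}
}'
```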
4) Search with some words skipped
This already works with the setup from 3) above: the text query ORs the query terms together, so a document matches even when some of its words are skipped.
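For example, a search that skips "The", "&", "New" and "British" should still match:

```shell
# Most of the words skipped - still matches via OR semantics
curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
"query" : {
"text" : { "name": "old english" }
}
}'
```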
5) Search by exact code
"subject":{
"properties":{
"code":{
"type":"string"
"index":"not_analyzed"
}
}
You can accomplish this with 1) or 2) above. If case sensitivity matters for the exact search, use 1); otherwise use 2). I am opting for 1).
Putting all these together
To accommodate the different search formats, we need to specify "name" as a multi-field.
"subject":{
"properties":{
"name":{
"fields":{
"name":{
"type":"string",
"index":"not_analyzed"
},
"partial":{
"type":"string",
"search_analyzer":"full_name",
"index_analyzer":"partial_name"
}
},
"type":"multi_field"
}
}
}
"subject":{
"properties":{
"name":{
"fields":{
"name":{
"type":"string",
"index":"not_analyzed"
},
"partial":{
"type":"string",
"search_analyzer":"full_name",
"index_analyzer":"partial_name"
}
},
"type":"multi_field"
}
}
You can access "name" by "name.name", or just "name". This is the default field for "name" and it is defaulted to "full_name" - exact search.
You can access "partial" by "name.partial". This is the NGram search (spelling mistakes allowed). We are indexing the words with NGram variations, but using the exact term to search.
For example, consider a search for the term "app" within a data store with the following:
apples
appetizer
apes
If both search_analyzer and index_analyzer are using "partial_name", all three terms above will be returned.
If the search_analyzer is "full_name" and index_analyzer is "partial_name", then only "apples" and "appetizer" will be returned. This is the desired case.
Now putting the mapping all together:
curl -X PUT "http://localhost:9200/subjects" -d '
{
"mappings":{
"subject":{
"properties":{
"name":{
"fields":{
"name":{
"type":"string",
"analyzer":"full_name"
},
"partial":{
"type":"string",
"search_analyzer":"full_name",
"index_analyzer":"partial_name"
}
},
"type":"multi_field"
},
"code":{
"type":"string",
"analyzer":"full_name"
}
}
}
},
"settings":{
"analysis":{
"filter":{
"name_ngrams": {
"max_gram":10,
"min_gram":2,
"type": "edgeNGram"
}
},
"analyzer":{
"full_name":{
"filter":[
"standard",
"lowercase",
"asciifolding"
],
"type":"custom",
"tokenizer":"standard"
},
"partial_name":{
"filter":[
"standard",
"lowercase",
"asciifolding",
"name_ngrams"
],
"type":"custom",
"tokenizer":"standard"
}
}
}
}
}'
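To verify the combined mapping, re-index the sample document and try the different search styles. A sketch (a term query for the exact code, text queries for the name fields):

```shell
# Re-index the sample subject
curl -X PUT "http://localhost:9200/subjects/subject/1" -d '
{
"name":"The Old & New British English",
"code":12345
}'

# Exact code search (term query - the query term is not analyzed)
curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
"query" : { "term" : { "code": "12345" } }
}'

# Spelling-mistake-tolerant name search via the NGram sub-field
curl -X GET "http://localhost:9200/subjects/_search?pretty=true" -d '{
"query" : { "text" : { "name.partial": "engliah" } }
}'
```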