Disclaimer: I'm an ElasticSearch noob. Go easy on me.
I have an application that uses ElasticSearch's more_like_this query to find related content. It basically works like this:
>>> index(index, doc_type, {'id': 1, 'title': 'Your cool title is here'})
>>> index(index, doc_type, {'id': 2, 'title': 'About is a cool headline'})
>>> index(index, doc_type, {'id': 3, 'title': 'Titles are your big thing'})
Then you can pick one ID (1, 2 or 3) and find related ones.
We can tell by looking at these three silly examples that 1 and 2 have the words "is" and "cool" in common. 1 and 3 have "title" (stemming taken into account) and "your" in common. However, is there much value in connecting these documents on the words "is" and "your"? I think not. Those are stop words, i.e. words like "the", "this", "from", "she" etc. Basically, words that are commonly used as "glue" between more unique and specific words.
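To make the point concrete, here is a minimal sketch in plain Python (no ElasticSearch involved) of what two titles have in common with and without stop words. The `STOP_WORDS` set and both helper functions are made up for illustration, and there is no stemming here.

```python
import re

# Hypothetical, hand-picked stop words; real analyzers ship their own lists.
STOP_WORDS = {"is", "a", "your", "about", "are", "the"}


def words(title):
    """Lowercase a title and split it into a set of words (no stemming)."""
    return set(re.findall(r"[a-z]+", title.lower()))


def shared_words(a, b, stop_words=frozenset()):
    """Return the words two titles have in common, minus any stop words."""
    return (words(a) & words(b)) - set(stop_words)


t1 = "Your cool title is here"
t2 = "About is a cool headline"

print(sorted(shared_words(t1, t2)))              # ['cool', 'is']
print(sorted(shared_words(t1, t2, STOP_WORDS)))  # ['cool']
```

Without the stop words, "is" counts as a connection between the two titles. With them, only "cool" is left.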
Anyway, if you index something in ElasticSearch as a text field you get, by default, the "standard" analyzer to analyze the incoming text to be indexed. The standard analyzer basically just splits the text into words. A more compelling analyzer is the Snowball analyzer (original here) which supports intelligent stemming (so that, say, "title" and "titles" are treated as the same word) and stop words.
The problem is that the Snowball analyzer's set of stop words is very different from what we expected. We did some digging and thought this was the list it bases its English stop words on, but that was wrong. Note that that list does include words like "your" and "about".
The way to find out how your analyzer treats a string and turns it into tokens is to use the _analyze tool. For example:
$ curl -XGET 'localhost:9200/{myindexname}/_analyze?analyzer=snowball' -d 'about your special is a the word' | json_print
{
  "tokens": [
    {
      "end_offset": 5,
      "token": "about",
      "type": "<ALPHANUM>",
      "start_offset": 0,
      "position": 1
    },
    {
      "end_offset": 10,
      "token": "your",
      "type": "<ALPHANUM>",
      "start_offset": 6,
      "position": 2
    },
    {
      "end_offset": 18,
      "token": "special",
      "type": "<ALPHANUM>",
      "start_offset": 11,
      "position": 3
    },
    {
      "end_offset": 32,
      "token": "word",
      "type": "<ALPHANUM>",
      "start_offset": 28,
      "position": 7
    }
  ]
}
So what you can see is that it finds the tokens "about", "your", "special" and "word", but it dropped "is", "a" and "the" as stop words. Hmm... I'm not happy with that. I don't think "about" and "your" are particularly helpful words.
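If you would rather not eyeball the JSON, a small script can list what was kept and what was dropped. This is only a sketch: `kept_tokens` and `dropped_words` are hypothetical helper names, the response is the one shown above trimmed to the fields used, and matching on character offsets assumes the input words are separated by single spaces.

```python
import json

TEXT = "about your special is a the word"

# The _analyze response from above, trimmed to the fields used here.
RESPONSE = json.loads("""
{"tokens": [
  {"token": "about",   "start_offset": 0,  "end_offset": 5,  "position": 1},
  {"token": "your",    "start_offset": 6,  "end_offset": 10, "position": 2},
  {"token": "special", "start_offset": 11, "end_offset": 18, "position": 3},
  {"token": "word",    "start_offset": 28, "end_offset": 32, "position": 7}
]}
""")


def kept_tokens(analyze_response):
    """Tokens the analyzer produced, in order."""
    return [t["token"] for t in analyze_response["tokens"]]


def dropped_words(text, analyze_response):
    """Words of `text` whose character span has no token in the response.

    Comparing offsets instead of strings still works when stemming
    changes how a token is spelled.
    """
    spans = {(t["start_offset"], t["end_offset"])
             for t in analyze_response["tokens"]}
    dropped = []
    offset = 0
    for word in text.split(" "):
        if word and (offset, offset + len(word)) not in spans:
            dropped.append(word)
        offset += len(word) + 1  # +1 for the single space separator
    return dropped


print(kept_tokens(RESPONSE))          # ['about', 'your', 'special', 'word']
print(dropped_words(TEXT, RESPONSE))  # ['is', 'a', 'the']
```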
So, how do you define your own stop words and override the ones in the Snowball analyzer? Well, let me show you.
In code, I use pyelasticsearch so the index creation is done in Python.
STOPWORDS = (
    "a able about across after all almost also am among an and "
    "any are as at be because been but by can cannot could dear "
    "did do does either else ever every for from get got had has "
    "have he her hers him his how however i if in into is it its "
    "just least let like likely may me might most must my "
    "neither no nor not of off often on only or other our own "
    "rather said say says she should since so some than that the "
    "their them then there these they this tis to too twas us "
    "wants was we were what when where which while who whom why "
    "will with would yet you your".split()
)


def create():
    es = get_connection()
    index = get_index()
    es.create_index(index, settings={
        'settings': {
            'analysis': {
                'analyzer': {
                    'extended_snowball_analyzer': {
                        'type': 'snowball',
                        'stopwords': STOPWORDS,
                    },
                },
            },
        },
        'mappings': {
            doc_type: {
                'properties': {
                    'title': {
                        'type': 'string',
                        'analyzer': 'extended_snowball_analyzer',
                    },
                }
            }
        }
    })
With that in place, delete your index and re-create it. Now you can use the _analyze tool again to see how it analyzes text on this particular field. Note that to do this we need to know the name of the index we used (so replace {myindexname} in the URL):
$ curl -XGET 'localhost:9200/{myindexname}/_analyze?field=title' -d 'about your special is a the word' | json_print
{
  "tokens": [
    {
      "end_offset": 18,
      "token": "special",
      "type": "<ALPHANUM>",
      "start_offset": 11,
      "position": 3
    },
    {
      "end_offset": 32,
      "token": "word",
      "type": "<ALPHANUM>",
      "start_offset": 28,
      "position": 7
    }
  ]
}
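Notice that "special" is still at position 3 and "word" at position 7, even though the words before them are gone. It looks like removed stop words still use up their position number. Here is a rough pure-Python imitation of that behaviour, just to illustrate the idea. The `analyze` function is my own toy, not how ElasticSearch is implemented: it splits on whitespace only and does no stemming.

```python
import re


def analyze(text, stop_words):
    """Toy analyzer: split on whitespace, lowercase, drop stop words.

    Positions are counted over ALL words, so a dropped stop word
    leaves a gap in the position numbers of the remaining tokens.
    """
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text), start=1):
        token = match.group().lower()
        if token in stop_words:
            continue  # dropped, but its position number is still used up
        tokens.append({
            "token": token,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


stop_words = {"about", "your", "is", "a", "the"}
for token in analyze("about your special is a the word", stop_words):
    print(token)
```

For this input the toy version gives the same offsets and positions as the real output above.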
Cool! Now we see that it treats "about" and "your" as stop words. Much better. This is handy too because you might have certain words that are not very common globally but are repeated a lot within your application and are not very useful there.
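One way to spot such application-specific words is to count how many of your titles each word appears in. A minimal sketch, assuming you can get your titles as a list of strings; the `frequent_words` helper, the 50% threshold and the sample titles are all made up for illustration.

```python
from collections import Counter
import re


def frequent_words(titles, min_share=0.5):
    """Return words that appear in at least `min_share` of the titles."""
    doc_freq = Counter()
    for title in titles:
        # set() so a word counts once per title, however often it repeats
        doc_freq.update(set(re.findall(r"[a-z']+", title.lower())))
    threshold = min_share * len(titles)
    return sorted(w for w, n in doc_freq.items() if n >= threshold)


# Hypothetical titles from a site that is mostly about Django tips.
titles = [
    "Django tip: faster templates",
    "Django tip: caching views",
    "Python tip: list comprehensions",
    "Django and ElasticSearch",
]

print(frequent_words(titles))  # ['django', 'tip']
```

Words that come out of this are only candidates. Whether they belong in STOPWORDS is a judgment call for your content.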
Thank you willkg and Erik Rose for your support in tracking this down!