Elasticsearch Advanced Series (23)
Deep Dive into Search Techniques: Implementing Search-Time Search Suggestions with match_phrase_prefix
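Search suggestions
Search suggestions are also known as search as you type or autocomplete. Put simply, before the user has finished typing the query, suggestions for matching entries are already on screen.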
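- Suppose the index holds these titles and the user has typed hello w so far:
- hello world
- hello we
- hello win
- hello wind
- hello dog
- hello cat
- hello w -> the suggestions that should be offered:
- hello world
- hello we
- hello win
- hello wind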
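What search suggestions look like in practice
- Baidu -> type elas -> elasticsearch is suggested -> then elasticsearch权威指南 (Elasticsearch: The Definitive Guide)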
Importing the test data
PUT /waws_index/waws_type/1
{
"title":"hello world"
}
PUT /waws_index/waws_type/2
{
"title":"hello wind"
}
PUT /waws_index/waws_type/3
{
"title":"hello dark"
}
PUT /waws_index/waws_type/4
{
"title":"hello pig"
}
PUT /waws_index/waws_type/5
{
"title":"hello www.baidu.com"
}
- Run the suggestion query
GET /waws_index/waws_type/_search
{
"query": {
"match_phrase_prefix": {
"title": "hello w"
}
}
}
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.7854939,
"hits": [
{
"_index": "waws_index",
"_type": "waws_type",
"_id": "2",
"_score": 0.7854939,
"_source": {
"title": "hello wind"
}
},
{
"_index": "waws_index",
"_type": "waws_type",
"_id": "5",
"_score": 0.51623213,
"_source": {
"title": "hello www.baidu.com"
}
},
{
"_index": "waws_index",
"_type": "waws_type",
"_id": "1",
"_score": 0.51623213,
"_source": {
"title": "hello world"
}
}
]
}
}
It works much like match_phrase. The only difference is that the last term is treated as a prefix (this is the key point).
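- hello is matched as an ordinary term to find the docs that contain it
- w is treated as a prefix: the inverted index is scanned for every term that starts with w, and the docs containing those terms are collected
- Only the docs that contain both hello and a term starting with w are kept
- slop is then applied: a doc matches only if, within the allowed slop, the positions of hello and of the w-prefixed term in the doc line up with the query hello w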
- slop can be specified here as well, but only the last term is ever treated as a prefix
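- max_expansions: the maximum number of terms the prefix may expand to. Once that many terms have been collected, expansion stops, which bounds the cost
- Without a cap, the prefix has to be expanded into every term in the inverted index that starts with w, and that performs poorly. max_expansions (default 50) limits how many terms the w prefix may match before the scan of the inverted index stops.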
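The steps above can be sketched in plain Python. This is only a toy model under stated assumptions, not how Lucene implements the query: the positional index is a dict, whitespace splitting stands in for the standard analyzer, slop is fixed at 0, and the names build_index, expand_prefix and match_phrase_prefix are hypothetical helpers introduced for illustration.

```python
from bisect import bisect_left

def build_index(docs):
    """Toy positional index: term -> {doc_id: [positions]}."""
    index = {}
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split approximates the standard analyzer.
        for pos, term in enumerate(text.lower().split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

def expand_prefix(sorted_terms, prefix, max_expansions=50):
    """Collect terms starting with prefix, in sorted order, up to max_expansions."""
    expansions = []
    for term in sorted_terms[bisect_left(sorted_terms, prefix):]:
        if not term.startswith(prefix) or len(expansions) >= max_expansions:
            break
        expansions.append(term)
    return expansions

def match_phrase_prefix(index, query, max_expansions=50):
    """Slop-0 sketch: full terms first, then any expansion of the last term, all adjacent."""
    *full_terms, prefix = query.lower().split()
    hits = set()
    for last in expand_prefix(sorted(index), prefix, max_expansions):
        phrase = full_terms + [last]
        # Candidate docs must contain every term of the phrase.
        candidates = set(index.get(phrase[0], {}))
        for term in phrase[1:]:
            candidates &= set(index.get(term, {}))
        for doc_id in candidates:
            # The phrase matches if some start position has every term at start + offset.
            for start in index[phrase[0]][doc_id]:
                if all(start + i in index[t][doc_id] for i, t in enumerate(phrase)):
                    hits.add(doc_id)
                    break
    return sorted(hits)

docs = {
    1: "hello world",
    2: "hello wind",
    3: "hello dark",
    4: "hello pig",
    5: "hello www.baidu.com",
}
index = build_index(docs)
print(match_phrase_prefix(index, "hello w"))
print(match_phrase_prefix(index, "hello w", max_expansions=1))
```

With the five test documents, the uncapped call finds the same three docs as the real query above, while max_expansions=1 stops after the first w term in sorted order and so misses the others. That is the trade-off the parameter makes: less work, but possibly incomplete suggestions.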
Avoid this query where possible: the trailing prefix always has to scan a large part of the index, so performance can be very poor.
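If the query is used anyway, both slop and max_expansions go into the expanded form of the request body. The snippet below is a small sketch that builds such a body in Python. The helper name phrase_prefix_query and the values slop=2 and max_expansions=10 are assumptions chosen for illustration, not recommendations from the original text.

```python
import json

def phrase_prefix_query(field, text, slop=0, max_expansions=50):
    """Build a match_phrase_prefix request body (hypothetical helper)."""
    return {
        "query": {
            "match_phrase_prefix": {
                field: {
                    "query": text,
                    "slop": slop,                      # allowed position moves between terms
                    "max_expansions": max_expansions,  # cap on terms the last prefix expands to
                }
            }
        }
    }

body = phrase_prefix_query("title", "hello w", slop=2, max_expansions=10)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of GET /waws_index/waws_type/_search in place of the short form used earlier.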
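Elasticsearch Advanced Series (24)
Deep Dive into Search Techniques: Implementing Index-Time Search Suggestions with ngram Tokenization
How ngram and index-time search suggestions work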
- ngram
The word quick, split into ngrams at each of its 5 possible lengths:
- ngram length=1: q u i c k
- ngram length=2: qu ui ic ck
- ngram length=3: qui uic ick
- ngram length=4: quic uick
- ngram length=5: quick
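The table above is just a sliding window over the word. A minimal Python sketch (the function name ngrams is hypothetical, not an Elasticsearch API):

```python
def ngrams(word, n):
    """Return every substring of length n, sliding one character at a time."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

for n in range(1, 6):
    print(n, ngrams("quick", n))
```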
- edge ngram
The word quick, with every ngram anchored at the first letter:
- q
- qu
- qui
- quic
- quick
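Edge ngrams are simply the prefixes of the word, bounded by a minimum and maximum length. A minimal sketch, assuming parameter names that mirror the min_gram and max_gram settings of the edge_ngram filter (the function itself is hypothetical):

```python
def edge_ngrams(word, min_gram=1, max_gram=20):
    """Return the prefixes of word whose length is between min_gram and max_gram."""
    longest = min(max_gram, len(word))
    return [word[:size] for size in range(min_gram, longest + 1)]

print(edge_ngrams("quick"))
print(edge_ngrams("helloworld", min_gram=1, max_gram=3))
```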
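- Edge ngrams are applied to every word as an extra tokenization step at index time, and the resulting ngrams are what serve prefix-style search suggestions
Example: doc1 = hello world, doc2 = hello we. The inverted index then contains:
- h -> doc1, doc2
- he -> doc1, doc2
- hel -> doc1, doc2
- hell -> doc1, doc2
- hello -> doc1, doc2
- w -> doc1, doc2
- wo -> doc1
- wor -> doc1
- worl -> doc1
- world -> doc1
- we -> doc2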
- min ngram and max ngram bound the ngram lengths. With min ngram = 1 and max ngram = 3, helloworld yields only:
- h
- he
- hel
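- Searching for hello w:
- hello -> the term hello is found in doc1
- w -> the term w is found in doc1
- doc1 contains both hello and w, and their positions match too, so doc1 (hello world) is returned. doc2 (hello we) matches in the same way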
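At search time there is no longer any need to take a prefix and scan the whole inverted index. The query terms are simply looked up in the inverted index as they are, and a hit is a match. This is an ordinary match query, that is, plain full-text search.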
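The idea can be sketched as a toy index in Python. This is a simplified model under stated assumptions: whitespace splitting stands in for the standard tokenizer, the index is a plain dict, and build_edge_ngram_index and phrase_lookup are hypothetical helper names. Note that the lookup does direct dictionary access per query term and never iterates over the term list.

```python
def edge_ngrams(word, min_gram=1, max_gram=20):
    """Prefixes of word with length between min_gram and max_gram."""
    return [word[:size] for size in range(min_gram, min(max_gram, len(word)) + 1)]

def build_edge_ngram_index(docs, min_gram=1, max_gram=20):
    """Index-time work: every edge ngram keeps the position of its source word."""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            for gram in edge_ngrams(word, min_gram, max_gram):
                index.setdefault(gram, {}).setdefault(doc_id, set()).add(pos)
    return index

def phrase_lookup(index, query):
    """Search-time work: exact term lookups plus a position check, no prefix scan."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    hits = []
    for doc_id in sorted(candidates):
        # Term i of the query must sit exactly i positions after the first term.
        if any(all(start + i in postings[i][doc_id] for i in range(len(terms)))
               for start in postings[0][doc_id]):
            hits.append(doc_id)
    return hits

docs = {"doc1": "hello world", "doc2": "hello we"}
index = build_edge_ngram_index(docs)
print(sorted(index["w"]))
print(phrase_lookup(index, "hello w"))
print(phrase_lookup(index, "hello wo"))
```

The cost has moved to index time: the index holds many more terms, in exchange for cheap lookups on every keystroke.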
Trying out ngram
PUT /waws_index
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type":"edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type":"custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}
- Add the data
PUT /waws_index/waws_type/1
{
"title":"hello world"
}
PUT /waws_index/waws_type/2
{
"title":"hello win98"
}
PUT /waws_index/waws_type/3
{
"title":"hello pig"
}
PUT /waws_index/waws_type/4
{
"title":"hello dog"
}
PUT /waws_index/waws_type/5
{
"title":"hello wink"
}
PUT /waws_index/waws_type/6
{
"title":"hello www.baidu.com"
}
- Test the analyzer output
GET /waws_index/_analyze
{
"analyzer": "autocomplete",
"text": "hello world"
}
{
"tokens": [
{
"token": "h",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "he",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "hel",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "hell",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "hello",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "w",
"start_offset": 6,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "wo",
"start_offset": 6,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "wor",
"start_offset": 6,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "worl",
"start_offset": 6,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "world",
"start_offset": 6,
"end_offset": 11,
"type": "word",
"position": 1
}
]
}
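- Define the mapping
# Queries are analyzed with the standard analyzer, while the index is built with edge ngrams through the autocomplete analyzer. The mapping must be set while the index is still empty, so run this step before indexing the documents.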
PUT /waws_index/_mapping/waws_type
{
"properties": {
"title": {
"type":"string",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
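The reason for a separate search_analyzer can be seen by comparing what each analyzer would do to the query text. The sketch below only approximates the two analyzers in Python (lowercase plus whitespace split for standard, plus prefixes for autocomplete); both function names are hypothetical.

```python
def standard_like(text):
    """Rough stand-in for the standard analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def autocomplete_like(text, min_gram=1, max_gram=20):
    """Rough stand-in for the autocomplete analyzer: edge ngrams of each word, with positions."""
    tokens = []
    for pos, word in enumerate(standard_like(text)):
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append((word[:size], pos))
    return tokens

query = "hello w"
print(standard_like(query))
print([token for token, _ in autocomplete_like(query)])
```

If the query went through the autocomplete analyzer too, it would contain terms such as h and he, which every title with a word starting with h also contains, so the query would match far more than the user typed. Analyzing the query with standard keeps it as the two terms hello and w, which are then looked up among the indexed ngrams.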
- Search the data
GET /waws_index/waws_type/_search
{
"query": {
"match_phrase": {
"title": "hello w"
}
}
}
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0.8899311,
"hits": [
{
"_index": "waws_index",
"_type": "waws_type",
"_id": "2",
"_score": 0.8899311,
"_source": {
"title": "hello win98"
}
},
{
"_index": "waws_index",
"_type": "waws_type",
"_id": "6",
"_score": 0.8899311,
"_source": {
"title": "hello www.baidu.com"
}
},
{
"_index": "waws_index",
"_type": "waws_type",
"_id": "1",
"_score": 0.8271048,
"_source": {
"title": "hello world"
}
},
{
"_index": "waws_index",
"_type": "waws_type",
"_id": "5",
"_score": 0.8134969,
"_source": {
"title": "hello wink"
}
}
]
}
}
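- With match, docs that contain only hello are returned as well, because it is plain full-text search. They just score lower
- match_phrase is the recommended choice: it requires every term to be present, with positions exactly 1 apart, which is the behavior we want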
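The difference between the two queries on this data can be illustrated with a toy model. The sketch below ignores scoring entirely and only shows which docs match. It assumes whitespace tokenization, and index_titles, match and match_phrase are simplified hypothetical stand-ins for the real queries.

```python
def index_titles(docs, max_gram=20):
    """Edge-ngram index: gram -> {doc_id: set of word positions}."""
    index = {}
    for doc_id, title in docs.items():
        for pos, word in enumerate(title.lower().split()):
            for size in range(1, min(max_gram, len(word)) + 1):
                index.setdefault(word[:size], {}).setdefault(doc_id, set()).add(pos)
    return index

def match(index, query):
    """OR semantics: a doc matches if it contains any query term."""
    hits = set()
    for term in query.lower().split():
        hits |= set(index.get(term, {}))
    return sorted(hits)

def match_phrase(index, query):
    """Every term must be present, at consecutive positions."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    hits = []
    for doc_id in sorted(postings[0]):
        if all(doc_id in posting for posting in postings) and any(
            all(start + i in postings[i][doc_id] for i in range(len(terms)))
            for start in postings[0][doc_id]
        ):
            hits.append(doc_id)
    return hits

docs = {
    1: "hello world",
    2: "hello win98",
    3: "hello pig",
    4: "hello dog",
    5: "hello wink",
    6: "hello www.baidu.com",
}
index = index_titles(docs)
print(match(index, "hello w"))
print(match_phrase(index, "hello w"))
```

In this model match returns all six docs because each contains hello, while match_phrase keeps only the four whose second word starts with w, the same four docs as in the response above.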