
Name Search: How to Improve the Results

How do you build a usable username search on top of Elasticsearch?

Some background first: we need to expose a search API that matches user input against name, name pinyin, employee ID, nickname, and so on.

Versions we iterated through:

  • Standard analyzer
    • Poor support for Chinese
  • wildcard field type
    • For security reasons the total number of returned hits is capped at 10, so an exact match could fall outside the top 10
    • We also hit a pitfall here: Huawei Cloud Elasticsearch is based on OpenSearch, which does not support the wildcard field type
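For reference, the rejected wildcard approach mapped the searchable fields with the wildcard field type (available in Elasticsearch since 7.9). This is a minimal sketch, not the template we actually ran; the template name is made up, and the field names are borrowed from the templates later in this post:

```
PUT /_template/jiankunking-attr-wildcard
{
  "index_patterns": ["jiankunking-attrs*"],
  "mappings": {
    "properties": {
      "attrs.username.attrValue": { "type": "wildcard" },
      "attrs.nickname.attrValue": { "type": "wildcard" }
    }
  }
}
```

Queries then use wildcard patterns such as `"wildcard": {"attrs.username.attrValue": {"value": "*新伟*"}}`, which is exactly the part OpenSearch-based offerings reject at mapping time.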

Since most users' names are Chinese characters, the first thing that came to mind for the name field was the IK analyzer.

The IK approach

Building the Dockerfile

FROM registry.jiankunking.com/library/elasticsearch:7.13.4

# RUN ./bin/elasticsearch-plugin install --batch https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.13.4/elasticsearch-analysis-ik-7.13.4.zip
ADD ik/elasticsearch-analysis-ik-7.13.4.zip /usr/share/elasticsearch
RUN ./bin/elasticsearch-plugin install --batch file:///usr/share/elasticsearch/elasticsearch-analysis-ik-7.13.4.zip

ADD pinyin/elasticsearch-analysis-pinyin-7.13.4.zip /usr/share/elasticsearch
RUN ./bin/elasticsearch-plugin install --batch file:///usr/share/elasticsearch/elasticsearch-analysis-pinyin-7.13.4.zip

# Remove the plugin install packages
RUN rm -f /usr/share/elasticsearch/elasticsearch-analysis-ik-7.13.4.zip \
    /usr/share/elasticsearch/elasticsearch-analysis-pinyin-7.13.4.zip

The IK plugin zip is checked into the repository, so it is installed from a local file instead of online.
The name dictionary has been placed at elasticsearch-analysis-ik-7.13.4/config/custom_user_name.dic
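A .dic dictionary is a plain UTF-8 text file with one entry per line. A hypothetical fragment of custom_user_name.dic (the names are made up for illustration):

```
孙新伟
新伟
王豆豆
```

Note that partial terms such as 新伟 must be listed explicitly if they are to be searchable; this enumeration burden comes up again in the conclusions below.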

IKAnalyzer.cfg.xml has been updated accordingly:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionary here -->
    <entry key="ext_dict">custom_user_name.dic</entry>
    <!-- Users can configure their own extension stop-word dictionary here -->
    <!-- <entry key="ext_stopwords"></entry> -->
    <!-- Users can configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Users can configure a remote extension stop-word dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

Verification

Template

POST /_template/jiankunking-attr
{
  "order": 0,
  "index_patterns": [
    "jiankunking-attrs",
    "jiankunking-attrs-dev"
  ],
  "settings": {
    "index": {
      "number_of_shards": "6",
      "number_of_replicas": "1",
      "refresh_interval": "200ms"
    }
  },
  "mappings": {
    "dynamic_templates": [
      {
        "strings": {
          "mapping": {
            "type": "keyword"
          },
          "match_mapping_type": "string"
        }
      }
    ],
    "properties": {
      "id": {
        "type": "keyword"
      },
      "attrs.user_name_ik.attrValue": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "creator": {
        "type": "keyword"
      },
      "updater": {
        "type": "keyword"
      },
      "createdAt": {
        "format": "epoch_second",
        "type": "date"
      },
      "updatedAt": {
        "format": "epoch_second",
        "type": "date"
      }
    }
  }
}

The query verification steps are omitted here.
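As a quick sanity check, the _analyze API shows how a name is tokenized (the index name and sample text here are illustrative):

```
POST jiankunking-attrs-dev/_analyze
{
  "analyzer": "ik_max_word",
  "text": "孙新伟"
}
```

If 孙新伟 is listed in custom_user_name.dic it comes back as a single token; without the dictionary entry, IK typically falls back to splitting an unknown name character by character.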

Conclusions

1. A custom dictionary is required.
With the IK plugin but no custom dictionary, names are not tokenized correctly;
with a dictionary, you have to enumerate every term users might search for.
For example, if the dictionary only contains full names, searching by the latter part of a name fails: for 孙新伟, a search for 新伟 returns nothing, which does not meet expectations.
2. English names are only partially matched.
For example, searching for dam will not find Adam Dean.
3. The IK plugin must be installed, which requires a cluster restart.
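Point 2 is easy to reproduce: IK (like the standard analyzer) emits whole words, so the token dam never exists in the index, and a match query such as the following (field name as in the template above, index name illustrative) returns no hits for a user named Adam Dean:

```
GET jiankunking-attrs-dev/_search
{
  "query": {
    "match": {
      "attrs.user_name_ik.attrValue": "dam"
    }
  }
}
```

The name Adam Dean is indexed as the tokens adam and dean, and the query term dam equals neither, so nothing matches.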

The Ngram approach

Verification

Template

POST /_template/jiankunking-attr-ngram
{
  "order": 0,
  "index_patterns": [
    "jiankunking-attrs-v3-*"
  ],
  "settings": {
    "index": {
      "max_ngram_diff": "9",
      "refresh_interval": "200ms",
      "analysis": {
        "analyzer": {
          "ngram_analyzer": {
            "tokenizer": "ngram"
          }
        },
        "tokenizer": {
          "ngram": {
            "token_chars": [
              "letter",
              "digit"
            ],
            "min_gram": "1",
            "type": "ngram",
            "max_gram": "10"
          }
        }
      },
      "number_of_shards": "10",
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "dynamic_templates": [
      {
        "strings": {
          "mapping": {
            "type": "keyword"
          },
          "match_mapping_type": "string"
        }
      }
    ],
    "properties": {
      "attrs.pinyin.attrValue": {
        "search_analyzer": "ngram_analyzer",
        "analyzer": "ngram_analyzer",
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "createdAt": {
        "format": "epoch_second",
        "type": "date"
      },
      "creator": {
        "type": "keyword"
      },
      "attrs.nickname.attrValue": {
        "search_analyzer": "ngram_analyzer",
        "analyzer": "ngram_analyzer",
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "attrs.username.attrValue": {
        "search_analyzer": "ngram_analyzer",
        "analyzer": "ngram_analyzer",
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "attrs.user_id.attrValue": {
        "type": "keyword",
        "fields": {
          "text": {
            "search_analyzer": "ngram_analyzer",
            "analyzer": "ngram_analyzer",
            "type": "text"
          }
        }
      },
      "id": {
        "type": "keyword"
      },
      "attrs": {
        "type": "object"
      },
      "updater": {
        "type": "keyword"
      },
      "updatedAt": {
        "format": "epoch_second",
        "type": "date"
      }
    }
  },
  "aliases": {}
}

The query verification steps are omitted here.
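Again the _analyze API makes the behavior concrete: with min_gram 1 and max_gram 10, the custom ngram_analyzer expands even a short nickname into every substring (the index name here is illustrative):

```
POST jiankunking-attrs-v3-dev/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "豆豆"
}
```

This produces the grams 豆, 豆豆, and 豆 (one per substring position), which is exactly why a search for 土豆儿 can also match 王豆豆: they share the single-character gram 豆.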

Conclusions

The returned results contain similar but unintended matches:
for example, searching for "土豆儿" also matches "王豆豆", "田豆豆", and so on.
Ngram analysis also consumes a lot of resources (disk space in particular), and a reindex may time out.

Final conclusion

Since there are fewer than one million users, the extra resource consumption stays within an acceptable range, so we ultimately went with the Ngram approach.

The best-performing query we settled on is:

{
  "size": 10,
  "_source": [
    "attrs.username.attrValue",
    "attrs.pinyin.attrValue",
    "attrs.user_id.attrValue",
    "attrs.nickname.attrValue"
  ],
  "query": {
    "multi_match": {
      "query": "jiankunking",
      "type": "best_fields",
      "fields": [
        "attrs.user_id.attrValue.text",
        "attrs.username.attrValue",
        "attrs.nickname.attrValue",
        "attrs.pinyin.attrValue"
      ]
    }
  }
}
