This article covers the multi-field feature and how to configure a custom Analyzer in a Mapping, and is intended as a helpful reference.
Contents
- Error: org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried
- Multi-field feature
- Exact Values vs. Full Text
- Exact Values do not need analysis
- Custom analyzers
- Character Filter
- Tokenizer
- Token Filters
- Setting up a Custom Analyzer
- Submit a request that strips HTML tags
- Use a char filter to replace hyphens
- Replace emoticons with a char filter
- Regular expressions
- Split by path hierarchy
- whitespace and stop
- After adding lowercase, "The" is removed as a stopword
- Re-listen to the 10-minute video in Chapter 20 and take more notes
Error: org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried
# Cause: an old Elasticsearch process is still running and holding the node lock
# Kill that process, then start Elasticsearch again
kill -9 `ps -ef | grep [e]lasticsearch | grep [j]ava | awk '{print $2}'`
elasticsearch
Multi-field feature
Exact match on a vendor name: add a keyword sub-field
Use different analyzers per sub-field: different languages, or searching on a pinyin sub-field
Different analyzers can also be specified for indexing and for searching (see the mapping sketch below)
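A minimal mapping sketch of the multi-field idea (the index name products and field name company are hypothetical; the english sub-field simply reuses the built-in english analyzer):
# hypothetical index and field names, for illustration only
PUT products
{
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          },
          "english": {
            "type": "text",
            "analyzer": "english"
          }
        }
      }
    }
  }
}
Here company supports full-text search, company.keyword supports exact matching, and each field can also set analyzer and search_analyzer separately when index-time and search-time analysis need to differ.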
Exact Values vs. Full Text
Exact Values: numbers, dates, and concrete strings (e.g. "Apple Store")
- the keyword type in Elasticsearch
Full Text: unstructured text data
- the text type in Elasticsearch
Exact Values do not need analysis. Elasticsearch builds an inverted index for every field,
but an exact value does not go through any special analysis at index time.
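To see this, you can run the keyword analyzer over the example value above; it emits the whole string as one unmodified term (a minimal illustration, not tied to any index):
GET _analyze
{
  "analyzer": "keyword",
  "text": "Apple Store"
}
The response contains a single token, "Apple Store", which is exactly how an exact value is stored in the inverted index.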
Custom analyzers: when the analyzers that ship with Elasticsearch do not fit, you can define your own by combining different components:
Character Filter
Tokenizer
Token Filter
Character Filters process the text before the Tokenizer, for example adding, removing, or replacing characters. Multiple Character Filters can be configured, and they affect the position and offset information seen by the Tokenizer.
Some built-in Character Filters:
html_strip - removes HTML tags
mapping - string replacement
pattern_replace - regex-based replacement
The Tokenizer splits the raw text into terms (tokens) according to a set of rules.
Built-in Tokenizers in Elasticsearch:
whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy (uax_url_email is sketched right below)
You can also implement your own Tokenizer as a Java plugin.
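As one concrete case (the sample text below is made up), the uax_url_email tokenizer behaves like standard but keeps URLs and e-mail addresses as single terms:
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "contact admin@example.com or visit http://www.elastic.co"
}
Here admin@example.com and http://www.elastic.co each come back as one token, whereas the standard tokenizer would break them into pieces.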
Token Filters add, modify, or delete the terms emitted by the Tokenizer.
Built-in Token Filters:
lowercase / stop / synonym (adds synonyms; see the sketch below)
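The synonym filter is not demonstrated in the rest of this article; as a quick sketch (the synonym pair "quick, fast" and the sample text are made up), a filter can be defined inline in an _analyze request:
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "synonym",
      "synonyms": ["quick, fast"]
    }
  ],
  "text": "a quick car"
}
With this filter, "quick" and "fast" are emitted at the same position, so a query for either term will match.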
Setting up a Custom Analyzer
Submit a request that strips HTML tags
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}
Response:
{
  "tokens" : [
    {
      "token" : "hello world",
      "start_offset" : 3,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
Use a char filter to replace hyphens
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "123-456, I-test! test-990 650-555-1234"
}
Result:
{
  "tokens" : [
    {
      "token" : "123_456",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "I_test",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "test_990",
      "start_offset" : 17,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "650_555_1234",
      "start_offset" : 26,
      "end_offset" : 38,
      "type" : "<NUM>",
      "position" : 3
    }
  ]
}
Replace emoticons with a char filter
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [":) => happy", ":( => sad"]
    }
  ],
  "text": ["I am felling :)", "Feeling :( today"]
}
Response:
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "am",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "felling",
      "start_offset" : 5,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "happy",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "Feeling",
      "start_offset" : 16,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 104
    },
    {
      "token" : "sad",
      "start_offset" : 24,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 105
    },
    {
      "token" : "today",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 106
    }
  ]
}
Regular expressions (pattern_replace)
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}
Result:
{
  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
Split by path hierarchy
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/ymruan/a/b"
}
Result:
{
  "tokens" : [
    {
      "token" : "/usr",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/ymruan",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/ymruan/a",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/ymruan/a/b",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    }
  ]
}
whitespace and stop
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": ["The rain in Spain falls mainly on the plain."]
}
Result:
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "rain",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Spain",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "falls",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "mainly",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "plain.",
      "start_offset" : 38,
      "end_offset" : 44,
      "type" : "word",
      "position" : 8
    }
  ]
}
After adding lowercase, "The" is treated as a stopword and removed
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": ["The rain in Spain falls mainly on the plain."]
}
Result:
{
  "tokens" : [
    {
      "token" : "rain",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "spain",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "falls",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "mainly",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "plain.",
      "start_offset" : 38,
      "end_offset" : 44,
      "type" : "word",
      "position" : 8
    }
  ]
}
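Finally, for the "custom Analyzer in a Mapping" part of the title: the components exercised above can be combined into a custom analyzer declared under the index settings and then referenced from a field in the mapping. The request below is only a sketch (the index name my_index, the field name content, and the particular combination of components are made up for illustration):
# hypothetical index, for illustration only
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => happy", ":( => sad"]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "emoticons"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
Once the index exists, GET my_index/_analyze with "analyzer": "my_custom_analyzer" runs the same kind of test as the stand-alone _analyze requests above.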
To do: re-listen to the 10-minute video in Chapter 20 and take more notes.