Multi-Field Feature and Configuring a Custom Analyzer in the Mapping

This article covers the multi-field feature and how to configure a custom Analyzer in the mapping.
Contents

  • Error: org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried
  • Multi-field feature
  • Exact Values vs. Full Text
  • Exact Values don't need analysis
  • Custom analysis
  • Character Filter
  • Tokenizer
  • Token Filters
  • Setting up a custom Analyzer
    • Submit a request that strips HTML tags
    • Use a char filter to replace hyphens
    • Use a char filter to replace emoticons
    • Regular expression replacement
    • Split by path hierarchy
    • whitespace and stop
    • After adding lowercase, "The" is removed as a stopword
    • Watch the 10-minute video for chapter 20, then add notes
Error: org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried
```
# Cause: a previous Elasticsearch process is still running and holding the node lock.
# Kill the process, then start Elasticsearch again:
kill -9 `ps -ef | grep [e]lasticsearch | grep [j]ava | awk '{print $2}'`
elasticsearch
```

Multi-field feature

  • Exact matching on vendor names: add a keyword sub-field
  • Use a different analyzer per sub-field:
    • different languages
    • searching against a pinyin sub-field
  • Different analyzers can also be specified for search and for indexing (see the sketch below)
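A minimal sketch of a multi-field mapping (the index name products and the field name company are hypothetical): the same source text gets a keyword sub-field for exact matching and an english sub-field with its own analyzer, while search_analyzer illustrates specifying different analyzers for search and for indexing.

```
PUT /products
{
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "simple",
        "fields": {
          "keyword": { "type": "keyword" },
          "english": { "type": "text", "analyzer": "english" }
        }
      }
    }
  }
}
```

Full-text queries then target company, exact matches target company.keyword, and English stemming targets company.english; a pinyin sub-field would be configured the same way once the pinyin analysis plugin is installed.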
Exact Values vs. Full Text

  • Exact Values: numbers, dates, or a specific string (e.g. "Apple Store")
    • keyword in Elasticsearch
  • Full Text: unstructured text data
    • text in Elasticsearch
Exact Values don't need analysis

  • Elasticsearch creates an inverted index for every field
  • An exact value needs no special analysis at index time (see the check below)
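A quick way to confirm this with the _analyze API, reusing the "Apple Store" example from above: the built-in keyword analyzer emits the whole input as a single token.

```
POST _analyze
{
  "analyzer": "keyword",
  "text": "Apple Store"
}
```

The response contains exactly one token, "Apple Store", with no splitting or lowercasing.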
Custom analysis

When the analyzers that ship with Elasticsearch don't meet your needs, you can define a custom analyzer by combining different components (see the sketch after this list):

  • Character Filter
  • Tokenizer
  • Token Filter
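A minimal sketch of a custom analyzer declared in the index settings and attached to a field in the mapping; my_index, my_analyzer, and content are hypothetical names, and the component choices simply reuse pieces demonstrated later in this article.

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
```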
Character Filter

  • Processes the text before it reaches the Tokenizer, e.g. adding, removing, or replacing characters. Multiple Character Filters can be configured. They can affect the position and offset information the Tokenizer produces.
  • Some built-in Character Filters:
    • html_strip: removes HTML tags
    • mapping: string replacement
    • pattern_replace: regex-based replacement
Tokenizer

  • Splits the raw text into terms (tokens) according to certain rules
  • Built-in Tokenizers: whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy
  • You can also implement your own Tokenizer as a Java plugin
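Most of these Tokenizers are demonstrated below, but uax_url_email is not; a quick sketch (the sample text is made up) showing that it keeps URLs and e-mail addresses intact as single tokens:

```
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "visit https://www.elastic.co or mail admin@example.com"
}
```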
Token Filters

  • Add, modify, or remove the terms produced by the Tokenizer
  • Some built-in Token Filters: lowercase / stop / synonym (adds synonyms)
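lowercase and stop are demonstrated at the end of this article; for synonym, here is a sketch using an inline filter definition in _analyze (the laptop/notebook synonym pair is made up):

```
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "synonym", "synonyms": ["laptop, notebook"] }
  ],
  "text": "Laptop"
}
```

Both laptop and notebook are emitted at position 0, so a query for either term will match.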
Setting up a custom Analyzer

Submit a request that strips the HTML tags:
```
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}
```

Response:
{ " tokens" : [ { " token" : " hello world" , " start_offset" : 3, " end_offset" : 18, " type" : " word" , " position" : 0 } ] }

Use a char filter to replace hyphens:
```
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "123-456, I-test! test-990 650-555-1234"
}
```

Response:
{ " tokens" : [ { " token" : " 123_456" , " start_offset" : 0, " end_offset" : 7, " type" : " < NUM> " , " position" : 0 }, { " token" : " I_test" , " start_offset" : 9, " end_offset" : 15, " type" : " < ALPHANUM> " , " position" : 1 }, { " token" : " test_990" , " start_offset" : 17, " end_offset" : 25, " type" : " < ALPHANUM> " , " position" : 2 }, { " token" : " 650_555_1234" , " start_offset" : 26, " end_offset" : 38, " type" : " < NUM> " , " position" : 3 } ] }

Use a char filter to replace emoticons:
```
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [":) => happy", ":( => sad"]
    }
  ],
  "text": ["I am felling :)", "Feeling :( today"]
}
```

Response:
{ " tokens" : [ { " token" : " I" , " start_offset" : 0, " end_offset" : 1, " type" : " < ALPHANUM> " , " position" : 0 }, { " token" : " am" , " start_offset" : 2, " end_offset" : 4, " type" : " < ALPHANUM> " , " position" : 1 }, { " token" : " felling" , " start_offset" : 5, " end_offset" : 12, " type" : " < ALPHANUM> " , " position" : 2 }, { " token" : " happy" , " start_offset" : 13, " end_offset" : 15, " type" : " < ALPHANUM> " , " position" : 3 }, { " token" : " Feeling" , " start_offset" : 16, " end_offset" : 23, " type" : " < ALPHANUM> " , " position" : 104 }, { " token" : " sad" , " start_offset" : 24, " end_offset" : 26, " type" : " < ALPHANUM> " , " position" : 105 }, { " token" : " today" , " start_offset" : 27, " end_offset" : 32, " type" : " < ALPHANUM> " , " position" : 106 } ] }

Regular expression replacement:
```
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}
```

Response:
" tokens" : [ { " token" : " www.elastic.co" , " start_offset" : 0, " end_offset" : 21, " type" : " < ALPHANUM> " , " position" : 0 } ] }

Split by path hierarchy:
```
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/ymruan/a/b"
}
```

Response:
{ " tokens" : [ { " token" : " /usr" , " start_offset" : 0, " end_offset" : 4, " type" : " word" , " position" : 0 }, { " token" : " /usr/ymruan" , " start_offset" : 0, " end_offset" : 11, " type" : " word" , " position" : 0 }, { " token" : " /usr/ymruan/a" , " start_offset" : 0, " end_offset" : 13, " type" : " word" , " position" : 0 }, { " token" : " /usr/ymruan/a/b" , " start_offset" : 0, " end_offset" : 15, " type" : " word" , " position" : 0 } ] }

whitespace with the stop filter:
```
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": ["The rain in Spain falls mainly on the plain."]
}
```

Response:
{ " tokens" : [ { " token" : " The" , " start_offset" : 0, " end_offset" : 3, " type" : " word" , " position" : 0 }, { " token" : " rain" , " start_offset" : 4, " end_offset" : 8, " type" : " word" , " position" : 1 }, { " token" : " Spain" , " start_offset" : 12, " end_offset" : 17, " type" : " word" , " position" : 3 }, { " token" : " falls" , " start_offset" : 18, " end_offset" : 23, " type" : " word" , " position" : 4 }, { " token" : " mainly" , " start_offset" : 24, " end_offset" : 30, " type" : " word" , " position" : 5 }, { " token" : " plain." , " start_offset" : 38, " end_offset" : 44, " type" : " word" , " position" : 8 } ] }

After adding lowercase before stop, "The" is lowercased first and then removed as a stopword:
```
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": ["The rain in Spain falls mainly on the plain."]
}
```

Response:
{ " tokens" : [ { " token" : " rain" , " start_offset" : 4, " end_offset" : 8, " type" : " word" , " position" : 1 }, { " token" : " spain" , " start_offset" : 12, " end_offset" : 17, " type" : " word" , " position" : 3 }, { " token" : " falls" , " start_offset" : 18, " end_offset" : 23, " type" : " word" , " position" : 4 }, { " token" : " mainly" , " start_offset" : 24, " end_offset" : 30, " type" : " word" , " position" : 5 }, { " token" : " plain." , " start_offset" : 38, " end_offset" : 44, " type" : " word" , " position" : 8 } ] }

To do: watch the 10-minute video for chapter 20, then add more notes.
