Elasticsearch: Analyzers

1. Registering analyzers

Analyzers, tokenizers, and filters can be configured in elasticsearch.yml:

    index:
      analysis:
        analyzer:
          standard:
            type: standard
            stopwords: [stop1, stop2]
          myAnalyzer1:
            type: standard
            stopwords: [stop1, stop2, stop3]
            max_token_length: 500
          myAnalyzer2:
            tokenizer: standard
            filter: [standard, lowercase, stop]
        tokenizer:
          myTokenizer1:
            type: standard
            max_token_length: 900
          myTokenizer2:
            type: keyword
            buffer_size: 512
        filter:
          myTokenFilter1:
            type: stop
            stopwords: [stop1, stop2, stop3, stop4]
          myTokenFilter2:
            type: length
            min: 0
            max: 2000

analyzer: ES ships with a number of built-in analyzers. You can also assemble your own analyzer (a custom analyzer) from the built-in character filters, tokenizers, and token filters:

    index:
      analysis:
        analyzer:
          myAnalyzer:
            tokenizer: standard
            filter: [standard, lowercase, stop]
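
To make the composition concrete, here is a minimal Python sketch of what a custom analyzer such as myAnalyzer does conceptually: one tokenizer followed by an ordered chain of token filters. It is a simulation, not Elasticsearch or Lucene code. The regex tokenizer is only a rough stand-in for the standard tokenizer, the stop-word list is a hypothetical short list, and the `standard` filter is left out for brevity.

```python
import re

# Hypothetical stop-word list for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is"}

def tokenize(text):
    """Rough stand-in for the standard tokenizer: keep runs of word characters."""
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    """Mimics the lowercase token filter."""
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=STOPWORDS):
    """Mimics the stop token filter: drop tokens found in the stop-word set."""
    return [t for t in tokens if t not in stopwords]

def analyze(text, tokenizer, filters):
    """Run the tokenizer once, then apply each token filter in the given order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

def my_analyzer(text):
    """Same shape as the config above: one tokenizer, then lowercase, then stop."""
    return analyze(text, tokenize, [lowercase_filter, stop_filter])
```

The order of the filter list matters: with `stop` placed before `lowercase`, a capitalized "The" would not match the lowercase stop-word list and would survive.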

To use a third-party analyzer plugin, you first need to register it in the configuration file elasticsearch.yml. The following example configures IkAnalyzer:

    index:
      analysis:
        analyzer:
          ik:
            alias: [ik_analyzer]
            type: org.elasticsearch.index.analysis.IkAnalyzerProvider

Once an analyzer has been registered under a name (its logical name) in the configuration file, you can refer to it by that name in mapping definitions and in some APIs.
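
As an illustration of referring to an analyzer by its logical name, the sketch below builds two request bodies as plain Python dictionaries: a mapping fragment whose field uses a named analyzer, and a body for the `_analyze` API. The field name `title` and the helper names are hypothetical. The field type defaults to `string`, which matches older releases (newer ones use `text`), and older releases accept `analyzer` and `text` as query-string parameters on `_analyze` rather than as a JSON body, so treat the exact shape as an assumption to check against your version.

```python
import json

def mapping_with_analyzer(field, analyzer, field_type="string"):
    """Mapping fragment: the field is analyzed with the analyzer of that logical name."""
    return {"properties": {field: {"type": field_type, "analyzer": analyzer}}}

def analyze_request(analyzer, text):
    """Body for the _analyze API: run `text` through the named analyzer."""
    return {"analyzer": analyzer, "text": text}

if __name__ == "__main__":
    # "title" is a hypothetical field; "myAnalyzer1" is the name registered in the config.
    print(json.dumps(mapping_with_analyzer("title", "myAnalyzer1"), indent=2))
    print(json.dumps(analyze_request("myAnalyzer1", "some text to analyze"), indent=2))
```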
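
2. Built-in analyzers, tokenizers, and filters

analyzer: some of the analyzers built into ES

| analyzer | logical name | description |
|---|---|---|
| standard analyzer | standard | standard tokenizer, standard filter, lower case filter, stop filter |
| simple analyzer | simple | lower case tokenizer |
| stop analyzer | stop | lower case tokenizer, stop filter |
| keyword analyzer | keyword | no tokenization; the whole content becomes a single token (like not_analyzed) |
| whitespace analyzer | whitespace | whitespace tokenizer |
| pattern analyzer | pattern | splits text with a regular expression; the default pattern is \W+ |
| language analyzers | lang | analyzers for various languages |
| snowball analyzer | snowball | standard tokenizer, standard filter, lower case filter, stop filter, snowball filter |
| custom analyzer | custom | one tokenizer, zero or more token filters, zero or more char filters |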
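
The differences between several of these analyzers are easiest to see on one input. The following Python sketch imitates the keyword, whitespace, simple, and pattern analyzers with regular expressions. These are approximations written for illustration, not the Lucene implementations, and the function names are made up here.

```python
import re

def keyword_analyzer(text):
    """No tokenization: the whole input is one token."""
    return [text]

def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are preserved."""
    return text.split()

def simple_analyzer(text):
    """Lower case tokenizer: runs of letters, lowercased."""
    # [^\W\d_] matches a word character that is neither a digit nor an underscore.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

def pattern_analyzer(text, pattern=r"\W+"):
    """Split on a regular expression (default \\W+), then lowercase."""
    return [t.lower() for t in re.split(pattern, text) if t]
```

For the input "Quick-Brown fox_2 U", the keyword version keeps the string whole, the whitespace version keeps "Quick-Brown" and "fox_2" intact, the simple version drops the digit and underscore, and the pattern version keeps "fox_2" because `_` and digits are word characters.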
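
tokenizer: the tokenizers built into ES

| tokenizer | logical name | description |
|---|---|---|
| standard tokenizer | standard | |
| edge ngram tokenizer | edgeNGram | |
| keyword tokenizer | keyword | no tokenization |
| letter tokenizer | letter | splits into words at non-letter characters |
| lowercase tokenizer | lowercase | letter tokenizer, lower case filter |
| ngram tokenizer | nGram | |
| whitespace tokenizer | whitespace | splits on whitespace |
| pattern tokenizer | pattern | splits on a separator defined by a regular expression |
| uax email url tokenizer | uax_url_email | does not split URLs and email addresses |
| path hierarchy tokenizer | path_hierarchy | handles strings of the form /path/to/something |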
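
Three of the tokenizers above are easier to understand from their output than from their names. The sketch below imitates the ngram, edge ngram, and path hierarchy tokenizers in plain Python, applied to a single word or path. The default gram sizes of 1 and 2 follow the Elasticsearch defaults as far as I know; the function names are hypothetical and the real tokenizers have more options than shown here.

```python
def ngrams(text, min_gram=1, max_gram=2):
    """All substrings of length min_gram..max_gram, ordered by start position."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                tokens.append(text[start:start + size])
    return tokens

def edge_ngrams(text, min_gram=1, max_gram=2):
    """Only the n-grams anchored at the beginning of the text (prefixes)."""
    longest = min(max_gram, len(text))
    return [text[:size] for size in range(min_gram, longest + 1)]

def path_hierarchy(path, delimiter="/"):
    """One token per level of the path, each including all of its ancestors."""
    parts = path.split(delimiter)
    tokens = []
    for depth in range(1, len(parts) + 1):
        token = delimiter.join(parts[:depth])
        if token:  # skip the empty prefix produced by a leading delimiter
            tokens.append(token)
    return tokens
```

Edge n-grams are the usual building block for prefix-style autocomplete, while path hierarchy tokens let a query for /path/to match every document stored beneath that directory.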
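
token filter: the token filters built into ES

| token filter | logical name | description |
|---|---|---|
| standard filter | standard | |
| ascii folding filter | asciifolding | |
| length filter | length | removes tokens that are too long or too short |
| lowercase filter | lowercase | converts tokens to lower case |
| ngram filter | nGram | |
| edge ngram filter | edgeNGram | |
| porter stem filter | porterStem | Porter stemming algorithm |
| shingle filter | shingle | combines adjacent tokens into shingles (word n-grams) |
| stop filter | stop | removes stop words |
| word delimiter filter | word_delimiter | splits a word further into sub-tokens |
| stemmer token filter | stemmer | |
| stemmer override filter | stemmer_override | |
| keyword marker filter | keyword_marker | |
| keyword repeat filter | keyword_repeat | |
| kstem filter | kstem | |
| snowball filter | snowball | |
| phonetic filter | phonetic | provided as a plugin |
| synonym filter | synonym | handles synonyms |
| compound word filter | dictionary_decompounder, hyphenation_decompounder | decomposes compound words |
| reverse filter | reverse | reverses each token |
| elision filter | elision | removes elisions (for example the "l'" in "l'avion") |
| truncate filter | truncate | truncates tokens to a fixed length |
| unique filter | unique | |
| pattern capture filter | pattern_capture | |
| pattern replace filter | pattern_replace | replaces text using a regular expression |
| trim filter | trim | trims whitespace around each token |
| limit token count filter | limit | limits the number of tokens |
| hunspell filter | hunspell | stemming based on Hunspell (spell-checker) dictionaries |
| common grams filter | common_grams | |
| normalization filter | arabic_normalization, persian_normalization | |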
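
A token filter takes the token stream from the tokenizer and returns a modified stream. The Python sketch below imitates four of the simpler filters from the table (length, shingle, truncate, unique) as list-to-list functions. It is an illustration under simplifying assumptions: the helper names are hypothetical, and the real shingle filter by default also emits the original single tokens next to the shingles, which this version omits.

```python
def length_filter(tokens, min_len=0, max_len=2000):
    """Keep only tokens whose length lies within [min_len, max_len]."""
    return [t for t in tokens if min_len <= len(t) <= max_len]

def shingle_filter(tokens, size=2, separator=" "):
    """Join each run of `size` adjacent tokens into one shingle (word n-gram)."""
    return [separator.join(tokens[i:i + size])
            for i in range(len(tokens) - size + 1)]

def truncate_filter(tokens, length=10):
    """Cut every token down to at most `length` characters."""
    return [t[:length] for t in tokens]

def unique_filter(tokens):
    """Drop repeated tokens, keeping the first occurrence of each."""
    seen = set()
    result = []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            result.append(t)
    return result
```

Because every filter has the same list-in, list-out shape, they can be chained in any order, which is what the `filter: [...]` list in an analyzer definition expresses.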
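
character filter: the character filters built into ES

| character filter | logical name | description |
|---|---|---|
| mapping char filter | mapping | replaces characters according to a configured mapping |
| html strip char filter | html_strip | strips HTML elements |
| pattern replace char filter | pattern_replace | processes the string with a regular expression |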
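
Character filters run before the tokenizer and rewrite the raw text. As a rough illustration, the Python sketch below imitates the three filters in the table. The function names are hypothetical and the behavior is simplified: the mapping version applies replacements one after another rather than by longest match, and the HTML version uses a naive tag regex that would not cope with comments, scripts, or malformed markup.

```python
import html
import re

def mapping_char_filter(text, mappings):
    """Replace each configured source string with its target, in dict order."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text

def html_strip_char_filter(text):
    """Remove anything that looks like a tag, then decode HTML entities."""
    without_tags = re.sub(r"<[^>]+>", "", text)
    return html.unescape(without_tags)

def pattern_replace_char_filter(text, pattern, replacement):
    """Regex substitution over the raw text before tokenization."""
    return re.sub(pattern, replacement, text)
```

A typical use of the pattern replace step is to rewrite a separator the tokenizer would otherwise split on, for example turning "123-456-789" into "123_456_789" so that the number stays a single token.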
