

Elasticsearch term vector

  • 1. Concept
  • 2. When term vector data is generated
  • 3. Data exploration
    • 3.1 Basic exploration
    • 3.2 Term vectors for specified terms
    • 3.3 Term vectors with a specified analyzer
    • 3.4 term vector filter
    • 3.5 multi term vector

1. Concept
A term vector exposes statistics about the individual, indivisible terms (tokens) inside a document field. The information includes:
  1. term information: term frequency in the field, i.e. how many times the term appears in this field.
  2. term positions: the positions at which the term appears in the field.
  3. start and end offsets: the character offsets at which the term starts and ends; the start is inclusive, the end exclusive. For example, if a document field contains "abc def ghi", then abc has a start offset of 0 and an end offset of 3.
  4. term payloads: extra per-position data attached to a term. In this article the payloads are produced by the type_as_payload token filter, which stores the token type (for example "word").
  5. term statistics: statistics about the term across the index, returned when term_statistics is set to true. They include total term frequency (how often the term occurs across all documents) and document frequency (how many documents contain the term).
  6. field statistics: statistics about the field, including document count (how many documents contain this field), sum of document frequency (the sum of the document frequencies of all terms in this field), and sum of total term frequency (the sum of the total term frequencies of all terms in this field).
Elasticsearch's own documentation points out that term statistics and field statistics are not accurate: documents that have already been deleted are still counted. This is because, when Elasticsearch receives a delete request, it only marks the data as deleted and does not remove it immediately.
In practice term vectors are rarely used; they mostly serve ad-hoc data exploration. For example, Meituan can surface the terms customers search for most often and feed them into search suggestions and term recommendations.
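To see where positions and offsets come from, you can run a field value through the analyze API. Below is a minimal sketch using the built-in whitespace analyzer on the "abc def ghi" example from item 3 above; no index is needed:
GET _analyze
{
  "analyzer": "whitespace",
  "text": "abc def ghi"
}

Each token in the response carries its position, start_offset and end_offset; a term vector aggregates exactly this per-token information and adds the term and field statistics described above.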
2. When term vector data is generated
A term vector involves a number of statistics about terms and fields, and Elasticsearch offers two ways to collect them.
  1. index time
    Term vector collection is enabled in the mapping when the index is created; Elasticsearch then records the statistics as documents are indexed. Index time suits indices whose term vector data needs to be explored frequently.
For example, look closely at the difference between the my_text and fullname definitions below: my_text enables term vectors at index time through the term_vector mapping option (valid values include no, yes, with_positions, with_offsets, with_positions_offsets and with_positions_offsets_payloads) and stores the field, while fullname leaves term_vector off and therefore relies on query time.
PUT index_name
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "type_as_payload" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads",
        "store": true,
        "analyzer": "fulltext_analyzer"
      },
      "fullname": {
        "type": "text",
        "analyzer": "fulltext_analyzer"
      }
    }
  }
}

  2. query time
    Elasticsearch collects the statistics at query time. This mode is also called "on the fly" and suits indices whose term vectors are rarely explored. In the mapping above, the fullname field uses query time. The exploration syntax is exactly the same for query time and index time (see the sketch after the test data below). Unless you have special requirements, query time is good enough, and index writes are more efficient than with index time.
Test data:
POST /index_name/_doc/1
{
  "fullname" : "Kerwin Kim",
  "my_text" : "hello test test test "
}

PUT /index_name/_doc/2
{
  "fullname" : "Kerwin Kim",
  "my_text" : "other hello test ..."
}
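As a quick check that query time really does use the same request syntax, the sketch below asks for the term vectors of the query-time field fullname on document 1. It assumes the two test documents above have been indexed and reuses the endpoint style of this article:
GET /index_name/_doc/1/_termvectors
{
  "fields" : ["fullname"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

The statistics are computed on the fly, so the response has the same shape as the index-time examples in the next section.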

3. Data exploration
3.1 Basic exploration
Use the termvectors API to inspect the term vector statistics of a single document.
GET /index_name/_doc/1/_termvectors
{
  "fields" : ["my_text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Result:
{ "_index" : "index_name", "_type" : "_doc", "_id" : "1", "_version" : 1, "found" : true, "took" : 2, "term_vectors" : { "my_text" : { "field_statistics" : { "sum_doc_freq" : 6, "doc_count" : 2, "sum_ttf" : 8 }, "terms" : { "hello" : { "doc_freq" : 2, "ttf" : 2, "term_freq" : 1, "tokens" : [ { "position" : 0, "start_offset" : 0, "end_offset" : 5, "payload" : "d29yZA==" } ] }, "test" : { "doc_freq" : 2, "ttf" : 4, "term_freq" : 3, "tokens" : [ { "position" : 1, "start_offset" : 6, "end_offset" : 10, "payload" : "d29yZA==" }, { "position" : 2, "start_offset" : 11, "end_offset" : 15, "payload" : "d29yZA==" }, { "position" : 3, "start_offset" : 16, "end_offset" : 20, "payload" : "d29yZA==" } ] } } } } }

3.2 Term vectors for specified terms
In real projects, looking at the term vectors of a single document is usually too narrow; more often we want statistics for a handful of terms across the whole index.
Spell out the text to analyze inside the "doc" object; Elasticsearch treats it as an artificial document, analyzes it on the fly, and reports index-wide statistics for the resulting terms.
GET /index_name/_termvectors
{
  "doc": {
    "fullname": "Kerwin Kim",
    "my_text": "hello test"
  },
  "fields" : ["my_text", "fullname"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Result:
{ "_index" : "index_name", "_type" : "_doc", "_version" : 0, "found" : true, "took" : 8, "term_vectors" : { "fullname" : { "field_statistics" : { "sum_doc_freq" : 4, "doc_count" : 2, "sum_ttf" : 4 }, "terms" : { "kerwin" : { "doc_freq" : 2, "ttf" : 2, "term_freq" : 1, "tokens" : [ { "position" : 0, "start_offset" : 0, "end_offset" : 6 } ] }, "kim" : { "doc_freq" : 2, "ttf" : 2, "term_freq" : 1, "tokens" : [ { "position" : 1, "start_offset" : 7, "end_offset" : 10 } ] } } }, "my_text" : { "field_statistics" : { "sum_doc_freq" : 6, "doc_count" : 2, "sum_ttf" : 8 }, "terms" : { "hello" : { "doc_freq" : 2, "ttf" : 2, "term_freq" : 1, "tokens" : [ { "position" : 0, "start_offset" : 0, "end_offset" : 5 } ] }, "test" : { "doc_freq" : 2, "ttf" : 4, "term_freq" : 1, "tokens" : [ { "position" : 1, "start_offset" : 6, "end_offset" : 10 } ] } } } } }

3.3 Term vectors with a specified analyzer
If you do not want the text in "doc" to be analyzed with the analyzer configured when the index was created, use per_field_analyzer to pick an analyzer for each field in "doc" individually.
For example, the request below applies the "english" analyzer to the my_text field instead of the "fulltext_analyzer" configured on the index (the english analyzer stems words, so testing becomes test).
GET /index_name/_termvectors
{
  "doc": {
    "fullname": "Kerwin Kim",
    "my_text": "hello testing"
  },
  "fields" : ["my_text", "fullname"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true,
  "per_field_analyzer": {
    "my_text": "english",
    "fullname": "standard"
  }
}
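To double-check the stemming behaviour on its own, the same text can be run through the analyze API. A minimal sketch, independent of any term vector request:
GET _analyze
{
  "analyzer": "english",
  "text": "hello testing"
}

The response contains the tokens hello and test, matching the terms the request above reports for my_text.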

3.4 term vector filter
Not everything in an exploration result is data we actually care about, and Elasticsearch can filter part of it out for us.
The filter supports, among others, the following settings:
  1. max_num_terms: the maximum number of terms to return.
  2. min_term_freq: the minimum number of times a term must appear in the field.
  3. min_doc_freq: the minimum number of documents a term must appear in.
  4. max_doc_freq: the maximum number of documents a term may appear in (this is the one used in the example below).
GET /index_name/_termvectors
{
  "doc": {
    "fullname": "Kerwin Kim",
    "my_text": "hello test"
  },
  "fields" : ["my_text", "fullname"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true,
  "per_field_analyzer": {
    "my_text": "english"
  },
  "filter": {
    "max_doc_freq": 1,
    "min_term_freq": 2,
    "max_num_terms": 3
  }
}

3.5 multi term vector
The mtermvectors API explores several documents in one request and can be seen as a complement to section 3.1.
GET _mtermvectors
{
  "docs": [
    {
      "_index": "index_name",
      "_id": 1,
      "term_statistics": true,
      "offsets": false
    },
    {
      "_index": "index_name",
      "_id": 2,
      "fields": [ "my_text" ],
      "offsets": true
    }
  ]
}

Result:
{ "docs" : [ { "_index" : "index_name", "_type" : "_doc", "_id" : "1", "_version" : 1, "found" : true, "took" : 0, "term_vectors" : { "my_text" : { "field_statistics" : { "sum_doc_freq" : 6, "doc_count" : 2, "sum_ttf" : 8 }, "terms" : { "hello" : { "doc_freq" : 2, "ttf" : 2, "term_freq" : 1, "tokens" : [ { "position" : 0, "payload" : "d29yZA==" } ] }, "test" : { "doc_freq" : 2, "ttf" : 4, "term_freq" : 3, "tokens" : [ { "position" : 1, "payload" : "d29yZA==" }, { "position" : 2, "payload" : "d29yZA==" }, { "position" : 3, "payload" : "d29yZA==" } ] } } } } }, { "_index" : "index_name", "_type" : "_doc", "_id" : "2", "_version" : 1, "found" : true, "took" : 0, "term_vectors" : { "my_text" : { "field_statistics" : { "sum_doc_freq" : 6, "doc_count" : 2, "sum_ttf" : 8 }, "terms" : { "..." : { "term_freq" : 1, "tokens" : [ { "position" : 3, "start_offset" : 17, "end_offset" : 20, "payload" : "d29yZA==" } ] }, "hello" : { "term_freq" : 1, "tokens" : [ { "position" : 1, "start_offset" : 6, "end_offset" : 11, "payload" : "d29yZA==" } ] }, "other" : { "term_freq" : 1, "tokens" : [ { "position" : 0, "start_offset" : 0, "end_offset" : 5, "payload" : "d29yZA==" } ] }, "test" : { "term_freq" : 1, "tokens" : [ { "position" : 2, "start_offset" : 12, "end_offset" : 16, "payload" : "d29yZA==" } ] } } } } } ] }
