Elasticsearch term vector
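- 1. Concepts
- 2. When term vector data is generated
- 3. Data exploration
- 3.1 Basic data exploration
- 3.2 Exploring the term vector of specific terms
- 3.3 Exploring term vectors with a specified analyzer
- 3.4 term vector filter
- 3.5 multi term vector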
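1. Concepts
A term vector holds statistics about the individual terms (the indivisible tokens) in one field of a document. It covers the following:
- term information: term frequency in the field, that is, how many times the term occurs in that field.
- term positions: the ordinal position of each occurrence of the term in the field's token stream.
- start and end offsets: the character offsets of each occurrence; the start is inclusive and the end is exclusive. For example, if a document's field is "abc def ghi", then abc has start offset 0 and end offset 3.
- term payloads: arbitrary binary data stored with each token position (for example by a token filter) and returned base64-encoded. They are not term IDs assigned by Elasticsearch.
- term statistics: index-wide statistics for each term, returned only when term_statistics is set to true. They include total term frequency (how many times the term occurs across all documents) and document frequency (how many documents contain the term).
- field statistics: statistics for the field as a whole, including document count (how many documents contain this field), sum of document frequencies (the document frequencies of all terms in this field added together), and sum of total term frequencies (the total term frequencies of all terms in this field added together).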
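The per-document entries above (term frequency, positions, offsets) can be reproduced by hand. The following is a minimal sketch, assuming a whitespace tokenizer followed by lowercasing; the function name `term_vector` is hypothetical and this is not how Elasticsearch implements it internally.

```python
from collections import defaultdict


def term_vector(text):
    """Split on whitespace, lowercase, and record position and offsets per token."""
    terms = defaultdict(list)
    position = 0
    index = 0
    while index < len(text):
        if text[index].isspace():
            index += 1
            continue
        start = index
        while index < len(text) and not text[index].isspace():
            index += 1
        token = text[start:index].lower()
        # end_offset is exclusive, matching the "abc def ghi" example above
        terms[token].append(
            {"position": position, "start_offset": start, "end_offset": index}
        )
        position += 1
    return {
        term: {"term_freq": len(tokens), "tokens": tokens}
        for term, tokens in terms.items()
    }


vector = term_vector("abc def ghi")
print(vector["abc"])
```

For "abc def ghi" the entry for abc has term_freq 1, position 0, start offset 0 and end offset 3.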
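In practice term vectors are rarely used; they mostly serve data exploration. For example, Meituan can look up the terms its customers search for most often and use them for search suggestions and term recommendations.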
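2. When term vector data is generated
Term vectors involve many statistics about terms and fields, and Elasticsearch offers two ways to collect them.
- index time
Enable term vectors in the mapping when the index is created; Elasticsearch then stores the term vector of each document as it is indexed. This mode suits indices whose term vectors are inspected frequently.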
PUT index_name
{
"settings": {
"index": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"analysis": {
"analyzer": {
"fulltext_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"type_as_payload"
]
}
}
}
},
"mappings": {
"properties": {
"my_text": { # index time
"type": "text",
"term_vector": "with_positions_offsets_payloads", # accepted values: no, yes, with_positions, with_offsets, with_positions_offsets, with_positions_payloads, with_positions_offsets_payloads
"store": true,
"analyzer": "fulltext_analyzer"
},
"fullname": { # query time
"type": "text",
"analyzer": "fulltext_analyzer"
}
}
}
}
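The type_as_payload filter in the analyzer above stores each token's type as its payload, and the term vectors API returns payloads base64-encoded. A minimal sketch, assuming the whitespace tokenizer emits tokens of type "word"; the helper names are hypothetical:

```python
import base64


def encode_payload(token_type):
    """Base64-encode a token type the way it appears in a term vectors response."""
    return base64.b64encode(token_type.encode("utf-8")).decode("ascii")


def decode_payload(payload):
    """Turn a base64 payload from the response back into readable text."""
    return base64.b64decode(payload).decode("utf-8")


print(encode_payload("word"))
print(decode_payload("d29yZA=="))
```

This is why every token in the sample responses in this article carries the same payload value, "d29yZA==".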
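- query time
Elasticsearch computes the statistics when the request is made. This approach is also called "on the fly" and suits indices whose term vectors are rarely inspected. In the example above, the fullname field relies on query time. The request syntax for data exploration is identical for query time and index time. Unless you have special requirements, query time is sufficient, and index writes are faster than with index time because nothing extra is stored when documents are written.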
POST /index_name/_doc/1
{
"fullname" : "Kerwin Kim",
"my_text" : "hello test test test "
}
PUT /index_name/_doc/2
{
"fullname" : "Kerwin Kim",
"my_text" : "other hello test ..."
}
3. Data exploration
3.1 Basic data exploration
Use the termvectors API to inspect the term vector statistics of a single document.
GET /index_name/_termvectors/1
{
"fields" : ["my_text"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}
Result:
{
"_index" : "index_name",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"found" : true,
"took" : 2,
"term_vectors" : {
"my_text" : {
"field_statistics" : {
"sum_doc_freq" : 6,
"doc_count" : 2,
"sum_ttf" : 8
},
"terms" : {
"hello" : {
"doc_freq" : 2,
"ttf" : 2,
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 5,
"payload" : "d29yZA=="
}
]
},
"test" : {
"doc_freq" : 2,
"ttf" : 4,
"term_freq" : 3,
"tokens" : [
{
"position" : 1,
"start_offset" : 6,
"end_offset" : 10,
"payload" : "d29yZA=="
},
{
"position" : 2,
"start_offset" : 11,
"end_offset" : 15,
"payload" : "d29yZA=="
},
{
"position" : 3,
"start_offset" : 16,
"end_offset" : 20,
"payload" : "d29yZA=="
}
]
}
}
}
}
}
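The field_statistics and term statistics in this response can be checked by hand against the two indexed documents. A minimal sketch, assuming the analyzer behaves like a lowercase whitespace split; the `analyze` and `field_statistics` helpers are hypothetical:

```python
from collections import Counter

# The my_text values of the two documents indexed earlier
docs = {"1": "hello test test test ", "2": "other hello test ..."}


def analyze(text):
    """Stand-in for fulltext_analyzer: whitespace tokenizer plus lowercase."""
    return text.lower().split()


def field_statistics(texts):
    doc_count = 0
    doc_freq = Counter()  # number of documents containing each term
    ttf = Counter()       # total occurrences of each term across documents
    for text in texts:
        tokens = analyze(text)
        if not tokens:
            continue
        doc_count += 1
        ttf.update(tokens)
        doc_freq.update(set(tokens))
    stats = {
        "doc_count": doc_count,
        "sum_doc_freq": sum(doc_freq.values()),
        "sum_ttf": sum(ttf.values()),
    }
    return stats, doc_freq, ttf


stats, doc_freq, ttf = field_statistics(docs.values())
print(stats)
print(doc_freq["test"], ttf["test"])
```

The sketch yields doc_count 2, sum_doc_freq 6 and sum_ttf 8, and for the term test a doc_freq of 2 and a ttf of 4, which agrees with the response above.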
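3.2 Exploring the term vector of specific terms
In real projects, looking at the term vector of a single document is too narrow; usually we want the statistics of a handful of terms across the whole index.
List the terms to inspect in "doc". Elasticsearch analyzes this artificial document on the fly and reports the index-wide statistics of its terms.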
GET /index_name/_termvectors
{
"doc": {
"fullname": "Kerwin Kim",
"my_text": "hello test"
},
"fields" : ["my_text", "fullname"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}
Result:
{
"_index" : "index_name",
"_type" : "_doc",
"_version" : 0,
"found" : true,
"took" : 8,
"term_vectors" : {
"fullname" : {
"field_statistics" : {
"sum_doc_freq" : 4,
"doc_count" : 2,
"sum_ttf" : 4
},
"terms" : {
"kerwin" : {
"doc_freq" : 2,
"ttf" : 2,
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 6
}
]
},
"kim" : {
"doc_freq" : 2,
"ttf" : 2,
"term_freq" : 1,
"tokens" : [
{
"position" : 1,
"start_offset" : 7,
"end_offset" : 10
}
]
}
}
},
"my_text" : {
"field_statistics" : {
"sum_doc_freq" : 6,
"doc_count" : 2,
"sum_ttf" : 8
},
"terms" : {
"hello" : {
"doc_freq" : 2,
"ttf" : 2,
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 5
}
]
},
"test" : {
"doc_freq" : 2,
"ttf" : 4,
"term_freq" : 1,
"tokens" : [
{
"position" : 1,
"start_offset" : 6,
"end_offset" : 10
}
]
}
}
}
}
}
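A response like this is easier to scan once it is flattened into one row per term. A minimal sketch that works on an abbreviated copy of the my_text part of the response; the `summarize` helper is hypothetical:

```python
import json

# Abbreviated copy of the my_text part of the response above
response = json.loads("""
{
  "term_vectors": {
    "my_text": {
      "field_statistics": {"sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8},
      "terms": {
        "hello": {"doc_freq": 2, "ttf": 2, "term_freq": 1},
        "test": {"doc_freq": 2, "ttf": 4, "term_freq": 1}
      }
    }
  }
}
""")


def summarize(response):
    """Return (field, term, term_freq, doc_freq, ttf) rows sorted by field and term."""
    rows = []
    for field, vector in response["term_vectors"].items():
        for term, stats in vector["terms"].items():
            # doc_freq and ttf are only present when term_statistics was requested
            rows.append(
                (field, term, stats["term_freq"], stats.get("doc_freq"), stats.get("ttf"))
            )
    return sorted(rows)


for row in summarize(response):
    print(row)
```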
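3.3 Exploring term vectors with a specified analyzer
If the terms in doc should not be analyzed with the analyzer configured when the index was created, use per_field_analyzer to choose an analyzer for each field in doc.
In the request below, the my_text field uses the "english" analyzer instead of the "fulltext_analyzer" configured at index creation. (english applies stemming, so testing becomes test.)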
GET /index_name/_termvectors
{
"doc": {
"fullname": "Kerwin Kim",
"my_text": "hello testing"
},
"fields" : ["my_text", "fullname"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true,
"per_field_analyzer": {
"my_text": "english",
"fullname": "standard"
}
}
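3.4 term vector filter
Not every term returned by a data exploration request is of interest, and Elasticsearch can filter out the unwanted ones.
The filter accepts parameters such as:
- max_num_terms: the maximum number of terms returned per field.
- min_term_freq: the minimum number of times a term must occur in the field.
- min_doc_freq: the minimum number of documents a term must occur in.
- max_doc_freq: the maximum number of documents a term may occur in.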
GET /index_name/_termvectors
{
"doc": {
"fullname": "Kerwin Kim",
"my_text": "hello test"
},
"fields" : ["my_text", "fullname"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true,
"per_field_analyzer": {
"my_text": "english"
},
"filter": {
"max_doc_freq": 1,
"min_term_freq": 2,
"max_num_terms": 3
}
}
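The effect of these thresholds can be sketched on response-shaped data. This is a simplified illustration, not the Elasticsearch implementation: the `apply_filter` helper and its defaults are assumptions, and ranking by term frequency is a stand-in for the tf-idf based selection that Elasticsearch documents for this filter.

```python
def apply_filter(terms, min_term_freq=1, min_doc_freq=1,
                 max_doc_freq=None, max_num_terms=25):
    """Drop terms outside the thresholds, then keep at most max_num_terms."""
    kept = {}
    for term, stats in terms.items():
        if stats["term_freq"] < min_term_freq:
            continue
        if stats["doc_freq"] < min_doc_freq:
            continue
        if max_doc_freq is not None and stats["doc_freq"] > max_doc_freq:
            continue
        kept[term] = stats
    # Assumption: rank survivors by term frequency (Elasticsearch uses a tf-idf score)
    ranked = sorted(kept, key=lambda t: kept[t]["term_freq"], reverse=True)
    return {term: kept[term] for term in ranked[:max_num_terms]}


# Statistics of "hello test" against the two sample documents
terms = {
    "hello": {"term_freq": 1, "doc_freq": 2},
    "test": {"term_freq": 1, "doc_freq": 2},
}
print(apply_filter(terms, min_term_freq=2, max_doc_freq=1, max_num_terms=3))
print(apply_filter(terms, max_num_terms=1))
```

Under this reading, the filter in the request above leaves no terms for "hello test": both terms occur only once in the document and appear in two indexed documents, so they fail min_term_freq 2 as well as max_doc_freq 1.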
3.5 multi term vector
Inspect several documents in a single request; this can be seen as a complement to section 3.1.
GET _mtermvectors
{
"docs": [
{
"_index": "index_name",
"_id": 1,
"term_statistics": true,
"offsets": false
},
{
"_index": "index_name",
"_id": 2,
"fields": [
"my_text"
],
"offsets": true
}
]
}
Result:
{
"docs" : [
{
"_index" : "index_name",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"found" : true,
"took" : 0,
"term_vectors" : {
"my_text" : {
"field_statistics" : {
"sum_doc_freq" : 6,
"doc_count" : 2,
"sum_ttf" : 8
},
"terms" : {
"hello" : {
"doc_freq" : 2,
"ttf" : 2,
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"payload" : "d29yZA=="
}
]
},
"test" : {
"doc_freq" : 2,
"ttf" : 4,
"term_freq" : 3,
"tokens" : [
{
"position" : 1,
"payload" : "d29yZA=="
},
{
"position" : 2,
"payload" : "d29yZA=="
},
{
"position" : 3,
"payload" : "d29yZA=="
}
]
}
}
}
}
},
{
"_index" : "index_name",
"_type" : "_doc",
"_id" : "2",
"_version" : 1,
"found" : true,
"took" : 0,
"term_vectors" : {
"my_text" : {
"field_statistics" : {
"sum_doc_freq" : 6,
"doc_count" : 2,
"sum_ttf" : 8
},
"terms" : {
"..." : {
"term_freq" : 1,
"tokens" : [
{
"position" : 3,
"start_offset" : 17,
"end_offset" : 20,
"payload" : "d29yZA=="
}
]
},
"hello" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 1,
"start_offset" : 6,
"end_offset" : 11,
"payload" : "d29yZA=="
}
]
},
"other" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 5,
"payload" : "d29yZA=="
}
]
},
"test" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 2,
"start_offset" : 12,
"end_offset" : 16,
"payload" : "d29yZA=="
}
]
}
}
}
}
}
]
}
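When many documents need the same options, the _mtermvectors request body shown at the start of this section can be assembled programmatically. A minimal sketch; the `build_mtermvectors_body` helper is hypothetical and only builds the JSON body, it does not send the request:

```python
import json


def build_mtermvectors_body(index, doc_ids, **options):
    """Create one entry per document id, each carrying the same options."""
    return {
        "docs": [
            {"_index": index, "_id": doc_id, **options}
            for doc_id in doc_ids
        ]
    }


body = build_mtermvectors_body(
    "index_name", ["1", "2"], fields=["my_text"], term_statistics=True
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after GET _mtermvectors in the console.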