Elasticsearch-mapping

Mapping

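Concept: a mapping is the metadata that describes the type of each field in ES data. When an index is created and documents are added, dynamic mapping automatically assigns a suitable mapping to each kind of data. A mapping records the field type, how the field is searched (exact match or full-text search), the analyzer, and so on.
Viewing the mapping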

GET /product/_mapping
{
"product" : {
"mappings" : {
"properties" : {
"desc" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"price" : {
"type" : "long"
},
"tags" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
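Search modes:
- Exact match (exact value): when the inverted index is built, the whole field value is indexed as a single term instead of being split by an analyzer.
- Full-text match (full text): the value is analyzed before indexing, which covers tokenization, near-synonyms, synonyms, easily confused words, case normalization, part of speech, filtering, tense conversion, and so on.
Dynamic mapping
- Common type conversions:
  
  Data              Type
  "Elasticsearch"   text/keyword
  123456            long
  123.123           float
  true/false        boolean
  2020-04-12        date
- Why is the number 123456 mapped to long?
  ES detects the type from what the JSON parser sees, and JSON has no separate integer and long types, so for whole numbers dynamic mapping picks the wider type, long.
- Why is 123.123 mapped to float when some references say double?
  This is not an ES bug. The current official documentation lists float as the dynamically mapped type for JSON floating-point numbers, which matches the observed behavior; references that say double describe older releases.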
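
The conversion rules above can be pictured with a short Python sketch. It is only an approximation written for this article: the function name `infer_dynamic_type` is hypothetical, the date check is a rough stand-in that accepts plain `yyyy-MM-dd` strings only, and the real detection rules in ES are broader.

```python
import re
from datetime import date

# Rough stand-in for date detection: plain yyyy-MM-dd strings only (assumption).
_ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def infer_dynamic_type(value):
    """Guess the type dynamic mapping would pick for a JSON value (simplified)."""
    # bool must be tested before int: in Python, True is also an int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        if _ISO_DATE.match(value):
            try:
                date.fromisoformat(value)
                return "date"
            except ValueError:
                pass  # looks like a date but is not a valid calendar date
        return "text"  # dynamic mapping also adds a keyword sub-field
    if isinstance(value, list):
        # An array takes the type of its first non-null element.
        for item in value:
            if item is not None:
                return infer_dynamic_type(item)
        return None
    if isinstance(value, dict):
        return "object"
    return None  # a null value adds no field


doc = {
    "name": "xiaomi phone",
    "count": 123456,
    "price": 123.123,
    "date": "2020-05-20",
    "isdel": False,
    "tags": ["xingjiabi", "fashao", "buka"],
}
inferred = {field: infer_dynamic_type(value) for field, value in doc.items()}
print(inferred)
```

The sample document is the one indexed into the dm index later in this article, so the printed types can be compared with the mapping ES reports there.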

ES data types

Core data types
- Numeric types:
  1. long, integer, short, byte, double, float, half_float, scaled_float
  2. Choose the smallest type whose range still covers the requirement.
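  3. Floating-point types
    
    Type           Value range
    double         64-bit double precision
    float          32-bit single precision
    half_float     16-bit half precision
    scaled_float   floating-point number stored with a scaling factor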
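  4. For double, float and half_float, -0.0 and +0.0 are different values. A term query for -0.0 does not match +0.0, and vice versa. Likewise, in a range query an upper bound of -0.0 does not match +0.0, and a lower bound of +0.0 does not match -0.0.
  5. For scaled_float: if a price only needs to be accurate to the cent, a price of 67.34 with a scaling factor of 100 is stored as 6734.
  6. Prefer scaled_float with a scaling factor where it fits the data.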
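- Strings
  1. keyword: suited to indexing structured fields; it can be used for filtering and aggregations. A keyword field can only be found by its exact value. IDs should use keyword.
  2. text: use it for fields that need full-text search, such as email bodies, content, or product descriptions. The content of a text field is analyzed: before the inverted index is built, the analyzer splits the string into individual terms. text fields are not used for sorting and are rarely used for aggregations.
  3. Why do text fields not support sorting and aggregations by default? Doing so requires fielddata, which takes up a lot of space, especially for high-cardinality text fields. Once fielddata is loaded onto the heap, it stays there for the lifetime of the segment. Loading it is also an expensive process that can cause latency for users.
  4. It is often useful to have both a full-text (text) and a keyword version of the same field: one for full-text search, the other for aggregations and sorting.
- date: matched as an exact value
- boolean
- binary
- range: integer_range, float_range, long_range, double_range, date_range.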
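
The scaled_float item in the numeric-types list can be made concrete with a little arithmetic. The sketch below assumes the stored value is the rounded product of the value and the scaling factor; the helper names `encode_scaled_float` and `decode_scaled_float` are hypothetical and exist only for this illustration.

```python
def encode_scaled_float(value, scaling_factor):
    """Return the whole number stored for a scaled_float value."""
    return round(value * scaling_factor)


def decode_scaled_float(stored, scaling_factor):
    """Recover the floating-point value from the stored whole number."""
    return stored / scaling_factor


stored = encode_scaled_float(67.34, 100)
print(stored)                            # 6734
print(decode_scaled_float(stored, 100))  # 67.34

# Precision finer than the scaling factor is lost when the value is stored.
print(encode_scaled_float(67.349, 100))  # 6735
```

Because the stored number is an integer, a price that only needs cent precision takes less space than a full double, which is the reason for preferring scaled_float where it fits.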
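Complex types
- Object: for a single JSON object
- Nested: for arrays of JSON objects
Geo types
- Geo-point: latitude/longitude points
- Geo-shape: for complex shapes such as polygons
Specialized types:
- IP: ip, for IPv4 and IPv6 addresses
- Completion: provides auto-complete suggestions
- Token count: token_count, counts the number of tokens in a string
- Murmur3: computes a hash of the values at index time and stores it in the index
- Annotated-text: indexes text containing special markup (typically used to identify named entities)
- Percolator: accepts queries from the query DSL
- Join: defines parent/child relations between documents within the same index
- Rank features: records numeric features to boost hits at query time.
- Dense vector: records dense vectors of float values.
- Sparse vector: records sparse vectors of float values.
- Search-as-you-type: a text field optimized for queries that implement as-you-type completion
- Alias: defines an alias for an existing field.
- Flattened: allows an entire JSON object to be indexed as a single field.
- Shape: shape, for arbitrary Cartesian geometries.
- Histogram: histogram, for pre-aggregated numeric values used in percentile aggregations.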
- Constant keyword: a specialization of keyword for the case where all documents have the same value.
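Array: Elasticsearch has no dedicated array data type. By default any field can hold zero or more values; however, all values in an array must have the same data type.
New in ES 7:
- Date_nanos: date with nanosecond resolution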

Creating mappings manually: the mapping of a field can only be created; it cannot be modified afterwards.

Analysis
- Syntax
  
  GET /_analyze
  {
  "analyzer": "standard",
  "text": ["2020-05-20"]
  }
- Dynamic mapping
  - Syntax
    
    PUT /dm/_doc/1
    {
    "name": "xiaomi phone",
    "desc": "shouji zhong de zhandouji",
    "count": 123456,
    "price": 123.123,
    "date": "2020-05-20",
    "isdel": false,
    "tags": [
    "xingjiabi",
    "fashao",
    "buka"
    ]
    }
  - View the mapping
    
    {
    "dm" : {
    "mappings" : {
    "properties" : {
    "count" : {
    "type" : "long"
    },
    "date" : {
    "type" : "date"
    },
    "desc" : {
    "type" : "text",
    "fields" : {
    "keyword" : {
    "type" : "keyword",
    "ignore_above" : 256
    }
    }
    },
    "isdel" : {
    "type" : "boolean"
    },
    "name" : {
    "type" : "text",
    "fields" : {
    "keyword" : {
    "type" : "keyword",
    "ignore_above" : 256
    }
    }
    },
    "price" : {
    "type" : "float"
    },
    "tags" : {
    "type" : "text",
    "fields" : {
    "keyword" : {
    "type" : "keyword",
    "ignore_above" : 256
    }
    }
    }
    }
    }
    }
    }
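  - Forward index (doc_values), inverted index (index), and fielddata
    - Memory: doc_values and the inverted index rely on system memory (the OS cache), while fielddata uses JVM heap memory.
    - With index=false the field cannot be searched, although its value is still present in _source. With doc_values=false the field cannot be aggregated on. Once set, doc_values and index cannot be changed without rebuilding the index. To aggregate on a text field without reindexing, the only option is to set fielddata=true on it. fielddata exists for aggregations.
    - Tuning: the official guidance is that ES relies heavily on the OS cache for performance. Caching in JVM memory is discouraged because it brings GC overhead and a risk of OOM errors, so give the JVM less memory and the OS cache more. For example, on a 64 GB server give the JVM at most 4 to 16 GB (1/16 to 1/4). A larger OS cache improves caching of doc values and the inverted index, and with it query efficiency.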
  - mapping parameters
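    - index: whether the field is indexed. Defaults to true. A field that is not indexed cannot be found through the index, but it still appears in _source.
    - analyzer: specifies the analyzer.
    - boost: relevance-score weight of the field. Defaults to 1.
    - coerce: whether type coercion is allowed. With true, "1" is converted to 1; with false, sending "1" to a numeric field raises an error.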
      
      DELETE coerce
      PUT coerce
      {
      "mappings": {
      "properties": {
      "number_one": {
      "type": "integer"
      },
      "number_two": {
      "type": "integer",
      "coerce": false
      }
      }
      }
      }
      PUT coerce/_doc/1
      {
      "number_one": "10"
      }
      // Rejected, because coerce is false for number_two
      PUT coerce/_doc/2
      {
      "number_two": "10"
      }
      // Error message
      {
      "error" : {
      "root_cause" : [
      {
      "type" : "mapper_parsing_exception",
      "reason" : "failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: '10'"
      }
      ],
      "type" : "mapper_parsing_exception",
      "reason" : "failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: '10'",
      "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "Integer value passed as String"
      }
      },
      "status" : 400
      }
      // Setting coercion for the whole index
      DELETE coerce
      PUT coerce
      {
      "settings": {
      "index.mapping.coerce": false
      },
      "mappings": {
      "properties": {
      "number_one": {
      "type": "integer",
      "coerce": true
      },
      "number_two": {
      "type": "integer"
      }
      }
      }
      }
      PUT coerce/_doc/1
      {
      "number_one": "10"
      }
      // Rejected, because number_two inherits the index-level coerce: false
      PUT coerce/_doc/2
      {
      "number_two": "10"
      }
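
The accept/reject behavior in the coerce requests above can be mimicked with a short Python sketch. It is a simplification under stated assumptions: `index_integer` is a hypothetical helper, `ValueError` stands in for the mapper_parsing_exception that ES returns, and with coerce disabled the sketch rejects everything that is not already an integer, which may be stricter than ES in edge cases.

```python
def index_integer(value, coerce=True):
    """Mimic how an integer field treats an incoming JSON value (simplified)."""
    if isinstance(value, bool):
        raise ValueError("booleans are not valid integer values")
    if isinstance(value, int):
        return value
    if not coerce:
        # Assumption of this sketch: without coercion only real integers pass.
        raise ValueError(f"coerce is disabled, rejecting {value!r}")
    if isinstance(value, str):
        value = float(value)  # raises ValueError for non-numeric strings
    if isinstance(value, float):
        return int(value)  # the fractional part is truncated
    raise ValueError(f"cannot coerce {value!r} to an integer")


print(index_integer("10"))   # 10
print(index_integer(10.7))   # 10

try:
    index_integer("10", coerce=False)
except ValueError as err:
    print("rejected:", err)
```

The field-level setting wins over the index-level one, which is why number_one is accepted and number_two is rejected in the second request sequence.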
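    - copy_to: copies the content of several fields into one new field so that they can be searched together. Indexing time increases as a result, and aggregation performance on the new field depends on how it compares with the original fields.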
      
      DELETE copy_to
      PUT copy_to
      {
      "mappings": {
      "properties": {
      "field1": {
      "type": "text",
      "copy_to": "field_all"
      },
      "field2": {
      "type": "text",
      "copy_to": "field_all"
      },
      "field_all": {
      "type": "text"
      }
      }
      }
      }
      PUT copy_to/_doc/1
      {
      "field1": "field1",
      "field2": "field2"
      }
      GET copy_to/_search
      GET copy_to/_search
      {
      "query": {
      "match": {
      "field_all": {
      "query": "field1 field2"
      }
      }
      }
      }
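    - doc_values: speeds up sorting and aggregations. Defaults to true. If you are sure a field will never be sorted or aggregated on, and never accessed from a script, you can disable doc values to save disk space (not supported by text and annotated_text fields).
    - dynamic: controls whether new fields can be added dynamically
      1. true: newly detected fields are added to the mapping (default).
      2. false: newly detected fields are ignored. They are not indexed and therefore cannot be searched, but they still appear in the _source of returned hits. They are not added to the mapping; new fields must be added explicitly.
      3. strict: if a new field is detected, an exception is thrown and the document is rejected. New fields must be added to the mapping explicitly.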
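
The three dynamic settings can be pictured with a small Python sketch. Everything in it is illustrative: the function name `apply_dynamic`, the placeholder type name, and the use of `ValueError` in place of the exception ES raises are assumptions, and the real behavior involves type detection that is left out here.

```python
def apply_dynamic(mapping, doc, dynamic="true"):
    """Return (indexed_fields, updated_mapping) for one document (simplified)."""
    updated = dict(mapping)
    indexed = {}
    for field, value in doc.items():
        if field in updated:
            indexed[field] = value
        elif dynamic == "true":
            updated[field] = "dynamically_mapped"  # placeholder type name
            indexed[field] = value
        elif dynamic == "false":
            continue  # stays in _source only: not indexed, not added to the mapping
        elif dynamic == "strict":
            raise ValueError(f"strict mapping, unknown field [{field}]")
        else:
            raise ValueError(f"unknown dynamic setting: {dynamic!r}")
    return indexed, updated


mapping = {"title": "text"}
doc = {"title": "hello", "views": 3}

print(apply_dynamic(mapping, doc, "true"))
print(apply_dynamic(mapping, doc, "false"))
try:
    apply_dynamic(mapping, doc, "strict")
except ValueError as err:
    print("rejected:", err)
```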
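    - eager_global_ordinals: set on fields used in aggregations to improve aggregation performance.
      1. Frozen indices: some indices are used heavily and are kept in memory. Others are used so rarely that it is preferable to rebuild their in-memory data structures when they are needed and discard them afterwards. Frozen indices are hit infrequently and are not suited to high search load. Their data is not kept in memory, so they use much less heap than regular indices. Frozen indices are read-only, and requests against them may take seconds or minutes.
      2. eager_global_ordinals does not apply to frozen indices.
    - enabled: whether the content is parsed and indexed. It can be set on a field or on the whole index. When disabled, the data can still be retrieved and shown in _source. Use it with caution, because the setting cannot be changed afterwards.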
      
      // Applied to the whole index
      PUT my_index
      {
      "mappings": {
      "enabled": false
      }
      }
      // Delete the index
      DELETE my_index
      // Applied to a single field
      PUT my_index
      {
      "mappings": {
      "properties": {
      "session_data": {
      "type": "object",
      "enabled": false
      }
      }
      }
      }
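    - term_vector:
    - store: whether the field value is stored separately. By default a field is indexed so that it can be queried, but its value is not stored on its own.
    - similarity: sets the relevance-scoring algorithm for the field. Supports BM25, classic (TF-IDF), and boolean.
    - search_analyzer: sets a separate analyzer for search time
      
      DELETE my_index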
      PUT my_index
      {
      "settings": {
      "analysis": {
      "filter": {
      "autocomplete_filter": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 20
      }
      },
      "analyzer": {
      "autocomplete": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
      "lowercase",
      "autocomplete_filter"
      ]
      }
      }
      }
      },
      "mappings": {
      "properties": {
      "text": {
      "type": "text",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
      }
      }
      }
      }
      PUT my_index/_doc/1
      {
      "text": "Quick Brown Fox"
      }
      GET my_index/_search
      {
      "query": {
      "match": {
      "text": {
      "query": "Quick Br",
      "operator": "and"
      }
      }
      }
      }
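
Why the query "Quick Br" matches in the search_analyzer example can be traced with a rough Python sketch. The helpers below are simplified stand-ins written for this article, not the real analyzers: `standard_tokens` only splits on whitespace and lowercases, and `edge_ngrams` imitates the edge_ngram filter with the min_gram and max_gram values from the settings above.

```python
def standard_tokens(text):
    """Very rough stand-in for the standard analyzer: split and lowercase."""
    return [token.lower() for token in text.split()]


def edge_ngrams(token, min_gram=1, max_gram=20):
    """All prefixes of token whose length lies between min_gram and max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def autocomplete_tokens(text):
    """Index-time analysis: lowercase tokens expanded into edge n-grams."""
    return [gram for token in standard_tokens(text) for gram in edge_ngrams(token)]


indexed_terms = set(autocomplete_tokens("Quick Brown Fox"))
query_terms = standard_tokens("Quick Br")  # search side uses the plain analyzer

print(sorted(indexed_terms))
# operator "and": every query term must be among the indexed terms.
print(all(term in indexed_terms for term in query_terms))  # True
```

Using the plain analyzer at search time keeps the query as the two terms the user typed. If the autocomplete analyzer were applied to the query as well, it would be expanded into single-letter prefixes such as "q" and "b", which are indexed for many unrelated documents.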
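    - properties: besides the top-level mapping, also used to define the sub-fields of object fields
    - position_increment_gap:
    - null_value: sets a default value to use in place of null, for example "null_value": "NULL"
    - norms: whether scoring norms are kept (they should be disabled on fields used only for filtering or aggregations).
    - normalizer:
    - meta: metadata attached to the field
    - index_prefixes: speeds up prefix searches
      1. min_chars: minimum prefix length, must be greater than 0, defaults to 2 (inclusive)
      2. max_chars: maximum prefix length, must be less than 20, defaults to 5 (inclusive)
      3. Example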
        
        PUT /my_index
        {
        "mappings": {
        "properties": {
        "number_one": {
        "type": "text",
        "index_prefixes": {
        "min_chars": 1,
        "max_chars": 10
        }
        }
        }
        }
        }
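    - index_phrases: speeds up exact phrase queries, at the cost of more disk space
    - index_options: controls which information is added to the inverted index for search and highlighting. Intended for text fields only. The values are docs, freqs, positions and offsets.
    - ignore_malformed: ignores values of the wrong type instead of rejecting the whole document
      
      DELETE my_index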
      PUT /my_index
      {
      "mappings": {
      "properties": {
      "number_one": {
      "type": "integer",
      "ignore_malformed": true
      },
      "number_two": {
      "type": "integer"
      }
      }
      }
      }
      // The value is malformed, but no exception is thrown
      PUT my_index/_doc/1
      {
      "text": "Some text value",
      "number_one": "foo"
      }
      GET /my_index/_search
      {
      "query": {
      "match_all": {}
      }
      }
      // Search result
      {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
      "total" : 1,
      "successful" : 1,
      "skipped" : 0,
      "failed" : 0
      },
      "hits" : {
      "total" : {
      "value" : 1,
      "relation" : "eq"
      },
      "max_score" : 1.0,
      "hits" : [
      {
      "_index" : "my_index",
      "_type" : "_doc",
      "_id" : "1",
      "_score" : 1.0,
      "_ignored" : [
      "number_one"
      ],
      "_source" : {
      "text" : "Some text value",
      "number_one" : "foo"
      }
      }
      ]
      }
      }
      // Malformed value on a field without ignore_malformed
      PUT my_index/_doc/2
      {
      "text": "Some text value",
      "number_two": "foo"
      }
      // Error output
      {
      "error" : {
      "root_cause" : [
      {
      "type" : "mapper_parsing_exception",
      "reason" : "failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'"
      }
      ],
      "type" : "mapper_parsing_exception",
      "reason" : "failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'",
      "caused_by" : {
      "type" : "number_format_exception",
      "reason" : "For input string: \"foo\""
      }
      },
      "status" : 400
      }
    - ignore_above: strings longer than this limit are ignored and not indexed
    - format: the date format
      
      PUT /map
      {
      "mappings": {
      "properties": {
      "date": {
      "type": "date",
      "format": "yyyy-MM-dd"
      }
      }
      }
      }
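    - fields: defines multi-fields, so that the same field can serve different purposes (full-text search versus aggregations, analysis and sorting). A typical case is a keyword sub-field on a text field.
      
      DELETE fields_test
      // Give city a keyword sub-field
      PUT fields_test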
      {
      "mappings": {
      "properties": {
      "city": {
      "type": "text",
      "fields": {
      "raw": {
      "type": "keyword"
      }
      }
      }
      }
      }
      }
      PUT fields_test/_doc/1
      {
      "city": "New York"
      }
      PUT fields_test/_doc/2
      {
      "city": "York"
      }
      // size = 0 means the raw hits are not returned
      GET fields_test/_search
      {
      "query": {
      "match": {
      "city": "york"
      }
      },
      "size": 0,
      "sort": {
      "city.raw": "asc"
      },
      "aggs": {
      "Cities": {
      "terms": {
      "field": "city.raw"
      }
      }
      }
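      }
    - fielddata:
      1. text fields use a query-time, in-memory data structure called fielddata. It is built on demand the first time the field is used for aggregations, sorting, or in a script. It is built by reading the entire inverted index of each segment from disk, inverting the term <-> document relationship, and storing the result in memory on the JVM heap.
      2. fielddata can take up a lot of heap space, especially when loading high-cardinality text fields. Once it has been loaded onto the heap, it stays there for the lifetime of the segment. Loading it is also an expensive process that can cause latency for users. This is why fielddata is disabled by default.
      3. Example of a failing aggregation:
        
        DELETE myindex
        // fielddata defaults to false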
        PUT myindex
        {
        "mappings": {
        "properties": {
        "address": {
        "type": "text"
        }
        }
        }
        }
        PUT myindex/_doc/1
        {
        "address": "New York"
        }
        // Aggregate
        GET myindex/_search
        {
        "aggs": {
        "arrs_name": {
        "terms": {
        "field": "address"
        }
        }
        }
        }
        // The aggregation fails with the following error
        {
        "error" : {
        "root_cause" : [
        {
        "type" : "illegal_argument_exception",
        "reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [address] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
        }
        ]
        }
        }
      4. Aggregating on a text field (fielddata=true)
        
        DELETE myindex
        PUT myindex
        {
        "mappings": {
        "properties": {
        "address": {
        "type": "text",
        "fielddata": true
        }
        }
        }
        }
        PUT myindex/_doc/1
        {
        "address": "New York"
        }
        GET myindex/_search
        {
        "aggs": {
        "arrs_name": {
        "terms": {
        "field": "address"
        }
        }
        }
        }
        // Aggregation result
        {
        "took" : 2,
        "timed_out" : false,
        "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
        },
        "hits" : {
        "total" : {
        "value" : 1,
        "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
        {
        "_index" : "myindex",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
        "address" : "New York"
        }
        }
        ]
        },
        "aggregations" : {
        "arrs_name" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
        {
        "key" : "new",
        "doc_count" : 1
        },
        {
        "key" : "york",
        "doc_count" : 1
        }
        ]
        }
        }
        }
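
The "uninverting" that builds fielddata, and the two buckets in the aggregation result above, can be reproduced with a toy Python sketch. It is a conceptual illustration only: the dictionaries and the function names `uninvert` and `terms_aggregation` are assumptions made for this example and say nothing about how ES lays out the data internally.

```python
from collections import Counter, defaultdict

# Toy inverted index for the address field: term -> ids of documents containing it.
# "New York" was analyzed into the lowercase terms "new" and "york".
inverted_index = {
    "new": {1},
    "york": {1},
}


def uninvert(index):
    """Flip term -> documents into document -> terms, the shape aggregations need."""
    per_doc = defaultdict(set)
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            per_doc[doc_id].add(term)
    return dict(per_doc)


def terms_aggregation(per_doc_terms):
    """Count, for each term, how many documents contain it."""
    counts = Counter()
    for terms in per_doc_terms.values():
        counts.update(terms)
    return dict(counts)


fielddata = uninvert(inverted_index)
buckets = terms_aggregation(fielddata)
print(sorted(buckets.items()))  # [('new', 1), ('york', 1)]
```

The buckets are the analyzed terms rather than the original string "New York", which is why a keyword sub-field is usually the better choice for aggregations.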