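Elasticsearch: Vector-Based Scoring
This feature is still experimental and may change in future releases. Vector-based scoring currently works with the following two field types:
- dense_vector
- sparse_vector
Preparing the data
We first create an index named books and define its mapping as follows: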
PUT books
{
  "mappings": {
    "properties": {
      "author": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "category": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "format": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "isbn13": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "pages": {
        "type": "long"
      },
      "price": {
        "type": "float"
      },
      "publisher": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "rating": {
        "type": "float"
      },
      "release_year": {
        "type": "date",
        "format": "strict_year"
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "vector_recommendation": {
        "type": "dense_vector",
        "dims": 3
      }
    }
  }
}
Next, we use the bulk API to import the data:
PUT books/_bulk
{ "index" : { "_id" : "database-internals" } }
{"isbn13":"978-1492040347","author":"Alexander Petrov", "title":"Database Internals: A deep-dive into how distributed data systems work","publisher":"O'Reilly","category":["databases","information systems"],"pages":350,"price":47.28,"format":"paperback","rating":4.5, "release_year" : "2019", "vector_recommendation" : [3.5, 4.5, 5.2]}
{ "index" : { "_id" : "designing-data-intensive-applications" } }
{"isbn13":"978-1449373320", "author":"Martin Kleppmann", "title":"Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems","publisher":"O'Reilly","category":["databases" ],"pages":590,"price":31.06,"format":"paperback","rating":4.4, "release_year" : "2017", "vector_recommendation" : [5.9, 4.4, 6.8]}
{ "index" : { "_id" : "kafka-the-definitive-guide" } }
{"isbn13":"978-1491936160","author":[ "Neha Narkhede", "Gwen Shapira", "Todd Palino"], "title":"Kafka: The Definitive Guide: Real-time data and stream processing at scale", "publisher":"O'Reilly","category":["databases" ],"pages":297,"price":37.31,"format":"paperback","rating":3.9, "release_year" : "2017", "vector_recommendation" : [2.97, 3.9, 6.2]}
{ "index" : { "_id" : "effective-java" } }
{"isbn13":"978-1491936160","author": "Joshua Block", "title":"Effective Java", "publisher":"Addison-Wesley", "category":["programming languages", "java" ],"pages":412,"price":27.91,"format":"paperback","rating":4.2, "release_year" : "2017", "vector_recommendation" : [4.12, 4.2, 7.2]}
{ "index" : { "_id" : "daemon" } }
{"isbn13":"978-1847249616","author":"Daniel Suarez", "title":"Daemon","publisher":"Quercus","category":["dystopia","novel"],"pages":448,"price":12.03,"format":"paperback","rating":4.0, "release_year" : "2011", "vector_recommendation" : [4.48, 4.0, 8.7]}
{ "index" : { "_id" : "cryptonomicon" } }
{"isbn13":"978-1847249616","author":"Neal Stephenson", "title":"Cryptonomicon","publisher":"Avon","category":["thriller", "novel" ],"pages":1152,"price":6.99,"format":"paperback","rating":4.0, "release_year" : "2002", "vector_recommendation" : [10.0, 4.0, 9.3]}
{ "index" : { "_id" : "garbage-collection-handbook" } }
{"isbn13":"978-1420082791","author": [ "Richard Jones", "Antony Hosking", "Eliot Moss" ], "title":"The Garbage Collection Handbook: The Art of Automatic Memory Management","publisher":"Taylor & Francis","category":["programming algorithms" ],"pages":511,"price":87.85,"format":"paperback","rating":5.0, "release_year" : "2011", "vector_recommendation" : [5.1, 5.0, 1.3] }
{ "index" : { "_id" : "radical-candor" } }
{"isbn13":"978-1250258403","author": "Kim Scott", "title":"Radical Candor: Be a Kick-Ass Boss Without Losing Your Humanity","publisher":"Macmillan","category":["human resources","management", "new work"],"pages":404,"price":7.29,"format":"paperback","rating":4.0, "release_year" : "2018", "vector_recommendation" : [4.0, 4.0, 9.2] }
The books index now contains 8 documents. Looking closely at the data we indexed, each document contains a field named vector_recommendation, for example:
"vector_recommendation" : [3.5, 4.5, 5.2]
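- The first value in the vector, 3.5, is the document's pages value divided by 100 (350 / 100 = 3.5). The more pages a book has, the larger this value. Its range is 0 to 10.
- The second value is the book's rating. It is the same value as the rating field of the document. Its range is 0 to 5.
- The third value reflects the book's price. The lower the value, the more expensive the book, because we all prefer cheaper books. 0 stands for a price above 100 yuan and 10 for a price below 10 yuan.
In the mapping, the field is defined as follows:
"vector_recommendation": {
  "type": "dense_vector",
  "dims": 3
}
This declares vector_recommendation as a field of type dense_vector holding a 3-dimensional vector.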
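The three components described above can be sketched as a small Python helper. This is only an illustration: the function name recommendation_vector is hypothetical, and the linear price formula (10 - price / 10, clamped to 0 to 10) is an assumption, since the article only states the two ends of the price scale. The vectors in the sample data appear to have been rounded by hand, so the output will be close to them but not always identical.

```python
def recommendation_vector(pages, rating, price):
    """Map a book's pages, rating and price onto a 3-dimensional vector."""
    # Component 1: pages divided by 100, clamped to the 0-10 range.
    pages_score = min(max(pages / 100.0, 0.0), 10.0)
    # Component 2: the rating as-is, clamped to the 0-5 range.
    rating_score = min(max(rating, 0.0), 5.0)
    # Component 3 (assumed formula): cheaper books get a higher value,
    # 0 for a price of 100 or more, approaching 10 as the price nears 0.
    price_score = min(max(10.0 - price / 10.0, 0.0), 10.0)
    return [pages_score, rating_score, price_score]


# "Database Internals": 350 pages, rating 4.5, price 47.28
print(recommendation_vector(350, 4.5, 47.28))
```

For this book the helper yields a vector close to the [3.5, 4.5, 5.2] stored in the document; the small difference in the third component comes from the hand rounding in the sample data.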
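Our data is now ready. Next, let's run a search we care about.
Searching for short, cheap, and highly rated books
We have built our vector model above. So how do we find books that have few pages, are cheap, and are very highly rated? We can use the following search: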
GET books/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, doc['vector_recommendation']) + 1.0",
        "params": {
          "query_vector": [
            1,
            5,
            10
          ]
        }
      }
    }
  }
}
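Here we use script_score. If you are not familiar with it, see my earlier article "Elasticsearch: Customizing search result scores with function_score and script_score" for more details.
In the search above, we use the script: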
cosineSimilarity(params.query_vector, doc['vector_recommendation']) + 1.0
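to compute the score of each document. Cosine similarity ranges from -1 to 1, so we add 1.0 to make sure the final score is never negative.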
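To make the effect of the + 1.0 shift concrete, here is a minimal Python sketch of the same formula outside Elasticsearch. The helper names cosine_similarity and shifted_score are hypothetical and only mirror the script expression; this is not how Elasticsearch implements cosineSimilarity internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def shifted_score(query_vector, doc_vector):
    """Mirror of the script: cosine similarity shifted into the 0-2 range."""
    return cosine_similarity(query_vector, doc_vector) + 1.0


query = [1, 5, 10]
# Vector of the "radical-candor" document from the bulk request.
print(shifted_score(query, [4.0, 4.0, 9.2]))
# A vector pointing in the opposite direction gets a score of about 0.
print(shifted_score(query, [-1, -5, -10]))
```

The first call gives roughly 1.9569, and the second shows the worst case: a cosine of -1 is lifted to about 0 instead of producing a negative score.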
The query vector is passed to the script through params:
"params": {
  "query_vector": [
    1,
    5,
    10
  ]
}
We are looking for a book that is ideally about 100 pages long, because the first component is 1. We also want a highly rated book, because the second component is 5. And we want the cheapest book possible, because the third component is 10. With these requirements, we get the following search results:
"hits" : [
  {
    "_index" : "books",
    "_type" : "_doc",
    "_id" : "radical-candor",
    "_score" : 1.9568613,
    "_source" : {
      "isbn13" : "978-1250258403",
      "author" : "Kim Scott",
      "title" : "Radical Candor: Be a Kick-Ass Boss Without Losing Your Humanity",
      "publisher" : "Macmillan",
      "category" : [
        "human resources",
        "management",
        "new work"
      ],
      "pages" : 404,
      "price" : 7.29,
      "format" : "paperback",
      "rating" : 4.0,
      "release_year" : "2018",
      "vector_recommendation" : [
        4.0,
        4.0,
        9.2
      ]
    }
  },
  {
    "_index" : "books",
    "_type" : "_doc",
    "_id" : "kafka-the-definitive-guide",
    "_score" : 1.9520907,
    "_source" : {
      "isbn13" : "978-1491936160",
      "author" : [
        "Neha Narkhede",
        "Gwen Shapira",
        "Todd Palino"
      ],
      "title" : "Kafka: The Definitive Guide: Real-time data and stream processing at scale",
      "publisher" : "O'Reilly",
      "category" : [
        "databases"
      ],
      "pages" : 297,
      "price" : 37.31,
      "format" : "paperback",
      "rating" : 3.9,
      "release_year" : "2017",
      "vector_recommendation" : [
        2.97,
        3.9,
        6.2
      ]
    }
  },
  {
    "_index" : "books",
    "_type" : "_doc",
    "_id" : "daemon",
    "_score" : 1.9394372,
    "_source" : {
      "isbn13" : "978-1847249616",
      "author" : "Daniel Suarez",
      "title" : "Daemon",
      "publisher" : "Quercus",
      "category" : [
        "dystopia",
        "novel"
      ],
      "pages" : 448,
      "price" : 12.03,
      "format" : "paperback",
      "rating" : 4.0,
      "release_year" : "2011",
      "vector_recommendation" : [
        4.48,
        4.0,
        8.7
      ]
    }
  },
  {
    "_index" : "books",
    "_type" : "_doc",
    "_id" : "effective-java",
    "_score" : 1.9305289,
    "_source" : {
      "isbn13" : "978-1491936160",
      "author" : "Joshua Block",
      "title" : "Effective Java",
      "publisher" : "Addison-Wesley",
      "category" : [
        "programming languages",
        "java"
      ],
      "pages" : 412,
      "price" : 27.91,
      "format" : "paperback",
      "rating" : 4.2,
      "release_year" : "2017",
      "vector_recommendation" : [
        4.12,
        4.2,
        7.2
      ]
    }
  },
  {
    "_index" : "books",
    "_type" : "_doc",
    "_id" : "database-internals",
    "_score" : 1.9005439,
    "_source" : {
      "isbn13" : "978-1492040347",
      "author" : "Alexander Petrov",
      "title" : "Database Internals: A deep-dive into how distributed data systems work",
      "publisher" : "O'Reilly",
      "category" : [
        "databases",
        "information systems"
      ],
      "pages" : 350,
      "price" : 47.28,
      "format" : "paperback",
      "rating" : 4.5,
      "release_year" : "2019",
      "vector_recommendation" : [
        3.5,
        4.5,
        5.2
      ]
    }
  },
  {
    "_index" : "books",
    "_type" : "_doc",
    "_id" : "designing-data-intensive-applications",
    "_score" : 1.8525991,
    "_source" : {
      "isbn13" : "978-1449373320",
      "author" : "Martin Kleppmann",
      "title" : "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems",
      "publisher" : "O'Reilly",
      "category" : [
        "databases"
      ],
      "pages" : 590,
      "price" : 31.06,
      "format" : "paperback",
      "rating" : 4.4,
      "release_year" : "2017",
      "vector_recommendation" : [
        5.9,
        4.4,
        6.8
      ]
    }
  },
  {
    "_index" : "books",
    "_type" : "_doc",
    "_id" : "cryptonomicon",
    "_score" : 1.7700485,
    "_source" : {
      "isbn13" : "978-1847249616",
      "author" : "Neal Stephenson",
      "title" : "Cryptonomicon",
      "publisher" : "Avon",
      "category" : [
        "thriller",
        "novel"
      ],
      "pages" : 1152,
      "price" : 6.99,
      "format" : "paperback",
      "rating" : 4.0,
      "release_year" : "2002",
      "vector_recommendation" : [
        10.0,
        4.0,
        9.3
      ]
    }
  },
  {
    "_index" : "books",
    "_type" : "_doc",
    "_id" : "garbage-collection-handbook",
    "_score" : 1.528916,
    "_source" : {
      "isbn13" : "978-1420082791",
      "author" : [
        "Richard Jones",
        "Antony Hosking",
        "Eliot Moss"
      ],
      "title" : "The Garbage Collection Handbook: The Art of Automatic Memory Management",
      "publisher" : "Taylor & Francis",
      "category" : [
        "programming algorithms"
      ],
      "pages" : 511,
      "price" : 87.85,
      "format" : "paperback",
      "rating" : 5.0,
      "release_year" : "2011",
      "vector_recommendation" : [
        5.1,
        5.0,
        1.3
      ]
    }
  }
]
We can see that "Radical Candor: Be a Kick-Ass Boss Without Losing Your Humanity" is the closest match. Let's look at its vector_recommendation:
"vector_recommendation" : [
  4.0,
  4.0,
  9.2
]
Of all the books, this is the one that best matches our search criteria.
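The ranking can be cross-checked outside Elasticsearch. The following Python sketch recomputes cosine similarity plus 1.0 for the eight vectors from the bulk request and sorts them; the helper name cosine_similarity and the variables books, scores, and ranking are illustrative choices, not part of any Elasticsearch API. Elasticsearch computes scores in single precision, so the last digits may differ slightly.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# vector_recommendation values copied from the bulk request.
books = {
    "database-internals": [3.5, 4.5, 5.2],
    "designing-data-intensive-applications": [5.9, 4.4, 6.8],
    "kafka-the-definitive-guide": [2.97, 3.9, 6.2],
    "effective-java": [4.12, 4.2, 7.2],
    "daemon": [4.48, 4.0, 8.7],
    "cryptonomicon": [10.0, 4.0, 9.3],
    "garbage-collection-handbook": [5.1, 5.0, 1.3],
    "radical-candor": [4.0, 4.0, 9.2],
}
query_vector = [1, 5, 10]

# Same formula as the script: cosine similarity shifted by 1.0.
scores = {
    book_id: cosine_similarity(query_vector, vector) + 1.0
    for book_id, vector in books.items()
}
# Highest score first, like the hits returned by the search.
ranking = sorted(scores, key=scores.get, reverse=True)

for book_id in ranking:
    print(f"{scores[book_id]:.4f}  {book_id}")
```

The order produced by this sketch matches the hits above, with radical-candor first and garbage-collection-handbook last. Note that cosine similarity compares the direction of two vectors rather than their length, so the query vector expresses the desired proportions between pages, rating, and price rather than exact target values.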
References:
[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-score-query.html#vector-functions