Elasticsearch for Ruby on Rails: A Tutorial to the Chewy Gem

Article overview

Why Chewy?
The basic guide to Elasticsearch
Rails integration
Searching
Testing Elasticsearch queries
Conclusion
Appendix: Elasticsearch internals

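Elasticsearch provides a powerful RESTful HTTP interface for indexing and querying data, built on top of the Apache Lucene library. Out of the box, it provides scalable, efficient, and robust search with UTF-8 support. It is a powerful tool for indexing and querying massive amounts of structured data, and here at srcmini it powers our platform search and will soon be used for autocompletion as well. We're huge fans.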
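Since our platform is built with Ruby on Rails, our Elasticsearch integration takes advantage of the elasticsearch-ruby project (a Ruby integration framework for Elasticsearch that provides a client for connecting to an Elasticsearch cluster, a Ruby API for the Elasticsearch REST API, and various extensions and utilities). Building on this foundation, we have developed and released our own improvement (and simplification) of the Elasticsearch application search architecture, packaged as a Ruby gem that we named Chewy (an example app is also available).
Chewy extends the Elasticsearch-Ruby client, making it more powerful and providing tighter integration with Rails. In this Elasticsearch guide, I discuss (through usage examples) how we accomplished this, including the technical obstacles that emerged during implementation.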

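Before proceeding with the guide, just a couple of quick notes:

Both Chewy and a Chewy demo application are available on GitHub.
For those interested in more "under the hood" information about Elasticsearch, I've included a brief write-up as an appendix to this article.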

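Why Chewy?
Despite Elasticsearch's scalability and efficiency, integrating it with Rails didn't turn out to be quite as simple as anticipated. At srcmini, we found ourselves needing to significantly augment the basic Elasticsearch-Ruby client to make it more performant and to support additional operations.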
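And thus, the Chewy gem was born.
A few particularly noteworthy features of Chewy include:

Every index is observable by all the related models.
Most indexed models are related to each other. Sometimes it's necessary to denormalize this related data and bind it to the same object (e.g., if you want to index an array of tags together with their associated article). Chewy allows you to specify an updatable index for every model, so the corresponding articles are reindexed whenever a relevant tag is updated.
Index classes are independent from ORM/ODM models.
With this enhancement, implementing cross-model autocompletion, for example, is much easier. You can just define an index and work with it in an object-oriented fashion. Unlike other clients, the Chewy gem removes the need to manually implement index classes, data import callbacks, and other components.
Bulk import is everywhere.
Chewy uses the bulk Elasticsearch API for full reindexing and for index updates. It also uses the concept of atomic updates, collecting changed objects within an atomic block and updating them all at once.
Chewy provides an AR-style query DSL.
By being chainable, mergeable, and lazy, this enhancement allows queries to be produced in a more efficient manner.

OK, so let's see how this all plays out in the gem...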
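The basic guide to Elasticsearch
Elasticsearch has several document-related concepts. The first is that of an index (the analogue of a database in an RDBMS), which consists of a set of documents that can be of several types (where a type is analogous to an RDBMS table).
Every document has a set of fields. Each field is analyzed independently, and its analysis options are stored in the mapping for its type. Chewy uses this structure "as is" in its object model: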

class EntertainmentIndex < Chewy::Index
  settings analysis: {
    analyzer: {
      title: {
        tokenizer: 'standard',
        filter: ['lowercase', 'asciifolding']
      }
    }
  }

  define_type Book.includes(:author, :tags) do
    field :title, analyzer: 'title'
    field :year, type: 'integer'
    field :author, value: -> { author.name }
    field :author_id, type: 'integer'
    field :description
    field :tags, index: 'not_analyzed', value: -> { tags.map(&:name) }
  end

  {movie: Video.movies, cartoon: Video.cartoons}.each do |type_name, scope|
    define_type scope.includes(:director, :tags), name: type_name do
      field :title, analyzer: 'title'
      field :year, type: 'integer'
      field :author, value: -> { director.name }
      field :author_id, type: 'integer', value: -> { director_id }
      field :description
      field :tags, index: 'not_analyzed', value: -> { tags.map(&:name) }
    end
  end
end

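Above, we defined an Elasticsearch index called entertainment with three types: book, movie, and cartoon. For each type, we defined some field mappings, along with a hash of settings for the whole index.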
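To make the index/type/field structure concrete, here is a rough sketch of the kind of settings-and-mappings body such a definition corresponds to in the Elasticsearch REST API. This is an illustration only: the constant names are hypothetical, the 'string' type for untyped fields is an assumption about older Elasticsearch versions, and the body Chewy actually sends may differ.

```ruby
require 'json'

# Analysis settings copied from the index definition.
SETTINGS = {
  analysis: {
    analyzer: {
      title: { tokenizer: 'standard', filter: %w[lowercase asciifolding] }
    }
  }
}.freeze

# Field mappings shared by all three types.
# ASSUMPTION: fields declared without a type are mapped as 'string'.
PROPERTIES = {
  title:       { type: 'string', analyzer: 'title' },
  year:        { type: 'integer' },
  author:      { type: 'string' },
  author_id:   { type: 'integer' },
  description: { type: 'string' },
  tags:        { type: 'string', index: 'not_analyzed' }
}.freeze

# One mapping per type: book, movie, and cartoon.
MAPPINGS = %i[book movie cartoon].map { |type| [type, { properties: PROPERTIES }] }.to_h

INDEX_BODY = { settings: SETTINGS, mappings: MAPPINGS }.freeze

puts JSON.pretty_generate(INDEX_BODY)
```

The point of the sketch is the nesting: one index carries the analysis settings, each type carries its own mapping, and each field carries its own analysis options.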
So, we've defined the EntertainmentIndex and want to execute some queries. As a first step, we need to create the index and import our data:

EntertainmentIndex.create!
EntertainmentIndex.import
# EntertainmentIndex.reset! (which includes deletion,
# creation, and import) could be used instead

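The .import method knows what data to import because we passed in scopes when we defined our types; thus, it will import all the books, movies, and cartoons stored in the persistent storage.
With that done, we can perform some queries: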

EntertainmentIndex.query(match: {author: 'Tarantino'}).filter{ year > 1990 }
EntertainmentIndex.query(match: {title: 'Shawshank'}).types(:movie)
EntertainmentIndex.query(match: {author: 'Tarantino'}).only(:id).limit(10).load
# the last one loads ActiveRecord objects for documents found

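Now our index is almost ready to be used in our search implementation.
Rails integration
For integration with Rails, the first thing we need is to be able to react to RDBMS object changes. Chewy supports this behavior via callbacks defined with the update_index class method. update_index takes two arguments:

A type identifier supplied in the "index_name#type_name" format
A method name or block to execute, which represents a back-reference to the updated object or object collection

We need to define these callbacks for each dependent model: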

class Book < ActiveRecord::Base
  acts_as_taggable

  belongs_to :author, class_name: 'Dude'
  # We update the book itself on-change
  update_index 'entertainment#book', :self
end

class Video < ActiveRecord::Base
  acts_as_taggable

  belongs_to :director, class_name: 'Dude'
  # Update video types when changed, depending on the category
  update_index('entertainment#movie') { self if movie? }
  update_index('entertainment#cartoon') { self if cartoon? }
end

class Dude < ActiveRecord::Base
  acts_as_taggable

  has_many :books
  has_many :videos
  # If author or director was changed, all the corresponding
  # books, movies and cartoons are updated
  update_index 'entertainment#book', :books
  update_index('entertainment#movie') { videos.movies }
  update_index('entertainment#cartoon') { videos.cartoons }
end

Since tags are also indexed, we next need to monkey patch some external models so that they react to changes:

ActsAsTaggableOn::Tag.class_eval do
  has_many :books, through: :taggings, source: :taggable, source_type: 'Book'
  has_many :videos, through: :taggings, source: :taggable, source_type: 'Video'

  # Updating all tag-related objects
  update_index 'entertainment#book', :books
  update_index('entertainment#movie') { videos.movies }
  update_index('entertainment#cartoon') { videos.cartoons }
end

ActsAsTaggableOn::Tagging.class_eval do
  # Same goes for the intermediate model
  update_index('entertainment#book') { taggable if taggable_type == 'Book' }
  update_index('entertainment#movie') { taggable if taggable_type == 'Video' && taggable.movie? }
  update_index('entertainment#cartoon') { taggable if taggable_type == 'Video' && taggable.cartoon? }
end

At this point, every object save or destroy will update the corresponding Elasticsearch index type.
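To illustrate what "a method name or block as a back-reference" means, here is a toy, plain-Ruby stand-in for this kind of callback registration. It is not Chewy's implementation: ToyUpdateIndex, index_callbacks, and objects_to_reindex are hypothetical names, and the classes are simple objects rather than ActiveRecord models.

```ruby
# Toy stand-in for callback registration; NOT Chewy's real code.
module ToyUpdateIndex
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    def index_callbacks
      @index_callbacks ||= []
    end

    # type_id is "index_name#type_name"; the back-reference is either a
    # method name (:self means the record itself) or a block.
    def update_index(type_id, method_name = nil, &block)
      backref = block || proc { method_name == :self ? self : public_send(method_name) }
      index_callbacks << [type_id, backref]
    end
  end

  # Evaluates every back-reference against this record and returns
  # { type_id => [objects that would be reindexed] }.
  def objects_to_reindex
    self.class.index_callbacks.each_with_object({}) do |(type_id, backref), result|
      objects = [instance_exec(&backref)].flatten.compact
      result[type_id] = objects unless objects.empty?
    end
  end
end

class Book
  include ToyUpdateIndex
  attr_reader :title

  def initialize(title)
    @title = title
  end

  update_index 'entertainment#book', :self
end

class Video
  include ToyUpdateIndex
  attr_reader :title, :category

  def initialize(title, category)
    @title = title
    @category = category
  end

  def movie?
    category == :movie
  end

  def cartoon?
    category == :cartoon
  end

  update_index('entertainment#movie') { self if movie? }
  update_index('entertainment#cartoon') { self if cartoon? }
end

class Dude
  include ToyUpdateIndex
  attr_reader :name, :books, :videos

  def initialize(name, books, videos)
    @name = name
    @books = books
    @videos = videos
  end

  update_index 'entertainment#book', :books
  update_index('entertainment#movie') { videos.select(&:movie?) }
  update_index('entertainment#cartoon') { videos.select(&:cartoon?) }
end

movie = Video.new('Pulp Fiction', :movie)
dude = Dude.new('Quentin Tarantino', [], [movie])
p dude.objects_to_reindex.keys # => ["entertainment#movie"]
```

A back-reference that returns nil or an empty collection simply means "nothing to reindex for this type", which is why the Video blocks can use a trailing `if`.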
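Atomicity
We still have one lingering problem. If we do something like books.map(&:save) to save multiple books, we'll request an update of the entertainment index every time an individual book is saved. Thus, if we save five books, we'll update the Chewy index five times. This behavior is acceptable for a REPL, but certainly not for controller actions in which performance is critical.
We address this issue with the Chewy.atomic block: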

class ApplicationController < ActionController::Base
  around_action { |&block| Chewy.atomic(&block) }
end

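In a nutshell, Chewy.atomic batches these updates as follows:

It disables the after_save callback.
It collects the IDs of the saved books.
On completion of the Chewy.atomic block, it uses the collected IDs to make a single Elasticsearch index update request.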
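The three steps above can be sketched in plain Ruby. This is a toy model of the batching idea under stated assumptions, not Chewy's internals: ToyIndexer and its methods are hypothetical, and each entry in `requests` merely stands for one request that would be sent to Elasticsearch.

```ruby
# Toy illustration of atomic batching; names are hypothetical.
class ToyIndexer
  attr_reader :requests

  def initialize
    @requests = [] # each entry stands for one index update request
    @pending = nil # nil outside an atomic block, a Hash inside one
  end

  # Would be called from a model's save callback.
  def update(type_id, id)
    if @pending
      (@pending[type_id] ||= []) << id # inside a block: only remember the id
    else
      @requests << { type_id => [id] } # outside: one request per save
    end
  end

  # Collects ids while the block runs, then flushes them in one go.
  def atomic
    return yield if @pending # nested blocks join the outer batch

    @pending = {}
    begin
      result = yield
      flush(@pending)
      result
    ensure
      @pending = nil
    end
  end

  private

  # One request per type, carrying all collected ids.
  def flush(batch)
    batch.each { |type_id, ids| @requests << { type_id => ids.uniq } }
  end
end

indexer = ToyIndexer.new
indexer.atomic do
  5.times { |n| indexer.update('entertainment#book', n + 1) }
end
p indexer.requests.size # => 1
```

Without the atomic block, the same five updates would produce five separate requests, which is exactly the behavior the controller-level around_action is meant to avoid.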

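Searching
Now we're ready to implement a search interface. Since our user interface is a form, the best way to build it is, of course, with FormBuilder and ActiveModel. (At srcmini, we use ActiveData to implement ActiveModel interfaces, but feel free to use your favorite gem.)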

class EntertainmentSearch
  include ActiveData::Model

  attribute :query, type: String
  attribute :author_id, type: Integer
  attribute :min_year, type: Integer
  attribute :max_year, type: Integer
  attribute :tags, mode: :arrayed, type: String,
                   normalize: -> (value) { value.reject(&:blank?) }

  # This accessor is for the form. It will have a single text field
  # for comma-separated tag inputs.
  def tag_list= value
    self.tags = value.split(',').map(&:strip)
  end

  def tag_list
    self.tags.join(', ')
  end
end

Query and filters tutorial
Now that we have an ActiveModel-like object that can accept and typecast attributes, let's implement search:

class EntertainmentSearch
  ...

  def index
    EntertainmentIndex
  end

  def search
    # We can merge multiple scopes
    [query_string, author_id_filter, year_filter, tags_filter].compact.reduce(:merge)
  end

  # Using query_string advanced query for the main query input
  def query_string
    index.query(query_string: {fields: [:title, :author, :description],
                               query: query, default_operator: 'and'}) if query?
  end

  # Simple term filter for author id. `:author_id` is already
  # typecasted to integer and ignored if empty.
  def author_id_filter
    index.filter(term: {author_id: author_id}) if author_id?
  end

  # For filtering on years, we will use range filter.
  # Returns nil if both min_year and max_year are not passed to the model.
  def year_filter
    body = {}.tap do |body|
      body.merge!(gte: min_year) if min_year?
      body.merge!(lte: max_year) if max_year?
    end
    index.filter(range: {year: body}) if body.present?
  end

  # Same goes for `author_id_filter`, but `terms` filter used.
  # Returns nil if no tags passed in.
  def tags_filter
    index.filter(terms: {tags: tags}) if tags?
  end
end
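The compact.reduce(:merge) idiom relies on each builder method returning either a scope or nil. Here is a minimal plain-Ruby sketch of that composition, assuming a hypothetical ToyScope that only accumulates criteria (no request is built or sent); it is not the Chewy query class.

```ruby
# Hypothetical immutable scope: it only accumulates criteria.
class ToyScope
  attr_reader :queries, :filters

  def initialize(queries: [], filters: [])
    @queries = queries.freeze
    @filters = filters.freeze
  end

  # Each call returns a new scope, so scopes can be chained safely.
  def query(clause)
    ToyScope.new(queries: queries + [clause], filters: filters)
  end

  def filter(clause)
    ToyScope.new(queries: queries, filters: filters + [clause])
  end

  # Merging concatenates the criteria of both scopes.
  def merge(other)
    ToyScope.new(queries: queries + other.queries, filters: filters + other.filters)
  end
end

# Builds the range body only from the bounds that are present.
def year_range(min_year, max_year)
  body = {}
  body[:gte] = min_year if min_year
  body[:lte] = max_year if max_year
  body
end

# Every element is a scope or nil; compact drops the nils and
# reduce merges what is left into the (empty) base scope.
def build_search(query: nil, author_id: nil, min_year: nil, max_year: nil)
  base = ToyScope.new
  range = year_range(min_year, max_year)
  [
    (base.query(query_string: { query: query }) if query),
    (base.filter(term: { author_id: author_id }) if author_id),
    (base.filter(range: { year: range }) unless range.empty?)
  ].compact.reduce(base, :merge)
end

scope = build_search(query: 'Tarantino', min_year: 1990)
p scope.queries # => [{:query_string=>{:query=>"Tarantino"}}] on Ruby < 3.4
p scope.filters
```

One design difference worth noting: the sketch passes an initial value to reduce, so a search with no attributes still returns an empty scope, whereas reduce(:merge) on an empty array returns nil.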

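Controllers and views
At this point, our model can perform search requests with the passed attributes. Usage will look something like this: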

EntertainmentSearch.new(query: 'Tarantino', min_year: 1990).search

Note that in the controller, we want to load exact ActiveRecord objects instead of Chewy document wrappers:

class EntertainmentController < ApplicationController
  def index
    @search = EntertainmentSearch.new(params[:search])
    # In case we want to load real objects, we don't need any other
    # fields except for `:id` retrieved from Elasticsearch index.
    # Chewy query DSL supports Kaminari gem and corresponding API.
    # Also, we pass scopes for every requested type to the `load` method.
    @entertainments = @search.search.only(:id).page(params[:page]).load(
      book: {scope: Book.includes(:author)},
      movie: {scope: Video.includes(:director)},
      cartoon: {scope: Video.includes(:director)}
    )
  end
end

Now, it's time to write some HAML in entertainment/index.html.haml:

= form_for @search, as: :search, url: entertainment_index_path, method: :get do |f|
  = f.text_field :query
  = f.select :author_id, Dude.all.map { |d| [d.name, d.id] }, include_blank: true
  = f.text_field :min_year
  = f.text_field :max_year
  = f.text_field :tag_list
  = f.submit

- if @entertainments.any?
  %dl
    - @entertainments.each do |entertainment|
      %dt
        %h1= entertainment.title
        %strong= entertainment.class
      %dd
        %p= entertainment.year
        %p= entertainment.description
        %p= entertainment.tag_list
  = paginate @entertainments
- else
  Nothing to see here

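Sorting
As a bonus, we'll also add sorting to our search functionality.
Assume that we need to sort on the title and year fields, as well as by relevance. Unfortunately, the title "One Flew Over the Cuckoo's Nest" will be split into individual terms, so sorting by these disparate terms would be too random; instead, we'd like to sort by the entire title.
The solution is to use a special title field and apply its own analyzer: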

class EntertainmentIndex < Chewy::Index
  settings analysis: {
    analyzer: {
      ...
      sorted: {
        # `keyword` tokenizer will not split our titles and
        # will produce the whole phrase as the term, which
        # can be sorted easily
        tokenizer: 'keyword',
        filter: ['lowercase', 'asciifolding']
      }
    }
  }

  define_type Book.includes(:author, :tags) do
    # We use the `multi_field` type to add `title.sorted` field
    # to the type mapping. Also, will still use just the `title`
    # field for search.
    field :title, type: 'multi_field' do
      field :title, index: 'analyzed', analyzer: 'title'
      field :sorted, index: 'analyzed', analyzer: 'sorted'
    end
    ...
  end

  {movie: Video.movies, cartoon: Video.cartoons}.each do |type_name, scope|
    define_type scope.includes(:director, :tags), name: type_name do
      # For videos as well
      field :title, type: 'multi_field' do
        field :title, index: 'analyzed', analyzer: 'title'
        field :sorted, index: 'analyzed', analyzer: 'sorted'
      end
      ...
    end
  end
end
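To see why the keyword tokenizer helps with sorting, here is a rough plain-Ruby approximation of the two analyzers. It is for illustration only: the method names are hypothetical, the real analysis happens inside Elasticsearch, and the asciifolding filter is left out.

```ruby
# Rough approximation of a standard tokenizer plus lowercase filter:
# the title is split into separate terms.
def standard_like_terms(text)
  text.downcase.scan(/[[:alnum:]']+/)
end

# Rough approximation of a keyword tokenizer plus lowercase filter:
# the whole title stays a single term.
def keyword_like_terms(text)
  [text.downcase]
end

# Sorting by the single keyword-style term orders by the entire title.
def sort_titles(titles)
  titles.sort_by { |title| keyword_like_terms(title).first }
end

title = "One Flew Over the Cuckoo's Nest"
p standard_like_terms(title)
p keyword_like_terms(title)
p sort_titles(['The Shawshank Redemption', title, 'Pulp Fiction'])
```

With the split terms there is no single value to order by, while the keyword-style term gives one sortable value per document, which is the role the title.sorted field plays.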

In addition, we'll add both the new sort attribute and the sort processing step to our search model:

class EntertainmentSearch
  # we are going to use `title.sorted` field for sort
  SORT = {title: {'title.sorted' => :asc}, year: {year: :desc}, relevance: :_score}
  ...
  attribute :sort, type: String, enum: %w(title year relevance), default_blank: 'relevance'
  ...
  def search
    # we have added `sorting` scope to merge list
    [query_string, author_id_filter, year_filter, tags_filter, sorting].compact.reduce(:merge)
  end

  def sorting
    # We have one of the 3 possible values in `sort` attribute
    # and `SORT` mapping returns actual sorting expression
    index.order(SORT[sort.to_sym])
  end
end

Finally, we'll modify our form, adding a selection box for the sort options:

= form_for @search, as: :search, url: entertainment_index_path, method: :get do |f|
  ...
  / `EntertainmentSearch.sort_values` will just return
  / enum option content from the sort attribute definition.
  = f.select :sort, EntertainmentSearch.sort_values
  ...

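Error handling
If your users perform incorrect queries such as ( or AND, the Elasticsearch client will raise an error. To handle this, let's make some changes to our controller: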

class EntertainmentController < ApplicationController
  def index
    @search = EntertainmentSearch.new(params[:search])
    @entertainments = @search.search.only(:id).page(params[:page]).load(
      book: {scope: Book.includes(:author)},
      movie: {scope: Video.includes(:director)},
      cartoon: {scope: Video.includes(:director)}
    )
  rescue Elasticsearch::Transport::Transport::Errors::BadRequest => e
    @entertainments = []
    # Extract the human-readable part of the parsing error, if present
    @error = e.message.match(/QueryParsingException\[([^;]+)\]/).try(:[], 1)
  end
end

Further, we need to render the error in the view:

...
- if @entertainments.any?
  ...
- else
  - if @error
    = @error
  - else
    Nothing to see here
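The regular expression used in the rescue clause can be tried in isolation. In the sketch below, the helper name and the sample error message are made up for illustration (real Elasticsearch messages vary by version), and plain String#[] stands in for the ActiveSupport try call.

```ruby
# Captures the text inside QueryParsingException[...], up to the first ';'.
QUERY_ERROR = /QueryParsingException\[([^;]+)\]/

# Returns the captured message, or nil when the pattern does not match.
def extract_query_error(message)
  message[QUERY_ERROR, 1]
end

# HYPOTHETICAL sample message, shaped like a nested parsing error.
sample = 'QueryParsingException[Failed to parse query [(]]; nested: ParseException'

p extract_query_error(sample)            # => "Failed to parse query [(]"
p extract_query_error('something else')  # => nil
```

Returning nil when nothing matches is what lets the view fall back to the generic "Nothing to see here" message.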

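Testing Elasticsearch queries
The basic testing setup is as follows:

Start the Elasticsearch server.
Clean up and create our indices.
Import our data.
Perform our query.
Cross-reference the result with our expectations.

For step 1, it's convenient to use the test cluster defined in the elasticsearch-extensions gem. Once the gem is installed, just add the following line to your project's Rakefile: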

require 'elasticsearch/extensions/test/cluster/tasks'

Then, you'll get the following Rake tasks:

$ rake -T elasticsearch
rake elasticsearch:start  # Start Elasticsearch cluster for tests
rake elasticsearch:stop   # Stop Elasticsearch cluster for tests

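Elasticsearch and RSpec
First, we need to make sure that our index is updated to stay in sync with our data changes. Luckily, the Chewy gem comes with the helpful update_index RSpec matcher: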

describe EntertainmentIndex do
  # No need to cleanup Elasticsearch as requests are
  # stubbed in case of `update_index` matcher usage.
  describe 'Tag' do
    # We create several books with the same tag
    let(:books) { create_list :book, 2, tag_list: 'tag1' }

    specify do
      # We expect that after modifying the tag name...
      expect do
        # `update_attributes` is an instance method, so we fetch the record first
        ActsAsTaggableOn::Tag.where(name: 'tag1').first.update_attributes(name: 'tag2')
        # ... the corresponding type will be updated with previously-created books.
      end.to update_index('entertainment#book').and_reindex(books, with: {tags: ['tag2']})
    end
  end
end

Next, we need to test that the actual search queries are performed properly and that they return the expected results:

describe EntertainmentSearch do
  # Just defining helpers for simplifying testing
  def search attributes = {}
    EntertainmentSearch.new(attributes).search
  end

  # Import helper as well
  def import *args
    # We are using `import!` here to be sure all the objects are imported
    # correctly before examples run.
    EntertainmentIndex.import! *args
  end

  # Deletes and recreates index before every example
  before { EntertainmentIndex.purge! }

  describe '#min_year, #max_year' do
    let(:book) { create(:book, year: 1925) }
    let(:movie) { create(:movie, year: 1970) }
    let(:cartoon) { create(:cartoon, year: 1995) }
    before { import book: book, movie: movie, cartoon: cartoon }

    # NOTE: The sample code below provides a clear usage example but is not
    # optimized code. Something along the following lines would perform better:
    # `specify { search(min_year: 1970).map(&:id).map(&:to_i)
    #   .should =~ [movie, cartoon].map(&:id) }`
    specify { search(min_year: 1970).load.should =~ [movie, cartoon] }
    specify { search(max_year: 1980).load.should =~ [book, movie] }
    specify { search(min_year: 1970, max_year: 1980).load.should == [movie] }
    specify { search(min_year: 1980, max_year: 1970).should == [] }
  end
end

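Test cluster troubleshooting
Finally, here is a guide for troubleshooting your test cluster:

To start, use an in-memory, one-node cluster. It will be much faster for specs. In our case: TEST_CLUSTER_NODES=1 rake elasticsearch:start
The elasticsearch-extensions test cluster implementation itself has some issues related to the one-node cluster status check (the status is yellow in some cases and never turns green, so the green-status cluster start check fails every time). The issue has been fixed in a fork, and hopefully it will be fixed in the main repository soon.
For each dataset, group your requests in specs (i.e., import your data once and then perform several requests). Elasticsearch takes a long time to warm up and uses a lot of heap memory while importing data, so don't overdo it, especially if you have a lot of specs.
Make sure your machine has sufficient memory, or Elasticsearch will freeze (we needed around 5GB for each testing virtual machine and around 1GB for Elasticsearch itself).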

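Conclusion
Elasticsearch describes itself as "a flexible and powerful open source, distributed, real-time search and analytics engine." It's the gold standard in search technologies.
With Chewy, our Rails developers have packaged these benefits as a simple, easy-to-use, production-quality, open source Ruby gem that provides tight integration with Rails. Elasticsearch and Rails: what an awesome combination!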
Appendix: Elasticsearch internals
Here's a very brief introduction to how Elasticsearch works "under the hood"...
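Elasticsearch is built on Lucene, which itself uses inverted indices as its primary data structure. For example, if we have the strings "the dogs jump high", "jump over the fence", and "the fence was too high", we get the following structure:

"the"    [0, 0], [1, 2], [2, 0]
"dogs"   [0, 1]
"jump"   [0, 2], [1, 0]
"high"   [0, 3], [2, 4]
"over"   [1, 1]
"fence"  [1, 3], [2, 1]
"was"    [2, 2]
"too"    [2, 3]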

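Thus, every term contains both references to the texts and positions within them. Furthermore, we choose to modify our terms (e.g., by removing stop words like "the") and apply phonetic hashing to every term (can you guess the algorithm?):

"DAG"   [0, 1]
"JANP"  [0, 2], [1, 0]
"HAG"   [0, 3], [2, 4]
"OVAR"  [1, 1]
"FANC"  [1, 3], [2, 1]
"W"     [2, 2]
"T"     [2, 3]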

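If we then query for "dog jumps", the query is analyzed in the same way as the source text, becoming "DAG JANP" after hashing ("dog" has the same hash as "dogs", and the same is true of "jumps" and "jump").
We also add some logic between the individual words in the string (based on configuration settings), choosing between ("DAG" AND "JANP") and ("DAG" OR "JANP"). The former returns the intersection [0] & [0, 1] (i.e., document 0), and the latter returns the union [0] | [0, 1] (i.e., documents 0 and 1). The in-text positions can be used for scoring results and for position-dependent queries.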
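As a minimal sketch of the structure described in this appendix, the following plain Ruby builds a positional inverted index for the three example strings and combines posting lists with AND and OR. It is a simplification under stated assumptions: the function names are hypothetical, terms are the raw words (no stop-word removal and no phonetic hashing), and nothing here reflects Lucene's actual storage format.

```ruby
DOCUMENTS = [
  'the dogs jump high',
  'jump over the fence',
  'the fence was too high'
].freeze

# Builds term => [[document_id, position], ...].
def build_inverted_index(documents)
  index = Hash.new { |hash, term| hash[term] = [] }
  documents.each_with_index do |text, doc_id|
    text.split.each_with_index do |term, position|
      index[term] << [doc_id, position]
    end
  end
  index
end

# Returns the ids of the documents that contain the term.
def documents_for(index, term)
  index.fetch(term, []).map(&:first).uniq
end

# :and intersects the per-term document lists, :or unites them.
def search(index, terms, operator: :and)
  lists = terms.map { |term| documents_for(index, term) }
  return [] if lists.empty?

  operator == :and ? lists.reduce(:&) : lists.reduce(:|)
end

INDEX = build_inverted_index(DOCUMENTS)

p INDEX['the']                                 # => [[0, 0], [1, 2], [2, 0]]
p search(INDEX, %w[dogs jump], operator: :and) # => [0]
p search(INDEX, %w[dogs jump], operator: :or)  # => [0, 1]
```

Since the sketch skips analysis, the query has to use the exact indexed words ("dogs", "jump"); the hashing step described above is what would let "dog jumps" match the same postings.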