从查询重写角度理解elasticsearch的高亮原理

一、高亮的一些问题
elasticsearch提供了三种高亮方式,前面我们已经简单的了解了elasticsearch的高亮原理; 高亮处理跟实际使用查询类型有十分紧密的关系,其中主要的一点就是muti term 查询的重写,例如wildcard、prefix等,由于查询本身和高亮都涉及到查询语句的重写,如果两者之间的重写机制不同,那么就可能会碰到以下情况
相同的查询语句, 使用unified和fvh得到的高亮结果是不同的,甚至fvh Highlighter无任何高亮信息返回;
二、数据环境
elasticsearch 8.0

PUT highlight_test { "mappings": { "properties": { "text":{ "type": "text", "term_vector": "with_positions_offsets" } } }, "settings": { "number_of_replicas":0, "number_of_shards": 1 } }PUT highlight_test/_doc/1 { "name":"mango", "text":"my name is mongo, i am test hightlight in elastic search" }

三、muti term查询重写简介
所谓muti term查询就是查询中并不是明确的关键字,而是需要elasticsearch重写查询语句,进一步明确关键字;以下查询会涉及到muti term查询重写;
fuzzy prefix query_string regexp wildcard

以上查询都支持rewrite参数,最终将查询重写为bool查询或者bitset;
查询重写主要影响以下几方面
重写需要抓取哪些关键字以及抓取的数量;
抓取关键字的相关性计算方式;
查询重写支持以下参数选项
constant_score,默认值,如果需要抓取的关键字比较少,则重写为bool查询,否则抓取所有的关键字并重写为bitset;直接使用boost参数作为文档score,一般term level的查询的boost默认值为1;
constant_score_boolean,将查询重写为bool查询,并使用boost参数作为文档的score,受到indices.query.bool.max_clause_count 限制,所以默认最多抓取1024个关键字;
scoring_boolean,将查询重写为bool查询,并计算文档的相对权重,受到indices.query.bool.max_clause_count 限制,所以默认最多抓取1024个关键字;
top_terms_blended_freqs_N,抓取得分最高的前N个关键字,并将查询重写为bool查询;此选项不受indices.query.bool.max_clause_count 限制;选择命中文档的所有关键字中权重最大的作为文档的score;
top_terms_boost_N,抓取得分最高的前N个关键字,并将查询重写为bool查询;此选项不受indices.query.bool.max_clause_count 限制;直接使用boost作为文档的score;
top_terms_N,抓取得分最高的前N个关键字,并将查询重写为bool查询;此选项不受indices.query.bool.max_clause_count 限制;计算命中文档的相对权重作为评分;
三、wildcard查询重写分析
我们通过elasticsearch来查看一下以下查询语句的重写逻辑;
{ "query":{ "wildcard":{ "text":{ "value":"m*" } } } }

通过查询使用的字段映射类型构建WildCardQuery,并使用查询语句中配置的rewrite对应的MultiTermQuery.RewriteMethod;
//WildcardQueryBuilder.java @Override protected Query doToQuery(SearchExecutionContext context) throws IOException { MappedFieldType fieldType = context.getFieldType(fieldName); if (fieldType == null) { throw new IllegalStateException("Rewrite first"); }MultiTermQuery.RewriteMethod method = QueryParsers.parseRewriteMethod(rewrite, null, LoggingDeprecationHandler.INSTANCE); return fieldType.wildcardQuery(value, method, caseInsensitive, context); }

根据查询语句中配置的rewrite,查找对应的MultiTermQuery.RewriteMethod,由于我们没有在wildcard查询语句中设置rewrite参数,这里直接返回null;
//QueryParsers.java public static MultiTermQuery.RewriteMethod parseRewriteMethod( @Nullable String rewriteMethod, @Nullable MultiTermQuery.RewriteMethod defaultRewriteMethod, DeprecationHandler deprecationHandler ) { if (rewriteMethod == null) { return defaultRewriteMethod; } if (CONSTANT_SCORE.match(rewriteMethod, deprecationHandler)) { return MultiTermQuery.CONSTANT_SCORE_REWRITE; } if (SCORING_BOOLEAN.match(rewriteMethod, deprecationHandler)) { return MultiTermQuery.SCORING_BOOLEAN_REWRITE; } if (CONSTANT_SCORE_BOOLEAN.match(rewriteMethod, deprecationHandler)) { return MultiTermQuery.CONSTANT_SCORE_BOOLEAN_REWRITE; }int firstDigit = -1; for (int i = 0; i < rewriteMethod.length(); ++i) { if (Character.isDigit(rewriteMethod.charAt(i))) { firstDigit = i; break; } }if (firstDigit >= 0) { final int size = Integer.parseInt(rewriteMethod.substring(firstDigit)); String rewriteMethodName = rewriteMethod.substring(0, firstDigit); if (TOP_TERMS.match(rewriteMethodName, deprecationHandler)) { return new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(size); } if (TOP_TERMS_BOOST.match(rewriteMethodName, deprecationHandler)) { return new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(size); } if (TOP_TERMS_BLENDED_FREQS.match(rewriteMethodName, deprecationHandler)) { return new MultiTermQuery.TopTermsBlendedFreqScoringRewrite(size); } }throw new IllegalArgumentException("Failed to parse rewrite_method [" + rewriteMethod + "]"); } }

WildCardQuery继承MultiTermQuery,直接调用rewrite方法进行重写,由于我们没有在wildcard查询语句中设置rewrite参数,这里直接使用默认的CONSTANT_SCORE_REWRITE;
//MultiTermQuery.java protected RewriteMethod rewriteMethod = CONSTANT_SCORE_REWRITE; @Override public final Query rewrite(IndexReader reader) throws IOException { return rewriteMethod.rewrite(reader, this); }

可以看到CONSTANT_SCORE_REWRITE是直接使用的匿名类,rewrite方法返回的是MultiTermQueryConstantScoreWrapper的实例;
//MultiTermQuery.java public static final RewriteMethod CONSTANT_SCORE_REWRITE = new RewriteMethod() { @Override public Query rewrite(IndexReader reader, MultiTermQuery query) { return new MultiTermQueryConstantScoreWrapper<>(query); } };

在以下方法中,首先会得到查询字段对应的所有term集合;
然后通过 query.getTermsEnum获取跟查询匹配的所有term集合;
最后根据collectTerms调用的返回值决定是否构建bool查询还是bit set;
//MultiTermQueryConstantScoreWrapper.java private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException { final Terms terms = context.reader().terms(query.field); if (terms == null) { // field does not exist return new WeightOrDocIdSet((DocIdSet) null); }final TermsEnum termsEnum = query.getTermsEnum(terms); assert termsEnum != null; PostingsEnum docs = null; final List collectedTerms = new ArrayList<>(); if (collectTerms(context, termsEnum, collectedTerms)) { // build a boolean query BooleanQuery.Builder bq = new BooleanQuery.Builder(); for (TermAndState t : collectedTerms) { final TermStates termStates = new TermStates(searcher.getTopReaderContext()); termStates.register(t.state, context.ord, t.docFreq, t.totalTermFreq); bq.add(new TermQuery(new Term(query.field, t.term), termStates), Occur.SHOULD); } Query q = new ConstantScoreQuery(bq.build()); final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score()); return new WeightOrDocIdSet(weight); }// Too many terms: go back to the terms we already collected and start building the bit set DocIdSetBuilder builder = new DocIdSetBuilder(context.reader().maxDoc(), terms); if (collectedTerms.isEmpty() == false) { TermsEnum termsEnum2 = terms.iterator(); for (TermAndState t : collectedTerms) { termsEnum2.seekExact(t.term, t.state); docs = termsEnum2.postings(docs, PostingsEnum.NONE); builder.add(docs); } }// Then keep filling the bit set with remaining terms do { docs = termsEnum.postings(docs, PostingsEnum.NONE); builder.add(docs); } while (termsEnum.next() != null); return new WeightOrDocIdSet(builder.build()); }

调用collectTerms默认只会提取查询命中的16个关键字;
//MultiTermQueryConstantScoreWrapper.java private static final int BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD = 16; private boolean collectTerms( LeafReaderContext context, TermsEnum termsEnum, List terms) throws IOException { final int threshold = Math.min(BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD, IndexSearcher.getMaxClauseCount()); for (int i = 0; i < threshold; ++i) { final BytesRef term = termsEnum.next(); if (term == null) { return true; } TermState state = termsEnum.termState(); terms.add( new TermAndState( BytesRef.deepCopyOf(term), state, termsEnum.docFreq(), termsEnum.totalTermFreq())); } return termsEnum.next() == null; }

通过以上分析wildcard查询默认情况下,会提取字段中所有命中查询的关键字;
四、fvh Highlighter中wildcard的查询重写
在muti term query中,提取查询关键字是高亮逻辑一个很重要的步骤;
我们使用以下高亮语句,分析以下高亮中提取查询关键字过程中的查询重写;
{ "query":{ "wildcard":{ "text":{ "value":"m*" } } }, "highlight":{ "fields":{ "text":{ "type":"fvh" } } } }

默认情况下只有匹配的字段才会进行高亮,这里构建CustomFieldQuery;
//FastVectorHighlighter.java if (field.fieldOptions().requireFieldMatch()) { /* * we use top level reader to rewrite the query against all readers, * with use caching it across hits (and across readers...) */ entry.fieldMatchFieldQuery = new CustomFieldQuery( fieldContext.query, hitContext.topLevelReader(), true, field.fieldOptions().requireFieldMatch() ); }

通过调用flatten方法得到重写之后的flatQueries,然后将每个提取的关键字重写为BoostQuery;
//FieldQuery.java public FieldQuery(Query query, IndexReader reader, boolean phraseHighlight, boolean fieldMatch) throws IOException { this.fieldMatch = fieldMatch; Set flatQueries = new LinkedHashSet<>(); flatten(query, reader, flatQueries, 1f); saveTerms(flatQueries, reader); Collection expandQueries = expand(flatQueries); for (Query flatQuery : expandQueries) { QueryPhraseMap rootMap = getRootMap(flatQuery); rootMap.add(flatQuery, reader); float boost = 1f; while (flatQuery instanceof BoostQuery) { BoostQuery bq = (BoostQuery) flatQuery; flatQuery = bq.getQuery(); boost *= bq.getBoost(); } if (!phraseHighlight && flatQuery instanceof PhraseQuery) { PhraseQuery pq = (PhraseQuery) flatQuery; if (pq.getTerms().length > 1) { for (Term term : pq.getTerms()) rootMap.addTerm(term, boost); } } } }

由于WildCardQuery是MultiTermQuery的子类,所以在flatten方法中最终直接使用MultiTermQuery.TopTermsScoringBooleanQueryRewrite进行查询重写,这里的top N是MAX_MTQ_TERMS = 1024;
//FieldQuery.javaprivate static final int MAX_MTQ_TERMS = 1024; protected void flatten( Query sourceQuery, IndexReader reader, Collection flatQueries, float boost) throws IOException {.................................. ..................................else if (reader != null) { Query query = sourceQuery; Query rewritten; if (sourceQuery instanceof MultiTermQuery) { rewritten = new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(MAX_MTQ_TERMS) .rewrite(reader, (MultiTermQuery) query); } else { rewritten = query.rewrite(reader); } if (rewritten != query) { // only rewrite once and then flatten again - the rewritten query could have a speacial // treatment // if this method is overwritten in a subclass. flatten(rewritten, reader, flatQueries, boost); } // if the query is already rewritten we discard it } // else discard queries }

这里首先计算设置的size和getMaxSize(默认值1024, IndexSearcher.getMaxClauseCount())计算最终提取的命中关键字数量,这里最终是1024个;
【从查询重写角度理解elasticsearch的高亮原理】这里省略了传入collectTerms的TermCollector匿名子类的实现,其余最终提取关键字数量有关;
//FieldQuery.java@Override public final Query rewrite(final IndexReader reader, final MultiTermQuery query) throws IOException { final int maxSize = Math.min(size, getMaxSize()); final PriorityQueue stQueue = new PriorityQueue<>(); collectTerms( reader, query, new TermCollector() {................}); ............. return build(b); }

这里首先获取查询字段对应的所有term集合,然后获取所有的与查询匹配的term集合,最终通过传入的collector提取关键字;
//TermCollectingRewrite.java final void collectTerms(IndexReader reader, MultiTermQuery query, TermCollector collector) throws IOException { IndexReaderContext topReaderContext = reader.getContext(); for (LeafReaderContext context : topReaderContext.leaves()) { final Terms terms = context.reader().terms(query.field); if (terms == null) { // field does not exist continue; }final TermsEnum termsEnum = getTermsEnum(query, terms, collector.attributes); assert termsEnum != null; if (termsEnum == TermsEnum.EMPTY) continue; collector.setReaderContext(topReaderContext, context); collector.setNextEnum(termsEnum); BytesRef bytes; while ((bytes = termsEnum.next()) != null) { if (!collector.collect(bytes)) return; // interrupt whole term collection, so also don't iterate other subReaders } } }

这里通过控制最终提取匹配查询的关键字的数量不超过maxSize;
//TopTermsRewrite.java @Override public boolean collect(BytesRef bytes) throws IOException { final float boost = boostAtt.getBoost(); // make sure within a single seg we always collect // terms in order assert compareToLastTerm(bytes); // System.out.println("TTR.collect term=" + bytes.utf8ToString() + " boost=" + boost + " // ord=" + readerContext.ord); // ignore uncompetitive hits if (stQueue.size() == maxSize) { final ScoreTerm t = stQueue.peek(); if (boost < t.boost) return true; if (boost == t.boost && bytes.compareTo(t.bytes.get()) > 0) return true; } ScoreTerm t = visitedTerms.get(bytes); final TermState state = termsEnum.termState(); assert state != null; if (t != null) { // if the term is already in the PQ, only update docFreq of term in PQ assert t.boost == boost : "boost should be equal in all segment TermsEnums"; t.termState.register( state, readerContext.ord, termsEnum.docFreq(), termsEnum.totalTermFreq()); } else { // add new entry in PQ, we must clone the term, else it may get overwritten! st.bytes.copyBytes(bytes); st.boost = boost; visitedTerms.put(st.bytes.get(), st); assert st.termState.docFreq() == 0; st.termState.register( state, readerContext.ord, termsEnum.docFreq(), termsEnum.totalTermFreq()); stQueue.offer(st); // possibly drop entries from queue if (stQueue.size() > maxSize) { st = stQueue.poll(); visitedTerms.remove(st.bytes.get()); st.termState.clear(); // reset the termstate! } else { st = new ScoreTerm(new TermStates(topReaderContext)); } assert stQueue.size() <= maxSize : "the PQ size must be limited to maxSize"; // set maxBoostAtt with values to help FuzzyTermsEnum to optimize if (stQueue.size() == maxSize) { t = stQueue.peek(); maxBoostAtt.setMaxNonCompetitiveBoost(t.boost); maxBoostAtt.setCompetitiveTerm(t.bytes.get()); } }return true; }

通过以上分析可以看到,fvh Highlighter对multi term query的重写,直接使用MultiTermQuery.TopTermsScoringBooleanQueryRewrite,并限制只能最多提取查询关键字1024个;
五、重写可能导致的高亮问题原因分析
经过以上对查询和高亮的重写过程分析可以知道,默认情况下
query阶段提取的是命中查询的所有的关键字,具体行为可以通过rewrite参数进行定制;
Highlight阶段提取的是命中查询的关键字中的前1024个,具体行为不受rewrite参数的控制;
如果查询的字段是大文本字段,导致字段的关键字很多,就可能会出现查询命中的文档的关键字不在前1024个里边,从而导致明明匹配了文档,但是却没有返回高亮信息;
六、解决方案
  1. 进一步明确查询关键字,减少查询命中的关键字的数量,例如输入更多的字符,;
  2. 使用其他类型的查询替换multi term query;

    推荐阅读