Using Elasticsearch's more_like_this to implement content-based relevance recommendation

A while back I built a news recommendation feature for the Du app: content-based relevance recommendation. The main steps were roughly as follows:
1 Tokenize the title and content and extract keywords (TF-IDF). Title terms carry more weight than content terms; the weighting rules are as follows:

private double getWeight(Term term, int length, int titleLength) {
    // Ignore single-character terms
    if (term.getName().trim().length() < 2) {
        return 0;
    }
    // Look up a score for the term's part of speech
    // (natrue() is the actual, misspelled method name in the ansj API)
    String pos = term.natrue().natureStr;
    Double posScore = POS_SCORE.get(pos);
    if (posScore == null) {
        posScore = 1.0;
    } else if (posScore == 0) {
        return 0;
    }
    // If term.getOffe() is less than titleLength, the term occurs in the title;
    // title terms count at 5x their part-of-speech weight.
    if (titleLength > term.getOffe()) {
        return 5 * posScore;
    }
    // Otherwise the term occurs in the content; its weight is discounted by position:
    // the later the term appears in the content, the lower its score.
    return (length - term.getOffe()) * posScore / (double) length;
}
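For context, a minimal sketch of how getWeight might be driven. The POS_SCORE map comes from the snippet above; that ToAnalysis.parse yields an iterable of Term is an assumption that holds for ansj versions of that era:

import java.util.HashMap;
import java.util.Map;

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;

// Hypothetical driver: the title is concatenated in front of the content, so any
// term whose offset falls below title.length() is a title term and gets the 5x boost.
Map<String, Double> extractKeywords(String title, String content) {
    String text = title + content;
    Map<String, Double> scores = new HashMap<>();
    for (Term term : ToAnalysis.parse(text)) {
        double w = getWeight(term, text.length(), title.length());
        if (w > 0) {
            // Accumulate when the same keyword appears more than once
            scores.merge(term.getName(), w, Double::sum);
        }
    }
    return scores;
}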


2 Map each article to a sparse keyword vector: hash each keyword onto one of the N dimensions of the vector, storing the keyword's computed score in that dimension. All the article vectors together form a sparse matrix; Mahout provides SparseRowMatrix for this. A minimal sketch of the hashing step follows.
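This sketch assumes the keyword scores from step 1 and Mahout's math module; the vector size n and the hash scheme are illustrative. The resulting vectors would be the rows of the SparseRowMatrix:

import java.util.Map;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Hashing trick: each keyword is hashed onto one of n dimensions and its score
// accumulated there. Collisions are tolerated; a larger n means fewer collisions.
Vector toSparseVector(Map<String, Double> keywordScores, int n) {
    Vector v = new RandomAccessSparseVector(n);
    for (Map.Entry<String, Double> e : keywordScores.entrySet()) {
        int dim = Math.floorMod(e.getKey().hashCode(), n);
        v.set(dim, v.get(dim) + e.getValue());
    }
    return v;
}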
3 Compute the cosine similarity between every pair of vectors (formula below), then take the top N most similar articles as the recommendation results.
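For reference, the cosine similarity between two article vectors A and B is:

cos(A, B) = (A · B) / (‖A‖ ‖B‖) = Σᵢ AᵢBᵢ / (√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²))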


This did deliver content-similarity recommendations, but the process was fairly involved: very little of the recommendation model was tunable, a custom dictionary had to be maintained to improve tokenization and keyword extraction, and the computation had to be deployed separately just to serve results.


Recently, while reading the Elasticsearch docs, I came across the more_like_this query (https://www.elastic.co/guide/en/elasticsearch/reference/1.7/query-dsl-mlt-query.html), which also implements content-based relevance recommendation, so I gave it a try right away.
First I created an index with a few analyzers configured:

PUT /du_app_v1
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0,
    "analysis": {
      "tokenizer": {
        "comma": { "pattern": ",", "type": "pattern" }
      },
      "analyzer": {
        "comma": { "type": "custom", "tokenizer": "comma" },
        "index_ansj": {
          "type": "custom",
          "char_filter": "html_strip",
          "tokenizer": "ansj_index_token"
        },
        "query_ansj": {
          "type": "custom",
          "char_filter": "html_strip",
          "tokenizer": "ansj_query_token"
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "_all": { "enabled": false }
    }
  }
}
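To sanity-check that the ansj analyzer is wired up, the _analyze API can be pointed at the index (the sample text is arbitrary):

GET /du_app_v1/_analyze?analyzer=index_ansj&text=这是一段测试文本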

Then, based on the fields the original recommendation logic needed, I created a news type:

PUT /du_app_v1/_mapping/news
{
  "properties": {
    "title": {
      "type": "string",
      "term_vector": "with_positions_offsets",
      "index_analyzer": "index_ansj",
      "search_analyzer": "query_ansj"
    },
    "content": {
      "type": "string",
      "term_vector": "with_positions_offsets",
      "index_analyzer": "index_ansj",
      "search_analyzer": "query_ansj"
    },
    "source": {
      "type": "string",
      "term_vector": "with_positions_offsets",
      "index": "not_analyzed"
    },
    "cate": {
      "type": "string",
      "term_vector": "with_positions_offsets",
      "index": "not_analyzed"
    },
    "addTime": {
      "type": "date",
      "format": "yyyy-MM-dd HH:mm:ss"
    }
  }
}
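An example document matching this mapping (all field values are made up):

PUT /du_app_v1/news/1280
{
  "newsId": "1280",
  "title": "Sample news title",
  "content": "Sample news body ...",
  "source": "some-source",
  "cate": "some-category",
  "addTime": "2015-08-01 12:00:00"
}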


After importing the data, a more_like_this query can be used directly for content recommendation:

GET /du_app_v1/news/_search
{
  "from": 0,
  "size": 3,
  "_source": ["newsId", "addTime", "title"],
  "query": {
    "more_like_this": {
      "fields": ["title", "content", "cate", "source"],
      "ids": ["1280"]
    }
  }
}
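Per the 1.7 docs, the MLT query also accepts a docs array instead of ids, which spells out index, type, and id explicitly; the following should be equivalent to the query above. Note that with ids or docs, the input document itself is excluded from the results unless include is set to true:

GET /du_app_v1/news/_search
{
  "from": 0,
  "size": 3,
  "_source": ["newsId", "addTime", "title"],
  "query": {
    "more_like_this": {
      "fields": ["title", "content", "cate", "source"],
      "docs": [
        { "_index": "du_app_v1", "_type": "news", "_id": "1280" }
      ]
    }
  }
}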


On top of that, the rescore feature can apply a time-decay penalty so that more recent articles get a correspondingly higher score:

GET /du_app_v1/news/_search
{
  "from": 0,
  "size": 3,
  "_source": ["newsId", "addTime", "title"],
  "query": {
    "more_like_this": {
      "fields": ["title", "content", "cate", "source"],
      "ids": ["1280"]
    }
  },
  "rescore": {
    "window_size": 10,
    "query": {
      "score_mode": "multiply",
      "rescore_query": {
        "function_score": {
          "script_score": {
            "script": "(1.0/(1.0+(DateTime.now().getMillis() - doc['addTime'].date.getMillis())/86400000/30.0))"
          }
        }
      }
    }
  }
}
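The script multiplies the MLT score by 1 / (1 + ageInDays / 30): a fresh article keeps roughly its full score, a 30-day-old one is halved (1 / (1 + 1) = 0.5), and a 90-day-old one drops to a quarter (1 / (1 + 3) = 0.25). Only the top window_size (here 10) hits from each shard are rescored.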


I tried a few articles and the relevance of the results was quite good, noticeably better than what I had built myself, and the relevance calculation exposes plenty of parameters that can be tuned as needed.
One thing to note: store term vectors for the fields ("term_vector": "with_positions_offsets"). This trades space for time and makes queries much faster; without stored term vectors the query still works, but the term vectors are regenerated on every request. Also strip HTML tags and remove stop words so the generated term vectors are more accurate. Since MLT is likewise TF-IDF based, the parameters controlling term frequency, document frequency, and so on are quite useful and can be tuned to your data, though the defaults are usually sufficient (see the example query after the list):

max_query_terms
The maximum number of query terms that will be selected. Increasing this value gives greater accuracy at the expense of query execution speed. Defaults to 25.
min_term_freq
The minimum term frequency below which the terms will be ignored from the input document. Defaults to 2.
min_doc_freq
The minimum document frequency below which the terms will be ignored from the input document. Defaults to 5.
max_doc_freq
The maximum document frequency above which the terms will be ignored from the input document. This could be useful in order to ignore highly frequent words such as stop words. Defaults to unbounded (0).
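For example, a variant of the earlier query with these knobs loosened (the values here are illustrative, not recommendations):

GET /du_app_v1/news/_search
{
  "from": 0,
  "size": 3,
  "_source": ["newsId", "addTime", "title"],
  "query": {
    "more_like_this": {
      "fields": ["title", "content", "cate", "source"],
      "ids": ["1280"],
      "max_query_terms": 50,
      "min_term_freq": 1,
      "min_doc_freq": 2
    }
  }
}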

Note that min_term_freq defaults to 2, meaning a term must occur at least twice in the input document or it is ignored. If you test by typing in some arbitrary text, chances are no word appears twice, and this default will silently filter everything out, so it can make sense to lower it to 1.
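For instance, when testing with free-form text via like_text (supported in 1.7) instead of ids, lowering min_term_freq avoids empty results:

GET /du_app_v1/news/_search
{
  "size": 3,
  "query": {
    "more_like_this": {
      "fields": ["title", "content"],
      "like_text": "这是一段随便输入的测试文本",
      "min_term_freq": 1
    }
  }
}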
