Java | Word Cloud Statistics with ElasticSearch in Java

This post uses ElasticSearch's tokenization (analysis) and aggregation features to count keyword frequencies in text for a word cloud.
It focuses on tokenizing news posts from Weibo and counting word frequencies, from which a word cloud is finally generated. The code is as follows:

public List<String> wordCloudCount(Class<?> clazz, String keywords) {
    // Full-text query for the given keywords
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    boolQuery.must(QueryBuilders.queryStringQuery(keywords));
    // Terms aggregation over the analyzed "content" field: keep the top 30 tokens
    TermsAggregationBuilder builder = AggregationBuilders.terms("word_count").field("content").size(30);
    // Read the index name and type from the entity's @Document annotation
    Document document = (Document) clazz.getAnnotation(Document.class);
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withIndices(document.indexName())
            .withTypes(document.type())
            .withQuery(boolQuery)
            .addAggregation(builder)
            .build();
    Aggregations aggregation = elasticsearchTemplate.query(searchQuery, new ResultsExtractor<Aggregations>() {
        @Override
        public Aggregations extract(SearchResponse searchResponse) {
            return searchResponse.getAggregations();
        }
    });
    // Collect the bucket keys (the tokens), ordered by document count
    StringTerms typeTerm = (StringTerms) aggregation.asMap().get("word_count");
    List<StringTerms.Bucket> bucketList = typeTerm.getBuckets();
    LinkedList<String> wordList = new LinkedList<>();
    for (StringTerms.Bucket bucket : bucketList) {
        wordList.add(bucket.getKeyAsString());
    }
    // Drop stop words, listed one per line in stopwords.txt
    try (BufferedReader bufferedReader = new BufferedReader(new FileReader("stopwords.txt"))) {
        List<String> stopWords = new ArrayList<>();
        String readline;
        while ((readline = bufferedReader.readLine()) != null) {
            stopWords.add(readline);
        }
        wordList.removeIf(stopWords::contains);
    } catch (IOException e) {
        log.info("Failed to read stop words");
    }
    return wordList;
}
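Note that the method above keeps only each bucket's term and discards its document count, so the word cloud loses the sizing information. If sizes are needed, each bucket's key and count can be collected into an ordered map instead. A minimal pure-Java sketch of that data shape (no ElasticSearch dependency; the class name, helper, and sample values are made up for illustration, mirroring what `StringTerms.Bucket.getKeyAsString()` and `getDocCount()` would supply):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordCloudShape {

    // Hypothetical helper: pair each term with its document count in an
    // insertion-ordered map. Assumes terms and counts have equal length
    // and arrive already sorted by count, as ES returns buckets.
    static Map<String, Long> toFrequencyMap(String[] terms, long[] counts) {
        Map<String, Long> freq = new LinkedHashMap<>();
        for (int i = 0; i < terms.length; i++) {
            freq.put(terms[i], counts[i]);
        }
        return freq;
    }

    public static void main(String[] args) {
        // Made-up sample buckets, ordered by count
        Map<String, Long> freq = toFrequencyMap(
                new String[]{"新闻", "微博", "热点"},
                new long[]{42L, 17L, 9L});
        System.out.println(freq);
    }
}
```

A map like this can be handed directly to most front-end word-cloud libraries, which expect word/weight pairs.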

Tokenization and aggregation are performed through ES's template (the ElasticsearchTemplate). One point worth emphasizing: after tokenization, stop words must be filtered out, i.e. the words listed in stopwords.txt; the method then returns the most frequent keywords.
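The stop-word filtering step can be isolated from the ES query and written with try-with-resources so the file handle is always closed. A standalone sketch of just that step (the class and method names are assumptions; the stop-word file path is passed in by the caller):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StopWordFilter {

    // Remove every word that appears in the stop-word file (one word per
    // line). Leaves the list unchanged if the file cannot be read.
    static List<String> filter(List<String> words, Path stopwordFile) {
        try {
            List<String> stopWords = Files.readAllLines(stopwordFile);
            words.removeIf(stopWords::contains);
        } catch (IOException e) {
            System.err.println("Failed to read stop words: " + e.getMessage());
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Demo with a temporary stop-word file
        Path tmp = Files.createTempFile("stopwords", ".txt");
        Files.write(tmp, Arrays.asList("的", "是"));
        List<String> words = new ArrayList<>(Arrays.asList("新闻", "的", "热点"));
        System.out.println(filter(words, tmp));
    }
}
```

Using `Files.readAllLines` keeps the loading code short, and returning the list unfiltered on an I/O error matches the original method's behavior of logging and continuing.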
