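Java | Word Cloud Statistics with Java and ElasticSearch
ElasticSearch's tokenization and aggregation features can be used to compute word cloud statistics for the keywords in a body of text.
This article tokenizes news posts from Weibo, counts how often each term occurs, and uses the result to build a word cloud. The code is as follows: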
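// Assumes Spring Data Elasticsearch 3.x on Elasticsearch 5.x/6.x (withTypes, StringTerms).
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.Aggregations;
import org.elasticsearch.search.aggregations.bucket.terms.StringTerms;
import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregationBuilder;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;
import org.springframework.data.elasticsearch.core.query.SearchQuery;

// elasticsearchTemplate and log are fields of the enclosing class (not shown).
public List<String> wordCloudCount(Class<?> clazz, String keywords) {
    // Restrict the aggregation to documents that match the keywords.
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    boolQuery.must(QueryBuilders.queryStringQuery(keywords));
    // Top 30 terms of the analyzed "content" field, most frequent first.
    TermsAggregationBuilder builder = AggregationBuilders.terms("word_count").field("content").size(30);
    // The index name and type come from the entity's @Document annotation.
    Document document = clazz.getAnnotation(Document.class);
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withIndices(document.indexName())
            .withTypes(document.type())
            .withQuery(boolQuery)
            .addAggregation(builder)
            .build();
    Aggregations aggregations = elasticsearchTemplate.query(searchQuery, SearchResponse::getAggregations);
    StringTerms wordCount = (StringTerms) aggregations.asMap().get("word_count");
    List<String> wordList = new ArrayList<>();
    for (StringTerms.Bucket bucket : wordCount.getBuckets()) {
        wordList.add(bucket.getKeyAsString());
    }
    try {
        // One stop word per line; the file is assumed to be UTF-8 encoded.
        Set<String> stopWords = new HashSet<>(
                Files.readAllLines(Paths.get("stopwords.txt"), StandardCharsets.UTF_8));
        wordList.removeIf(stopWords::contains);
    } catch (IOException e) {
        // Fall back to the unfiltered list if the stop word file cannot be read.
        log.warn("Failed to read stop words", e);
    }
    return wordList;
}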
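To make the aggregation step more concrete, the sketch below is a rough in-memory analogue of what a terms aggregation computes: for each term, the number of documents that contain it, keeping only the most common terms. It is not Elasticsearch code. The class TermsAggregationSketch and its topTerms method are hypothetical names, and whitespace splitting is assumed as a stand-in for the real analyzer, which matters a great deal for Chinese text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory analogue of a terms aggregation; not Elasticsearch code.
class TermsAggregationSketch {

    // Counts, for each term, how many documents contain it, and keeps the `size` most common terms.
    static Map<String, Long> topTerms(List<String> documents, int size) {
        Map<String, Long> docCounts = new HashMap<>();
        for (String document : documents) {
            // Assumption: whitespace splitting stands in for the index analyzer.
            String[] tokens = document.trim().toLowerCase(Locale.ROOT).split("\\s+");
            // A set, so that a term repeated inside one document is counted once.
            Set<String> distinctTerms = new HashSet<>(Arrays.asList(tokens));
            for (String term : distinctTerms) {
                if (!term.isEmpty()) {
                    docCounts.merge(term, 1L, Long::sum);
                }
            }
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(docCounts.entrySet());
        // Highest count first; ties broken alphabetically so the result is deterministic.
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()));
        Map<String, Long> top = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : entries.subList(0, Math.min(size, entries.size()))) {
            top.put(entry.getKey(), entry.getValue());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList(
                "storm hits coast",
                "storm after storm",
                "coast guard on alert");
        System.out.println(topTerms(documents, 2)); // prints {coast=2, storm=2}
    }
}
```

In the sample data, "storm" occurs three times in total but in only two documents, so its count is 2. The same holds for the bucket counts of a terms aggregation: they are document counts, not occurrence counts.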
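The wordCloudCount method relies on ElasticSearch, accessed through the Spring Data elasticsearchTemplate, for both tokenization and aggregation. For the terms aggregation to return individual words, content must be an analyzed text field (on Elasticsearch 5.x and later, with fielddata enabled). Note that after tokenization the stop words listed in stopwords.txt have to be filtered out. The method then returns the remaining keywords, ordered from most to least frequent; it returns the terms only, not their counts.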
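A word cloud usually needs a weight for each term as well as the term itself. Below is a minimal sketch of stop word filtering that keeps the counts. StopWordFilter, loadStopWords and removeStopWords are hypothetical names that do not appear in the original code. The input map is assumed to hold each bucket key paired with its bucket.getDocCount() value, and a StringReader stands in for stopwords.txt.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original code.
class StopWordFilter {

    // Reads one stop word per line, trimming whitespace and skipping blank lines.
    static Set<String> loadStopWords(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.lines()
                    .map(String::trim)
                    .filter(word -> !word.isEmpty())
                    .collect(Collectors.toSet());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns a copy of termCounts without the stop words, preserving iteration order.
    static Map<String, Long> removeStopWords(Map<String, Long> termCounts, Set<String> stopWords) {
        Map<String, Long> filtered = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : termCounts.entrySet()) {
            if (!stopWords.contains(entry.getKey())) {
                filtered.put(entry.getKey(), entry.getValue());
            }
        }
        return filtered;
    }

    public static void main(String[] args) {
        // Stand-in for the aggregation buckets: term -> document count, most frequent first.
        Map<String, Long> termCounts = new LinkedHashMap<>();
        termCounts.put("the", 120L);
        termCounts.put("news", 45L);
        termCounts.put("a", 40L);
        termCounts.put("weibo", 32L);

        // Stand-in for the contents of stopwords.txt.
        Set<String> stopWords = loadStopWords(new StringReader("the\na\n"));

        System.out.println(removeStopWords(termCounts, stopWords)); // prints {news=45, weibo=32}
    }
}
```

Because stop words are removed after the aggregation, a request for 30 terms can yield fewer than 30 keywords. Requesting a larger size, or removing stop words at analysis time, are two possible ways around this.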