Python Web Scraping in Practice with the pyecharts Module: Visualizing Data from Leehom's (力宏) Comment Section
Preface
We'll use Python to scrape the data from Leehom's comment section and visualize it. Without further ado~
Let's get started~
Development tools
Python version: 3.6.4
Required modules:
requests module;
urllib3 module;
jieba module;
pyecharts module;
random module;
numpy module;
wordcloud module;
plus a few modules that ship with Python.
Environment setup
Install Python, add it to the PATH environment variable, and pip-install the required modules.
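For reference, the third-party modules listed above can all be installed in one go with pip (the package names match the import names):

pip install requests urllib3 jieba pyecharts numpy wordcloud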
Scraping the comment data with Python
import re
import requests
import urllib3

# verify=False below triggers an InsecureRequestWarning; silence it
urllib3.disable_warnings()

# Scrape one page of comments
def get_one_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3880.4 Safari/537.36',
        'Host': 'weibo.cn',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Cookie': 'your own Cookie',
        'DNT': '1',
        'Connection': 'keep-alive'
    }
    # Fetch the page HTML
    response = requests.get(url, headers=headers, verify=False)
    # Request succeeded
    if response.status_code == 200:
        # Return the HTML document for the parsing function
        return response.text
    return None

# Parse one page and save the comment text
def save_one_page(html):
    # Comment text on weibo.cn sits inside <span class="ctt"> tags
    comments = re.findall('<span class="ctt">(.*?)</span>', html)
    for comment in comments[1:]:
        # Strip any remaining HTML tags
        result = re.sub('<.*?>', '', comment)
        # Skip replies to other commenters
        if '回复@' not in result:
            with open('comments.txt', 'a+', encoding='utf-8') as fp:
                fp.write(result)
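A minimal driver sketch for wiring the two functions together. Note that the URL template and post id below are assumptions for illustration (weibo.cn paginates a post's comments with a page parameter); substitute the actual comment-page URL of the post you are scraping:

# Driver sketch -- the URL pattern and post id are assumed placeholders
if __name__ == '__main__':
    post_id = 'xxxxxxx'  # hypothetical placeholder for the Weibo post id
    for page in range(1, 11):  # scrape the first 10 pages of comments
        url = 'https://weibo.cn/comment/{}?page={}'.format(post_id, page)
        html = get_one_page(url)
        if html:
            save_one_page(html)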
TOP10 words
Code
import jieba
from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.globals import ThemeType

# Load the stop-word list
stop_words = []
with open('stop_words.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    for line in lines:
        stop_words.append(line.strip())
content = open('comments.txt', 'rb').read()
# Tokenize with jieba
word_list = jieba.cut(content)
words = []
for word in word_list:
    if word not in stop_words:
        words.append(word)
# Count word frequencies
wordcount = {}
for word in words:
    if word != ' ':
        wordcount[word] = wordcount.get(word, 0) + 1
# Keep the 10 most frequent words
wordtop = sorted(wordcount.items(), key=lambda x: x[1], reverse=True)[:10]
wx = []
wy = []
for w in wordtop:
    wx.append(w[0])
    wy.append(w[1])
# Horizontal bar chart of the TOP10 comment words
(
    Bar(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
    .add_xaxis(wx)
    .add_yaxis('数量', wy)
    .reversal_axis()
    .set_global_opts(
        title_opts=opts.TitleOpts(title='评论词 TOP10'),
        yaxis_opts=opts.AxisOpts(name='词语'),
        xaxis_opts=opts.AxisOpts(name='数量'),
    )
    .set_series_opts(label_opts=opts.LabelOpts(position='right'))
).render_notebook()
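render_notebook() only displays the chart inside a Jupyter notebook. When running as a plain script, build the same Bar object and call pyecharts' render() instead, which writes a standalone HTML file:

bar = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
    .add_xaxis(wx)
    .add_yaxis('数量', wy)
    .reversal_axis()
    .set_global_opts(
        title_opts=opts.TitleOpts(title='评论词 TOP10'),
        yaxis_opts=opts.AxisOpts(name='词语'),
        xaxis_opts=opts.AxisOpts(name='数量'),
    )
    .set_series_opts(label_opts=opts.LabelOpts(position='right'))
)
# render() writes the chart to an HTML file you can open in any browser
bar.render('top10.html')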
Result
[Figure: bar chart of the TOP10 comment words]
A word cloud of the comment section: main code
import jieba
import numpy as np
from PIL import Image
from wordcloud import WordCloud

def jieba_():
    # Load the stop-word list
    stop_words = []
    with open('stop_words.txt', 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            stop_words.append(line.strip())
    content = open('comments.txt', 'rb').read()
    # Tokenize with jieba
    word_list = jieba.cut(content)
    words = []
    for word in word_list:
        if word not in stop_words:
            words.append(word)
    global word_cloud
    # Join the words with commas
    word_cloud = ','.join(words)

def cloud():
    # Open the mask image for the word cloud
    cloud_mask = np.array(Image.open('bg.png'))
    # Configure the word cloud
    wc = WordCloud(
        # White background
        background_color='white',
        # Shape mask
        mask=cloud_mask,
        # Maximum number of words to show
        max_words=200,
        # A font that can render Chinese
        font_path='./fonts/simhei.ttf',
        # Maximum font size
        max_font_size=100
    )
    global word_cloud
    # Generate the word cloud
    x = wc.generate(word_cloud)
    # Render it as an image
    image = x.to_image()
    # Show the image
    image.show()
    # Save the image to disk
    wc.to_file('melon.png')
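Putting the two functions together is then just two calls (this assumes comments.txt, stop_words.txt, the bg.png mask, and the SimHei font are all in place):

# Usage: tokenize first, then draw and save the word cloud
jieba_()
cloud()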
The full source code for this article can be obtained via the profile on my homepage.
Result
[Figure: word cloud of the comment section]