Python Web Scraping in Practice: Visualizing Wang Leehom's Comment-Section Data with pyecharts

Preface: In this post we use Python to scrape the data from Wang Leehom's Weibo comment section and visualize it. Without further ado,
let's get started~
Development tools
Python version: 3.6.4
Required modules:
requests module;
urllib3 module;
jieba module;
pyecharts module;
random module;
numpy module;
wordcloud module;
plus a few modules from the Python standard library.
Environment setup: install Python, add it to your PATH, and pip install the required third-party modules.
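For reference, the third-party modules above can be installed in one command (random already ships with Python; pillow is listed explicitly because the word-cloud code imports PIL.Image):

pip install requests urllib3 jieba pyecharts numpy wordcloud pillow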
Scraping the comment data with Python

import re
import requests
import urllib3

# verify=False below disables TLS verification, so silence urllib3's warning
urllib3.disable_warnings()

# Crawl one page of comments
def get_one_page(url):
    headers = {
        'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3880.4 Safari/537.36',
        'Host': 'weibo.cn',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Cookie': '自己的Cookie',  # replace with your own Cookie
        'DNT': '1',
        'Connection': 'keep-alive'
    }
    # Fetch the page HTML
    response = requests.get(url, headers=headers, verify=False)
    # On success, return the HTML document to the parsing function
    if response.status_code == 200:
        return response.text
    return None

# Parse one page and save the comment text
def save_one_page(html):
    # The tag pattern was lost when this post was extracted; <span class="ctt">
    # is the usual comment container on weibo.cn and is assumed here
    comments = re.findall('<span class="ctt">(.*?)</span>', html)
    for comment in comments[1:]:
        # Strip any nested HTML tags
        result = re.sub('<.*?>', '', comment)
        # Skip "reply to @..." comments
        if '回复@' not in result:
            with open('comments.txt', 'a+', encoding='utf-8') as fp:
                fp.write(result + '\n')  # one comment per line
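To actually collect multiple pages, you would loop over the paginated comment URL. The sketch below is an assumption, since the original post does not show its driver code: the comment ID in the URL and the page range are placeholders.

import time
import random

# Hypothetical paginated comment URL for weibo.cn; substitute the
# comment ID of the target post for XXXXXXX
base_url = 'https://weibo.cn/comment/XXXXXXX?page={}'

for page in range(1, 51):  # page range is an assumption
    html = get_one_page(base_url.format(page))
    if html:
        save_one_page(html)
    time.sleep(random.uniform(1, 3))  # throttle to reduce the risk of being blocked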

Top-10 words
Code implementation
import jieba
from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.globals import ThemeType

# Load the stop-word list
stop_words = []
with open('stop_words.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    for line in lines:
        stop_words.append(line.strip())

content = open('comments.txt', 'rb').read()

# jieba word segmentation
word_list = jieba.cut(content)
words = []
for word in word_list:
    if word not in stop_words:
        words.append(word)

# Count word frequencies and keep the top 10
wordcount = {}
for word in words:
    if word != ' ':
        wordcount[word] = wordcount.get(word, 0) + 1
wordtop = sorted(wordcount.items(), key=lambda x: x[1], reverse=True)[:10]

wx = []
wy = []
for w in wordtop:
    wx.append(w[0])
    wy.append(w[1])

# Horizontal bar chart of the top-10 words
bar = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
    .add_xaxis(wx)
    .add_yaxis('数量', wy)
    .reversal_axis()
    .set_global_opts(
        title_opts=opts.TitleOpts(title='评论词 TOP10'),
        yaxis_opts=opts.AxisOpts(name='词语'),
        xaxis_opts=opts.AxisOpts(name='数量'),
    )
    .set_series_opts(label_opts=opts.LabelOpts(position='right'))
)
bar.render_notebook()
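A side note, in case you are not running inside Jupyter: render_notebook() only displays in a notebook, while pyecharts' render() method writes a standalone HTML file instead (the filename below is just an example):

bar.render('top10_words.html')  # writes an interactive HTML chart to disk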

Result
[Figure: horizontal bar chart of the top-10 comment words]

A word-cloud view of the comment section; the main code:
import jieba
import numpy as np
from PIL import Image
from wordcloud import WordCloud

def jieba_():
    # Load the stop-word list
    stop_words = []
    with open('stop_words.txt', 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            stop_words.append(line.strip())

    content = open('comments.txt', 'rb').read()

    # jieba word segmentation
    word_list = jieba.cut(content)
    words = []
    for word in word_list:
        if word not in stop_words:
            words.append(word)

    global word_cloud
    # Join the words with commas
    word_cloud = ','.join(words)

def cloud():
    # Open the word-cloud background image
    cloud_mask = np.array(Image.open('bg.png'))
    # Configure the word cloud
    wc = WordCloud(
        background_color='white',        # white background
        mask=cloud_mask,                 # background shape mask
        max_words=200,                   # maximum number of words shown
        font_path='./fonts/simhei.ttf',  # font that can render Chinese
        max_font_size=100                # maximum font size
    )
    global word_cloud
    # Generate the word cloud from the joined word string
    x = wc.generate(word_cloud)
    # Render it to an image
    image = x.to_image()
    # Display the image
    image.show()
    # Save the word cloud to disk
    wc.to_file('melon.png')
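For completeness, a minimal driver tying the two functions together; cloud() depends on the module-level word_cloud string that jieba_() populates, so the call order matters:

if __name__ == '__main__':
    jieba_()  # segment the comments and build the global word string
    cloud()   # generate, display, and save the word cloud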

The full source code for this article is available via the link in my profile page.
Output
[Figure: word cloud generated from the comments]
