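The task is to count the most frequent words in data extracted from a dynamic source.
First, build a web crawler with the requests module and the Beautiful Soup (beautifulsoup4) module. It extracts the data from a web page and stores it in a list. The list may contain unwanted words or symbols (such as special characters and blank spaces), which can be filtered out to simplify counting and give the desired result. After counting each word, we can also report the most common words (say, the top 10 or 20).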
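The filter-and-count steps described above can be tried without any network access. The following is a minimal sketch, assuming the text has already been extracted into a string; the names `SYMBOLS`, `top_words`, and `sample` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Characters treated as noise in this sketch (an assumption, not a fixed rule).
SYMBOLS = '!@#$%^&*()_-+={[}]|\\;:"<>?/.,'


def top_words(text, n=10):
    """Return the n most common cleaned, lowercased words in text."""
    # With a third argument, str.maketrans maps each listed character
    # to None, so translate() deletes it.
    table = str.maketrans('', '', SYMBOLS)
    words = (w.translate(table) for w in text.lower().split())
    # Drop tokens that consisted entirely of symbols.
    return Counter(w for w in words if w).most_common(n)


sample = "Python is fun. Is Python easy? Python -- rocks!"
print(top_words(sample, 2))  # [('python', 3), ('is', 2)]
```

Using `str.translate` removes all the unwanted characters in one pass per word, and `Counter.most_common(n)` covers both the counting and the "top 10 or 20" step.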
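Modules and library functions used:
requests: lets you send HTTP/1.1 requests and more.
beautifulsoup4: used to extract data from HTML and XML files.
operator: exports a set of efficient functions corresponding to Python's intrinsic operators.
collections: implements high-performance container datatypes.
Below is an implementation of the above idea: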
# Python3 program for a word frequency
# counter after crawling a web-page
import requests
from bs4 import BeautifulSoup
import operator
from collections import Counter


def start(url):
    '''Function defining the web-crawler/core
    spider, which will fetch information from
    a given website, and push the contents to
    the second function clean_wordlist()'''

    # empty list to store the contents of
    # the website fetched from our web-crawler
    wordlist = []
    source_code = requests.get(url).text

    # BeautifulSoup object which will parse
    # the HTML fetched from the requested url
    soup = BeautifulSoup(source_code, 'html.parser')

    # Text in given web-page is stored under
    # the <div> tags with class entry-content
    for each_text in soup.find_all('div', {'class': 'entry-content'}):
        content = each_text.text

        # use split() to break the sentence into
        # words and convert them into lowercase
        words = content.lower().split()

        for each_word in words:
            wordlist.append(each_word)

    # clean once, after every matching <div> has been read
    clean_wordlist(wordlist)


# Function removes any unwanted symbols
def clean_wordlist(wordlist):

    clean_list = []
    # the backslash is escaped so the string holds a literal '\'
    symbols = '!@#$%^&*()_-+={[}]|\\;:"<>?/., '
    for word in wordlist:

        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], '')

        if len(word) > 0:
            clean_list.append(word)
    create_dictionary(clean_list)


# Creates a dictionary containing each word's
# count and prints the 10 most common words
def create_dictionary(clean_list):
    word_count = {}

    for word in clean_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1

    ''' To get the count of each word in
        the crawled page -->

    # operator.itemgetter() takes one
    # parameter: 0 (denotes keys) or
    # 1 (denotes corresponding values)

    for key, value in sorted(word_count.items(),
                             key=operator.itemgetter(1)):
        print("%s : %s" % (key, value))

    <-- '''

    c = Counter(word_count)

    # returns the most occurring elements
    top = c.most_common(10)
    print(top)


# Driver code
if __name__ == '__main__':
    start("https://www.srcmini.org/programming-language-choose/")
Output:
[('to', 10), ('in', 7), ('is', 6), ('language', 6), ('the', 5), ('programming', 5), ('a', 5), ('c', 5), ('you', 5), ('of', 4)]
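Most of the top entries in that output are common function words ("to", "in", "is"), which are the kind of unwanted words mentioned earlier. One possible way to drop them is sketched below. The counts are copied from the output above; the `STOP_WORDS` set and the `without_stop_words` helper are hypothetical, and a real stop-word list would be much longer.

```python
from collections import Counter

# Counts copied from the sample output above.
counts = Counter({'to': 10, 'in': 7, 'is': 6, 'language': 6, 'the': 5,
                  'programming': 5, 'a': 5, 'c': 5, 'you': 5, 'of': 4})

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {'to', 'in', 'is', 'the', 'a', 'you', 'of'}


def without_stop_words(counter, stop_words):
    """Return a new Counter that omits every word found in stop_words."""
    return Counter({w: n for w, n in counter.items() if w not in stop_words})


filtered = without_stop_words(counts, STOP_WORDS)
print(filtered.most_common(3))
```

In the crawler above, the same check could instead be applied inside clean_wordlist(), so that stop words never reach the counting step.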