Crawl a Web Page with a Python Program and Find the Most Frequent Words

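The task is to extract data from a dynamic source and count the most frequent words in it.
First, create a web crawler with the help of the requests module and the Beautiful Soup module. It extracts the text from a web page and stores the words in a list. The list may contain unwanted words or symbols (for example special characters and blank spaces), which can be filtered out to simplify counting and give the desired result. After counting every word, we can also report just the most common ones (say the top 10 or 20).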
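The filter-then-count steps can be tried without any network access. The following is a minimal sketch rather than the article's implementation: the sample string and the helper name `count_words` are made up for illustration, and `string.punctuation` stands in for a hand-written symbol list.

```python
import string
from collections import Counter


def count_words(text, top_n=3):
    """Lowercase, split, strip punctuation, and count the words in text.

    Hypothetical helper, for illustration only.
    """
    # Translation table that deletes every ASCII punctuation character.
    strip_table = str.maketrans("", "", string.punctuation)
    cleaned = []
    for raw in text.lower().split():
        word = raw.translate(strip_table)
        if word:  # skip tokens that were nothing but symbols
            cleaned.append(word)
    return Counter(cleaned).most_common(top_n)


# Made-up sample text; "--" is dropped because it contains only symbols.
sample = "Python is fun! Is Python easy? Python is -- readable."
print(count_words(sample))
```

Lowercasing before counting is what lets "Is" and "is" land in the same bucket, and stripping symbols does the same for "fun!" and "fun".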
Modules and library functions used:

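requests: lets you send HTTP/1.1 requests.
beautifulsoup4: used to pull data out of HTML and XML files.
operator: exports a set of efficient functions corresponding to Python's intrinsic operators.
collections: implements high-performance container datatypes.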
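To make the roles of `operator` and `collections` concrete, here is a small sketch on a dictionary of counts. The `word_count` values are invented for illustration. `operator.itemgetter(1)` picks the count out of each `(word, count)` pair so that `sorted()` orders by frequency, while `Counter.most_common()` returns the top entries directly.

```python
import operator
from collections import Counter

# Invented counts, for illustration only.
word_count = {"language": 6, "to": 10, "of": 4, "in": 7}

# itemgetter(1) selects the value (the count) from each (key, value) pair,
# so this sorts the pairs by ascending frequency.
ascending = sorted(word_count.items(), key=operator.itemgetter(1))
print(ascending)

# Counter accepts an existing mapping; most_common(n) returns the n
# highest-count pairs, largest first.
top_two = Counter(word_count).most_common(2)
print(top_two)
```

When only the top few words are needed, `most_common()` is the shorter route; sorting with `itemgetter` is useful when the full ordered list is wanted.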
Below is an implementation of the above idea:
# Python 3 program for a word-frequency counter
# after crawling a web page
import requests
from bs4 import BeautifulSoup
import operator
from collections import Counter


def start(url):
    '''Web crawler / core spider: fetch the page at the given URL and
    pass its words to the second function, clean_wordlist().'''
    # empty list to store the contents of
    # the website fetched by our web crawler
    wordlist = []
    source_code = requests.get(url).text

    # BeautifulSoup object that parses the fetched HTML
    soup = BeautifulSoup(source_code, 'html.parser')

    # Text in the given web page is stored under
    # the <div> tags with class "entry-content"
    for each_text in soup.find_all('div', {'class': 'entry-content'}):
        content = each_text.text

        # use split() to break the text into
        # words and convert them into lowercase
        words = content.lower().split()

        for each_word in words:
            wordlist.append(each_word)
    clean_wordlist(wordlist)


# Function removes any unwanted symbols
def clean_wordlist(wordlist):
    clean_list = []
    symbols = '!@#$%^&*()_-+={[}]|\\;:"<>?/., '
    for word in wordlist:
        for symbol in symbols:
            word = word.replace(symbol, '')
        if len(word) > 0:
            clean_list.append(word)
    create_dictionary(clean_list)


# Creates a dictionary containing each word's
# count and prints the 10 most frequent words
def create_dictionary(clean_list):
    word_count = {}

    for word in clean_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1

    # To print the count of every word in the crawled page,
    # sort the items by count. operator.itemgetter() takes one
    # parameter: 0 selects the key (the word) and
    # 1 selects the value (its count).
    #
    # for key, value in sorted(word_count.items(),
    #                          key=operator.itemgetter(1)):
    #     print("%s : %s" % (key, value))

    c = Counter(word_count)

    # returns the most frequently occurring elements
    top = c.most_common(10)
    print(top)


# Driver code
if __name__ == '__main__':
    start("https://www.srcmini.org/programming-language-choose/")

Output:

[('to', 10), ('in', 7), ('is', 6), ('language', 6), ('the', 5), ('programming', 5), ('a', 5), ('c', 5), ('you', 5), ('of', 4)]
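The result above is dominated by function words such as 'to', 'in' and 'is'. If those count as unwanted words for your purpose, one option is to drop them before reporting the top entries. The sketch below works on the pairs printed above; the `STOP_WORDS` set and the `drop_stop_words` helper are assumptions for illustration and are not part of the original program.

```python
# Output of the program above, copied verbatim.
top = [('to', 10), ('in', 7), ('is', 6), ('language', 6), ('the', 5),
       ('programming', 5), ('a', 5), ('c', 5), ('you', 5), ('of', 4)]

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"to", "in", "is", "the", "a", "you", "of"}


def drop_stop_words(pairs, stop_words=STOP_WORDS):
    """Return the (word, count) pairs whose word is not a stop word."""
    return [(word, count) for word, count in pairs if word not in stop_words]


print(drop_stop_words(top))
```

In a full program it would be more natural to filter inside the cleaning step, before counting, so that the top 10 is filled with content words instead of being cut down afterwards.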
