Python web scraping: parsing data with BeautifulSoup

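BeautifulSoup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.
BeautifulSoup is one of the three parsing approaches commonly used in Python scrapers.
Let's start with an example: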
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.prettify())

The BeautifulSoup object can print the document with standard indentation:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  ...
Next, we use the example above to try out BeautifulSoup.
# soup.title: the first title tag; other tag names work the same way
print(soup.title)

# soup.title.name: the name of the title tag
print(soup.title.name)

# The class attribute of the first p tag
print(soup.p['class'])

# All attributes of the first p tag and their values, as a dict
print(soup.p.attrs)

Attributes can be modified and deleted in the same way as dictionary entries.
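As a minimal sketch of that dictionary-style editing, the snippet below changes, adds, and deletes attributes on a tag. The one-line document and the attribute values are made up for illustration, and 'html.parser' is used only so the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Illustrative one-tag document, not the full example above
soup = BeautifulSoup('<p class="title" id="intro">Hello</p>', 'html.parser')
p = soup.p

p['id'] = 'greeting'   # change an existing attribute
p['lang'] = 'en'       # add a new attribute
del p['class']         # delete an attribute

print(p.attrs)
print(p.get('class'))  # .get() returns None for a missing attribute
```

Using .get() avoids the KeyError that p['class'] would raise once the attribute is gone.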
# The text inside the title tag (.string here; .text and .get_text() also work)
print(soup.title.string)

# Replace the text of the title tag
soup.title.string.replace_with("No longer bold")
print(soup.title.string)

# The children of head, as a list
print(soup.head.contents)

# The first child of head
print(soup.head.contents[0])

# The children of head, as an iterator
print(soup.head.children)

Unlike .contents, this returns an iterator rather than a list, so loop over it to read the children.
# All descendants of head
print(soup.head.descendants)
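Printing .children or .descendants directly only shows an iterator or generator object. The sketch below, which assumes a small stand-alone head element and uses 'html.parser' so that it runs without lxml, shows how the three attributes differ once they are actually consumed.

```python
from bs4 import BeautifulSoup

# Illustrative fragment, not the full example above
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", 'html.parser')
head = soup.head

# .contents is a real list, so it can be indexed
print(head.contents[0].name)

# .children iterates over the same direct children
child_names = [child.name for child in head.children]

# .descendants also walks into nested nodes, including text strings
descendants = list(head.descendants)

print(child_names)
print(len(descendants))
```

Here head has one child (the title tag) but two descendants, because the string inside title is counted as well.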

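If a tag has more than one child, .string cannot tell which child's text to return, so it evaluates to None.
If a tag contains several strings, loop over .strings to get them all.
Those strings may include a lot of spaces or blank lines; .stripped_strings removes the extra whitespace.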
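The three attributes can be compared side by side. This is a small sketch on a cut-down paragraph (the markup and the name doc are illustrative), parsed with 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Illustrative paragraph with text and tags mixed together
doc = """<p class="story">Once upon a time there were three little sisters:
<a id="link1">Elsie</a>,
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a>.</p>"""
soup = BeautifulSoup(doc, 'html.parser')

# The p tag has several children, so .string is None
print(soup.p.string)

# A tag with a single string child returns that string
print(soup.a.string)

# .strings yields every string, whitespace included
raw = list(soup.p.strings)

# .stripped_strings trims each string and skips whitespace-only ones
clean = list(soup.p.stripped_strings)
print(clean)
```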
# The parent of the title tag
print(soup.title.parent)

# The first a tag
link = soup.a
# All ancestors of that a tag
for parent in link.parents:
    print(parent.name)

In the document tree, the .next_sibling and .previous_sibling attributes give the nodes immediately after and before a node at the same level.
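A short sketch of sibling navigation follows. Both fragments are made up for illustration and parsed with 'html.parser'. The second fragment shows a common surprise: when text sits between tags, the nearest sibling is a string, and find_next_sibling() is one way to skip ahead to the next tag.

```python
from bs4 import BeautifulSoup

# Tags with nothing between them: siblings are tags
soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>',
    'html.parser')
middle = soup.find(id='link2')
print(middle.previous_sibling['id'])
print(middle.next_sibling['id'])

# The first child has no previous sibling
first = soup.find(id='link1')
print(first.previous_sibling)

# Text between tags: the next sibling is the string ', '
soup2 = BeautifulSoup('<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>', 'html.parser')
first2 = soup2.find(id='link1')
print(repr(first2.next_sibling))
print(first2.find_next_sibling('a')['id'])
```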
find()
find(name, attrs, recursive, text, **kwargs)  # recursive: search all descendants (the default) rather than only direct children
# The first a tag
print(soup.find('a'))

# The first a tag whose text is Elsie
print(soup.find('a', text='Elsie'))

# The first tag whose name contains the letter a, matched with a regular expression
import re
print(soup.find(re.compile('a')))

# The tag with id='link3'
print(soup.find(id="link3"))

# The same lookup, passing the id through attrs
print(soup.find(attrs={'id':'link3'}))

class is a reserved word in Python, so it cannot be used directly as a keyword argument.
There are two ways around this:
First: soup.find(attrs={'class': 'haha'})
Second: soup.find(class_='haha')
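The two spellings are interchangeable. The sketch below checks that on a made-up two-paragraph fragment (parsed with 'html.parser') and also shows that a tag carrying several classes matches when any one of them is given.

```python
from bs4 import BeautifulSoup

# Illustrative fragment: the first p carries two classes
soup = BeautifulSoup('<p class="story intro">One</p><p class="story">Two</p>', 'html.parser')

by_attrs = soup.find(attrs={'class': 'story'})
by_keyword = soup.find(class_='story')

# Both calls return the same first match
print(by_attrs is by_keyword)

# class is multi-valued, so it comes back as a list
print(by_keyword['class'])
```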
# Search with a custom function
def search_one(tag):
    return tag.has_attr('id') and tag.get('id') == 'link3'

oder = soup.find(search_one)
print(oder)
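A function filter is most useful for conditions that keyword arguments cannot express. As a sketch, the hypothetical helper has_class_but_no_id below keeps tags that have a class attribute but no id; the fragment is illustrative and parsed with 'html.parser'.

```python
from bs4 import BeautifulSoup

def has_class_but_no_id(tag):
    """Hypothetical filter: True for tags with a class attribute and no id."""
    return tag.has_attr('class') and not tag.has_attr('id')

soup = BeautifulSoup(
    '<p class="title">The Dormouse\'s story</p>'
    '<a class="sister" id="link1">Elsie</a>'
    '<p>plain</p>',
    'html.parser')

# find_all() calls the function once per tag and keeps the tags it accepts
matches = soup.find_all(has_class_but_no_id)
print([tag.name for tag in matches])
```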

文章图片

find_all()
find_all(name, attrs, recursive, text, limit, **kwargs)
When you call find_all() on a tag, Beautiful Soup searches all of that tag's descendants. To search only its direct children, pass recursive=False.
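A small sketch of the difference, on a made-up fragment in which one p is a direct child of the div and another is nested one level deeper ('html.parser' is used so the sketch runs without lxml):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="outer"><p>direct</p><section><p>nested</p></section></div>',
    'html.parser')
outer = soup.find(id='outer')

# Default: every descendant is searched
all_p = [p.string for p in outer.find_all('p')]

# recursive=False: only direct children are searched
direct_p = [p.string for p in outer.find_all('p', recursive=False)]

print(all_p)
print(direct_p)
```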
# All a tags, returned as a list
print(soup.find_all('a'))
Result: the three a tags (Elsie, Lacie, Tillie).
# Only the first a tag; with limit=1 the result is still a list
print(soup.find_all('a', limit=1))

# All tags whose name contains the letter a, returned as a list
print(soup.find_all(re.compile('a')))
Result: the head tag and the three a tags.
# All a tags and b tags
print(soup.find_all(['a', 'b']))
Result: the b tag and the three a tags.
# All a tags whose class is sister (a string in the second position is matched against the class)
print(soup.find_all("a", "sister"))
Result: the three a tags (Elsie, Lacie, Tillie).
# All tags whose id is link1
print(soup.find_all(id='link1'))

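class is a reserved word in Python, so it cannot be used directly as a keyword argument.
There are two ways around this:
First: soup.find_all(attrs={'class': 'haha'})
Second: soup.find_all(class_='haha')
# All strings equal to Elsie, then all strings equal to Elsie or Lacie
print(soup.find_all(text='Elsie'))
print(soup.find_all(text=['Elsie', 'Lacie']))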
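Since Beautiful Soup 4.4.0 this argument is named string, and text is the older spelling of the same thing. The sketch below uses string= on a made-up fragment (parsed with 'html.parser') and shows that combining it with a tag name returns tags instead of strings.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a>Elsie</a><a>Lacie</a><a>Tillie</a>', 'html.parser')

# Without a tag name, the matching strings themselves are returned
exact = soup.find_all(string='Elsie')
either = soup.find_all(string=['Elsie', 'Lacie'])
pattern = soup.find_all(string=re.compile('ie$'))

# With a tag name, the tags whose .string matches are returned
tags = soup.find_all('a', string='Elsie')

print(exact, either, pattern, tags)
```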

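CSS selectors
Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS syntax.
Common selectors: tag selectors (a), class selectors (.), id selectors (#), and hierarchy selectors.
div .dudu #lala .meme .xixi    (descendant selector: matches at any depth below)
div > p > a > .lala    (child selector: matches one level down only)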
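The difference between the two hierarchy selectors can be sketched on a made-up fragment with one .lala link directly under p and another nested inside a span ('html.parser' is used so the sketch runs without lxml):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><p><a class="lala">child</a><span><a class="lala">deeper</a></span></p></div>',
    'html.parser')

# A space matches descendants at any depth
any_depth = [a.get_text() for a in soup.select('div .lala')]

# > matches direct children only
one_level = [a.get_text() for a in soup.select('p > .lala')]

print(any_depth)
print(one_level)
```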
# Find the title tag
print(soup.select('title'))

# All a tags anywhere under body
print(soup.select('body a'))
Result: the three a tags (Elsie, Lacie, Tillie).
Finding the direct children of a tag
# title directly under head
print(soup.select('head > title'))

# a tags directly under p
print(soup.select('p > a'))
Result: the three a tags (Elsie, Lacie, Tillie).
# The element with id link1 directly under p
print(soup.select('p > #link1'))

# All later siblings of #link1 that have class sister
print(soup.select('#link1 ~ .sister'))

# The sibling immediately after #link1, if it has class sister
print(soup.select('#link1 + .sister'))
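The two sibling combinators are easy to mix up. As a sketch on a cut-down version of the three links (parsed with 'html.parser'), ~ returns every later sibling that matches, while + returns only the one that follows immediately.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
    '<a class="sister" id="link3">Tillie</a></p>',
    'html.parser')

# ~ : all following siblings with class sister
later = [a['id'] for a in soup.select('#link1 ~ .sister')]

# + : only the directly adjacent sibling
adjacent = [a['id'] for a in soup.select('#link1 + .sister')]

print(later)
print(adjacent)
```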

# Find by class name
print(soup.select('.sister'))
print(soup.select("[class~=sister]"))
Result: the three a tags (Elsie, Lacie, Tillie).
# Find by id
print(soup.select('a#link1'))

# Find tags that have a given attribute
print(soup.select('a[id]'))
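Result: the three a tags (Elsie, Lacie, Tillie).
# Find by exact attribute value
print(soup.select('a[href="http://example.com/elsie"]'))
# The attribute value only needs to contain this substring
print(soup.select('a[href*="m/el"]'))
# The attribute value ends with this string
print(soup.select('a[href$="elsie"]'))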
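Two related tools round out the attribute selectors: ^= matches the start of a value, and select_one() returns the first match (or None) instead of a list. The sketch below uses made-up links, one of them on a different hypothetical domain, and 'html.parser'.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a href="https://example.org/tillie" id="link3">Tillie</a></p>',
    'html.parser')

# ^= : the attribute value starts with this string
prefix = [a['id'] for a in soup.select('a[href^="http://example.com/"]')]

# select_one() returns a single tag rather than a list
last = soup.select_one('a[href$="tillie"]')

# ... and None when nothing matches
missing = soup.select_one('a[href*="nobody"]')

print(prefix, last['id'], missing)
```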

Examples
1. Scrape Water Margin (水浒传) from shicimingju.com and save it locally. URL: http://www.shicimingju.com/book/shuihuzhuan.html
import os

import requests
from bs4 import BeautifulSoup


# Parse the table of contents
def main(url, headers):
    response = requests.get(url=url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')
    titles = soup.find_all(attrs={'class': 'book-mulu'})
    for one_title in titles:
        # Get the list of a tags, one per chapter
        mulu_lst = one_title.find_all('a')
        write_in(mulu_lst, headers)


# Write the chapters to files
def write_in(mulu, headers):
    # Create the output folder
    if not os.path.exists('shuihu'):
        os.mkdir('shuihu')
    # Parse each chapter
    for one_mulu in mulu:
        # Fetch the chapter page
        text = requests.get(url='http://www.shicimingju.com' + one_mulu['href'], headers=headers).text
        soup1 = BeautifulSoup(text, 'lxml')
        content = soup1.find_all(attrs={'class': 'chapter_content'})
        # Parse this div
        for one_content in content:
            # Strip the whitespace at both ends
            con = one_content.get_text().strip()
            # Replace the spaces between paragraphs with newlines
            # (the character inside the quotes was lost in the source; a plain space is assumed)
            con1 = con.replace(' ', '\n')
            # Write the chapter to a file named after it
            with open('shuihu/' + one_mulu.string + '.txt', 'w', encoding='utf-8') as f:
                f.write(con1)


if __name__ == '__main__':
    url = 'http://www.shicimingju.com/book/shuihuzhuan.html'
    ua_headers = {"User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) "
                                "Version/5.1 Safari/534.50"}
    main(url, ua_headers)
This produces the book as local text files, which you can copy to a phone to read.

The BeautifulSoup calls above could be simplified; I wrote them this way to practise find_all.
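One possible simplification, sketched offline: a single CSS selector can replace the find_all(attrs=...) call and the nested find_all('a'). The markup below is a hypothetical stand-in for the table of contents (the chapter paths and titles are invented), and 'html.parser' is used so the sketch runs without lxml or a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the table-of-contents structure
toc_html = """
<div class="book-mulu">
  <ul>
    <li><a href="/book/shuihuzhuan/1.html">Chapter 1</a></li>
    <li><a href="/book/shuihuzhuan/2.html">Chapter 2</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(toc_html, 'html.parser')

# '.book-mulu a' selects every link inside the table of contents in one step
chapters = [
    (a.get_text(strip=True), 'http://www.shicimingju.com' + a['href'])
    for a in soup.select('.book-mulu a')
]

for title, link in chapters:
    print(title, link)
```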
2. Scrape the news headlines and links from the Chouti front page: https://dig.chouti.com/
import requests
from bs4 import BeautifulSoup


def main():
    response = requests.get(url=url, headers=ua_headers).text
    soup = BeautifulSoup(response, 'lxml')
    # Each headline is an a tag with these two classes
    titles = soup.find_all('a', class_="show-content color-chag")
    for one_title in titles:
        print('Title: ' + one_title.text.strip() + ', link: ' + one_title['href'])


if __name__ == '__main__':
    url = 'https://dig.chouti.com/'
    ua_headers = {"User-Agent": 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'}
    main()
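This prints each headline together with its link.

3. Scrape second-hand housing listings for Shenzhen from 58.com and save them to an xls file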
https://sz.58.com/ershoufang/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d30000c-0000-4591-0324-370565eccba8&ClickID=1
import requests
import xlwt
from bs4 import BeautifulSoup


# Build the soup object
def main(url, headers):
    response = requests.get(url=url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')
    parse(soup)


# Parse the data and write it to an xls file
def parse(soup):
    # All li tags directly under the ul with class house-list-wrap
    li_list = soup.select('ul.house-list-wrap > li')
    # Create the xls workbook
    wookbook = xlwt.Workbook(encoding='utf-8')
    sheet = wookbook.add_sheet('58 sheet')
    title_lst = ['name', 'total_price', 'one_price', 'huxing', 'area', 'direction', 'height']
    # hang is the row index, lie is the column index
    hang, lie = 0, 0
    # Write the header into row 0
    for one in title_lst:
        sheet.write(0, lie, one)
        lie += 1
    # Write one row per listing
    hang += 1
    for one_li in li_list:
        sheet.write(hang, 0, one_li.select('h2 > a')[0].text.strip())
        sheet.write(hang, 1, one_li.select('div.price > p.sum')[0].text)
        sheet.write(hang, 2, one_li.select('div.price > p.unit')[0].text)
        sheet.write(hang, 3, one_li.select('div.list-info > p.baseinfo > span')[0].text)
        sheet.write(hang, 4, one_li.select('div.list-info > p.baseinfo > span')[1].text.strip())
        sheet.write(hang, 5, one_li.select('div.list-info > p.baseinfo > span')[2].text)
        sheet.write(hang, 6, one_li.select('div.list-info > p.baseinfo > span')[3].text)
        hang += 1
    wookbook.save('58深圳二手房子.xls')


if __name__ == '__main__':
    url = ('https://sz.58.com/ershoufang/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.'
           'BDPCPZ_BT&PGTID=0d30000c-0000-4591-0324-370565eccba8&ClickID=1')
    ua_headers = {"User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, "
                                "like Gecko) Version/5.1 Safari/534.50"}
    main(url, ua_headers)
This produces the xls file.
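Indexing a select() result with [0] raises IndexError as soon as one listing lacks a field. A more forgiving variant is sketched below with select_one() and a hypothetical helper, text_or_blank. The listing markup and values are invented, 'html.parser' replaces lxml, and the standard library's csv module stands in for xlwt so the sketch runs without extra installs.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical listings: the second one has no price block
listing_html = """
<ul class="house-list-wrap">
  <li><h2><a>Flat A</a></h2><div class="price"><p class="sum">300</p></div></li>
  <li><h2><a>Flat B</a></h2></li>
</ul>
"""

def text_or_blank(parent, selector):
    """Return the stripped text of the first match, or '' if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else ''

soup = BeautifulSoup(listing_html, 'html.parser')
rows = [
    [text_or_blank(li, 'h2 > a'), text_or_blank(li, 'div.price > p.sum')]
    for li in soup.select('ul.house-list-wrap > li')
]

# Write to an in-memory buffer; a real script would open a file instead
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['name', 'total_price'])
writer.writerows(rows)
print(buffer.getvalue())
```

A missing field becomes an empty cell, so one incomplete listing no longer aborts the whole run.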
