Python: crawling kuku comics with Selenium
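Before crawling this site I tried scraping comics from several others, but ran into many anti-crawler restrictions. Some sites append a dynamic parameter to the image URL that is refreshed every second, so an image link scraped one second is already invalid the next. Others keep the image address fixed but return 403 when requests come too frequently. I finally found a comic site without such restrictions, so here is a demonstration of a Selenium crawler.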

# -*- coding:utf-8 -*-
# crawl kuku漫画 (Python 2, PhantomJS driver)
__author__ = 'fengzhankui'
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import os
import urllib2
import chrom  # local helper module, not shown in this post; pcUserAgent appears to map names to 'Header-Name:value' strings


class getManhua(object):
    def __init__(self):
        self.num = 5  # upper bound of range(1, num), so pages 1..4 are fetched
        self.starturl = 'http://comic.kukudm.com/comiclist/2154/51850/1.htm'
        self.browser = self.getBrowser()
        self.getPic(self.browser)

    def getBrowser(self):
        # Start PhantomJS with a desktop Chrome user agent and open the first page.
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = (
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36")
        browser = webdriver.PhantomJS(desired_capabilities=dcap)
        try:
            browser.get(self.starturl)
        except Exception:
            print 'open url fail'
        browser.implicitly_wait(20)
        return browser

    def getPic(self, browser):
        # The page title up to the first '_' is used as the download directory name.
        cartoonTitle = browser.title.split('_')[0]
        self.createDir(cartoonTitle)
        os.chdir(cartoonTitle)
        for i in range(1, self.num):
            i = str(i)
            # The first <img> on the page is the comic image.
            imgurl = browser.find_element_by_tag_name('img').get_attribute('src')
            print imgurl
            with open('page' + i + '.jpg', 'wb') as fp:
                agent = chrom.pcUserAgent.get('Firefox 4.0.1 - Windows')
                request = urllib2.Request(imgurl)
                # Split only at the first colon: header name, then header value.
                name, value = agent.split(':', 1)
                request.add_header(name, value)
                response = urllib2.urlopen(request)
                fp.write(response.read())
            print 'page' + i + 'success'
            # The last <a> on the page links to the next page.
            NextTag = browser.find_elements_by_tag_name('a')[-1].get_attribute('href')
            browser.get(NextTag)
            browser.implicitly_wait(20)

    def createDir(self, cartoonTitle):
        if os.path.exists(cartoonTitle):
            print 'exists'
        else:
            os.mkdir(cartoonTitle)


if __name__ == '__main__':
    getManhua()

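
The listing above is written for Python 2 and the Selenium API of its time. As a rough sketch only, the same two steps (walking the pages, then downloading each image with a custom header) might look like this in Python 3. The function names `split_header`, `build_request`, `collect_image_urls` and `download` are made up for this illustration and are not from the original post. The driver is passed in as a parameter, and the header is assumed to be a single `Name: value` string, which is what the `split(':', 1)` call in the listing suggests.

```python
import urllib.request

# Locator strategy string; same value as selenium's By.TAG_NAME.
TAG_NAME = "tag name"


def split_header(line):
    """Split 'Name: value' into (name, value) at the first colon only."""
    name, value = line.split(":", 1)
    return name.strip(), value.strip()


def build_request(img_url, header_line):
    """Build an image request that carries the given header."""
    name, value = split_header(header_line)
    request = urllib.request.Request(img_url)
    request.add_header(name, value)
    return request


def collect_image_urls(driver, start_url, pages):
    """Visit `pages` pages and return the src of the first <img> on each."""
    urls = []
    driver.get(start_url)
    for _ in range(pages):
        urls.append(driver.find_element(TAG_NAME, "img").get_attribute("src"))
        # As in the listing, the last <a> on the page is taken as "next page".
        next_url = driver.find_elements(TAG_NAME, "a")[-1].get_attribute("href")
        driver.get(next_url)
    return urls


def download(img_url, header_line, path):
    """Fetch one image and save it to `path` (hypothetical helper)."""
    request = build_request(img_url, header_line)
    with urllib.request.urlopen(request) as resp, open(path, "wb") as fp:
        fp.write(resp.read())
```

With Selenium 4 the `driver` argument could be something like `webdriver.Chrome()`, since PhantomJS support and the `find_element_by_tag_name` style methods are no longer available there. Splitting at the first colon only matters because user-agent values can themselves contain colons.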
The figure shows the script running.
This article is reposted from 无心低语's 51CTO blog. Original link: http://blog.51cto.com/fengzhankui/1946775. To repost it, please contact the original author.