Python | Scraping kuku manhua (kuku漫画) with Python and Selenium

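Before scraping this site I tried scraping comics from other websites, but ran into many anti-crawler restrictions. Some sites append dynamic parameters to the image URL that change every second, so an image link fetched in one second is already invalid in the next. Others keep the image address fixed but return 403 when requests come too frequently. I finally found a comic site without such restrictions, so here is a demonstration of a Selenium crawler.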

    # -*- coding:utf-8 -*-
    # crawl kuku漫画 (Python 2 script: uses urllib2 and the print statement)
    __author__ = 'fengzhankui'
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    import os
    import urllib2
    # chrom is the author's own helper module (not shown here); pcUserAgent
    # appears to map browser names to "Header-Name:value" strings
    import chrom


    class getManhua(object):
        def __init__(self):
            self.num = 5  # number of pages to download
            self.starturl = 'http://comic.kukudm.com/comiclist/2154/51850/1.htm'
            self.browser = self.getBrowser()
            self.getPic(self.browser)

        def getBrowser(self):
            # Make PhantomJS report a desktop Chrome User-Agent
            dcap = dict(DesiredCapabilities.PHANTOMJS)
            dcap["phantomjs.page.settings.userAgent"] = (
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                "(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36")
            browser = webdriver.PhantomJS(desired_capabilities=dcap)
            try:
                browser.get(self.starturl)
            except Exception:
                print 'open url fail'
            # Element lookups wait up to 20 seconds before giving up
            browser.implicitly_wait(20)
            return browser

        def getPic(self, browser):
            # Use the part of the page title before the first underscore as the folder name
            cartoonTitle = browser.title.split('_')[0]
            self.createDir(cartoonTitle)
            os.chdir(cartoonTitle)
            # range(1, self.num) would stop one page short, so go up to num inclusive
            for i in range(1, self.num + 1):
                i = str(i)
                imgurl = browser.find_element_by_tag_name('img').get_attribute('src')
                print imgurl
                with open('page' + i + '.jpg', 'wb') as fp:
                    agent = chrom.pcUserAgent.get('Firefox 4.0.1 - Windows')
                    # Split once on the first colon: header name, then header value
                    # (the original passed the name for both arguments)
                    headerName, headerValue = agent.split(':', 1)
                    request = urllib2.Request(imgurl)
                    request.add_header(headerName, headerValue.strip())
                    response = urllib2.urlopen(request)
                    fp.write(response.read())
                    print 'page' + i + ' success'
                # The script treats the last <a> on the page as the next-page link
                NextTag = browser.find_elements_by_tag_name('a')[-1].get_attribute('href')
                browser.get(NextTag)
                browser.implicitly_wait(20)

        def createDir(self, cartoonTitle):
            if os.path.exists(cartoonTitle):
                print 'exists'
            else:
                os.mkdir(cartoonTitle)


    if __name__ == '__main__':
        getManhua()

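The folder handling in `getPic` and `createDir` could also be sketched in Python 3 without changing the working directory. This is only an illustration: the function names `comic_dir_from_title` and `page_path` and the sample title are hypothetical, and only the `split('_')[0]` rule and the `page<N>.jpg` naming come from the script above.

```python
import os
import tempfile


def comic_dir_from_title(page_title, base_dir):
    """Create (if needed) and return a folder named after the comic title."""
    # The script keeps the text before the first underscore in the page title.
    cartoon_title = page_title.split("_")[0]
    target = os.path.join(base_dir, cartoon_title)
    # exist_ok=True makes the call safe when the folder already exists.
    os.makedirs(target, exist_ok=True)
    return target


def page_path(comic_dir, page_number):
    """Return the file path used for one downloaded page."""
    return os.path.join(comic_dir, "page{}.jpg".format(page_number))


# Hypothetical page title, saved under a temporary directory for the demo.
base = tempfile.mkdtemp()
comic_dir = comic_dir_from_title("SomeComic_Chapter1_kuku", base)
paths = [page_path(comic_dir, n) for n in range(1, 6)]  # pages 1 to 5
print(paths[0])
```

Building full paths with `os.path.join` instead of calling `os.chdir` keeps the process's working directory unchanged, which may matter if the crawler is later embedded in a larger program.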
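To cope with anti-crawler mechanisms, I added request headers to both selenium and urllib2, since websites generally weed out crawlers by filtering requests. This script only crawls 5 pages starting from the start URL. To allow for image loading and network latency, an implicit wait of up to 20 seconds is set. The first run takes a little longer to get going, so some patience is needed.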
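As a rough sketch of the header step described above, here is how the same idea might look in Python 3, where `urllib2` has become `urllib.request`. The helper name `build_image_request`, the example header string, and the example URL are assumptions for illustration. The author's `chrom` module is not shown, so the `"Name:value"` format is inferred from how the script splits the string on the first colon.

```python
import urllib.request


def build_image_request(image_url, header_line):
    """Return a Request for image_url carrying the header in header_line.

    header_line is assumed to look like "User-Agent:Mozilla/5.0 ...".
    """
    # Split on the first colon only, since header values may contain colons.
    name, value = header_line.split(":", 1)
    request = urllib.request.Request(image_url)
    request.add_header(name.strip(), value.strip())
    return request


# Hypothetical header string and image URL, used only for illustration.
example_header = (
    "User-Agent:Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) "
    "Gecko/20100101 Firefox/4.0.1"
)
request = build_image_request("http://example.com/page1.jpg", example_header)

# urllib stores header names via str.capitalize(), so look it up as "User-agent".
print(request.get_header("User-agent"))
```

Passing the request to `urllib.request.urlopen` and writing `response.read()` to a file opened in `'wb'` mode would complete the download, as the original script does with `urllib2`.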
The run is shown in the figure below.










This article is reposted from the 51CTO blog of 无心低语. Original link: http://blog.51cto.com/fengzhankui/1946775. To repost it, please contact the original author.
