Python | Scrapy: two ways to crawl a novel from a website

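Method 1: crawling with Spider
First pick a starting chapter link, for example the link to the novel's first chapter: https://www.zwdu.com/book/11029/2297440.html
This is the site I crawled, and the novel is the one at that link.
First, create a project:
scrapy startproject novel
Create the spider:
scrapy genspider spider https://www.zwdu.com/book/11029/2297440.html
Next, analyze the site. The chapter URLs on this site follow no regular pattern, so I chose to reach each chapter through the "next page" link of the one before it. On the last chapter, the "next page" link does not contain .html, so an if statement on that can end the crawl.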
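This is where the next-chapter link is located. It can be extracted with XPath.

[Image: location of the next-chapter link on a chapter page]

This is the last chapter. Its link does not include .html, which can be used as the condition for stopping the crawl.

[Image: the next-page link on the last chapter]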
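The stop condition can be tried out on its own, without running a crawl. The following is a minimal sketch using only the standard library: `next_chapter_url` is a hypothetical helper name, and the sample hrefs are invented for illustration, assuming the site uses relative links of this shape.

```python
from urllib.parse import urljoin


def next_chapter_url(page_url, next_href):
    """Return the absolute URL of the next chapter, or None when the crawl should stop.

    next_href is the href read from the "next page" link. On the last
    chapter it is assumed not to contain '.html'.
    """
    if not next_href or '.html' not in next_href:
        return None
    # Resolve a relative href against the page it was found on.
    return urljoin(page_url, next_href)


if __name__ == '__main__':
    page = 'https://www.zwdu.com/book/11029/2297440.html'
    # A normal chapter: the link points at another .html page.
    print(next_chapter_url(page, '/book/11029/2297441.html'))
    # The last chapter: the link has no .html, so nothing is returned.
    print(next_chapter_url(page, '/book/11029/'))
```

Resolving the href against the current page URL avoids hard-coding the site's domain, and checking for a missing href keeps the function from failing on a page that has no such link.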

Here is the source code I used.
spider.py

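# -*- coding: utf-8 -*-
from scrapy import Request, Spider


class SpiderSpider(Spider):
    name = 'spider'
    # allowed_domains = ['www.zwdu.com']
    start_urls = ['https://www.zwdu.com/book/26215/8210673.html']

    def parse(self, response):
        # Chapter title and body text
        title = response.xpath('//h1/text()').extract_first()
        # NOTE: the search string was lost from the original listing (it read replace('', '\n')).
        # '\xa0' (non-breaking space) is an assumption; adjust it to the page's actual indent characters.
        content = ''.join(response.xpath('//div[@id="content"]/text()').extract()).replace('\xa0', '\n')
        yield {
            'title': title,
            'content': content,
        }
        # "Next page" link; on the last chapter it does not contain .html
        next_url = response.xpath('//div[@class="bottem2"]/a[3]/@href').extract_first()
        if next_url and '.html' in next_url:
            # urljoin resolves the href against the current page
            # (the original prefixed a hard-coded 'https://www.81zw.com')
            yield Request(response.urljoin(next_url), callback=self.parse)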

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class XiaoshuoPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.file = open('xiuzhen.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every scraped chapter: append title and text
        title = item['title']
        content = item['content']
        info = title + '\n' + content + '\n'
        self.file.write(info)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file
        self.file.close()
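To see the order in which the three pipeline hooks run, here is a small stand-alone sketch that calls them by hand instead of through Scrapy. `TextFilePipeline` and `run_pipeline` are hypothetical names, the output path is a temporary file, and the two sample chapters are made up.

```python
import os
import tempfile


class TextFilePipeline:
    """Stand-in with the same three hooks as the pipeline above (hypothetical name)."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(item['title'] + '\n' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()


def run_pipeline(items, path):
    """Call the hooks in order: open once, process each item, close once."""
    pipeline = TextFilePipeline(path)
    pipeline.open_spider(spider=None)
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    # Read the result back so the caller can inspect what was written.
    with open(path, encoding='utf-8') as f:
        return f.read()


if __name__ == '__main__':
    path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
    chapters = [
        {'title': 'Chapter 1', 'content': 'first text'},
        {'title': 'Chapter 2', 'content': 'second text'},
    ]
    print(run_pipeline(chapters, path))
```

Because the file is opened once in `open_spider` and written in arrival order, the chapters end up in the file in the order the items are yielded.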

Then set USER_AGENT and ITEM_PIPELINES in settings.py, and run scrapy crawl spider to crawl the novel.
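Method 2: CrawlSpider
Create the spider file:
scrapy genspider -t crawl example example.com
CrawlSpider uses Rule objects to extract links automatically according to the rules you define, and then crawls them.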

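LinkExtractor() has two commonly used parameters:
allow: select links whose URL matches a regular expression
restrict_xpaths: select links located inside the page regions matched by an XPath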
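To get a feel for what an `allow` pattern would select, the sketch below applies a regular expression to a few URLs with Python's `re` module. It does not call Scrapy: `filter_links` is a hypothetical helper, the pattern is an illustrative assumption, and using `re.search` (a match anywhere in the URL) is an assumption about the matching semantics.

```python
import re


def filter_links(urls, allow):
    """Keep the URLs in which the `allow` pattern is found (hypothetical helper)."""
    pattern = re.compile(allow)
    return [url for url in urls if pattern.search(url)]


if __name__ == '__main__':
    urls = [
        'https://www.zwdu.com/book/11029/',
        'https://www.zwdu.com/book/11029/2297440.html',
        'https://www.zwdu.com/book/26215/8210673.html',
    ]
    # Chapter pages of book 11029 only: digits followed by .html at the end
    print(filter_links(urls, r'/book/11029/\d+\.html$'))
```

A pattern like this would be an alternative to restrict_xpaths when chapter URLs share a recognizable shape. On this site the next-chapter link is easier to pin down by its position on the page, which is why the spider below uses restrict_xpaths.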
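CrawlSpider does not pass the start_urls responses to the rule callback; it only extracts links from them. The crawl therefore has to start from the chapter index page. The first Rule picks up the first chapter, and the second Rule follows the next-chapter link on each page after that. Parsing is the same as in method 1, and the pipeline is unchanged.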
spider2.py

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ZwwSpider(CrawlSpider):
    name = 'spider2'
    # allowed_domains = ['www.zwdu.com']
    start_urls = ['https://www.zwdu.com/book/11029/']

    rules = (
        # On the chapter index: follow the link to the first chapter
        Rule(LinkExtractor(restrict_xpaths=r'//*[@id="list"]/dl/dd[1]/a'), callback='parse_item', follow=True),
        # On each chapter page: follow the next-chapter link
        Rule(LinkExtractor(restrict_xpaths=r'//div[@class="bottem1"]/a[3]'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        title = response.xpath('//h1/text()').extract_first()
        # NOTE: the search string was lost from the original listing (it read replace('', '\n')).
        # '\xa0' (non-breaking space) is an assumption; adjust it to the page's actual indent characters.
        content = ''.join(response.xpath('//div[@id="content"]/text()').extract()).replace('\xa0', '\n')
        yield {
            'title': title,
            'content': content,
        }

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for xiaoshuo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'xiaoshuo'

SPIDER_MODULES = ['xiaoshuo.spiders']
NEWSPIDER_MODULE = 'xiaoshuo.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    'Accept': 'text/html,application/xhtml+xml,application/xml; q=0.9,*/*; q=0.8',
#    'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'xiaoshuo.middlewares.XiaoshuoSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'xiaoshuo.middlewares.XiaoshuoDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'xiaoshuo.pipelines.XiaoshuoPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Those are the two ways to crawl a novel from a website. If anything here is wrong, corrections are welcome.